[jira] [Commented] (SPARK-37446) hive-2.3.9 related API use invoke method

2021-11-22 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447812#comment-17447812
 ] 

angerszhu commented on SPARK-37446:
---

Seems this has been fixed; will close this.

> hive-2.3.9 related API use invoke method
> 
>
> Key: SPARK-37446
> URL: https://issues.apache.org/jira/browse/SPARK-37446
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> There is a condition where, when the Hive version is >= 2.3.9, it calls
> {code}
> case (2, 3, v) if v >= 9 => Hive.getWithoutRegisterFns(conf)
> {code}
> This means getWithoutRegisterFns from Hive 2.3.9 is called directly, but if we 
> build against Hive 2.3.8 or a lower version, the build fails; we should call the 
> method via reflection (invoke) here.






[jira] [Resolved] (SPARK-37446) hive-2.3.9 related API use invoke method

2021-11-22 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-37446.
---
Resolution: Duplicate

> hive-2.3.9 related API use invoke method
> 
>
> Key: SPARK-37446
> URL: https://issues.apache.org/jira/browse/SPARK-37446
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> There is a condition where, when the Hive version is >= 2.3.9, it calls
> {code}
> case (2, 3, v) if v >= 9 => Hive.getWithoutRegisterFns(conf)
> {code}
> This means getWithoutRegisterFns from Hive 2.3.9 is called directly, but if we 
> build against Hive 2.3.8 or a lower version, the build fails; we should call the 
> method via reflection (invoke) here.






[jira] [Created] (SPARK-37446) hive-2.3.9 related API use invoke method

2021-11-22 Thread angerszhu (Jira)
angerszhu created SPARK-37446:
-

 Summary: hive-2.3.9 related API use invoke method
 Key: SPARK-37446
 URL: https://issues.apache.org/jira/browse/SPARK-37446
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


There is a condition where, when the Hive version is >= 2.3.9, it calls
{code}
case (2, 3, v) if v >= 9 => Hive.getWithoutRegisterFns(conf)
{code}

This means getWithoutRegisterFns from Hive 2.3.9 is called directly, but if we 
build against Hive 2.3.8 or a lower version, the build fails; we should call the 
method via reflection (invoke) here.






[jira] [Commented] (SPARK-37446) hive-2.3.9 related API use invoke method

2021-11-22 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447811#comment-17447811
 ] 

angerszhu commented on SPARK-37446:
---

Will raise a PR soon.

> hive-2.3.9 related API use invoke method
> 
>
> Key: SPARK-37446
> URL: https://issues.apache.org/jira/browse/SPARK-37446
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> There is a condition where, when the Hive version is >= 2.3.9, it calls
> {code}
> case (2, 3, v) if v >= 9 => Hive.getWithoutRegisterFns(conf)
> {code}
> This means getWithoutRegisterFns from Hive 2.3.9 is called directly, but if we 
> build against Hive 2.3.8 or a lower version, the build fails; we should call the 
> method via reflection (invoke) here.






[jira] [Commented] (SPARK-37445) Update hadoop-profile

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447805#comment-17447805
 ] 

Apache Spark commented on SPARK-37445:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34689

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.






[jira] [Assigned] (SPARK-37445) Update hadoop-profile

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37445:


Assignee: (was: Apache Spark)

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.






[jira] [Commented] (SPARK-37445) Update hadoop-profile

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447804#comment-17447804
 ] 

Apache Spark commented on SPARK-37445:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34689

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.






[jira] [Assigned] (SPARK-37445) Update hadoop-profile

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37445:


Assignee: Apache Spark

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.






[jira] [Resolved] (SPARK-37438) ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-37438.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34681
[https://github.com/apache/spark/pull/34681]

> ANSI mode: Use store assignment rules for resolving function invocation
> ---
>
> Key: SPARK-37438
> URL: https://issues.apache.org/jira/browse/SPARK-37438
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Under ANSI mode (spark.sql.ansi.enabled=true), function invocation in 
> Spark SQL works as follows:
> - In general, it follows the `Store assignment` rules, storing the input 
> values as the declared parameter types of the SQL functions.
> - Special rules apply for string literals and untyped NULL. A NULL can be 
> promoted to any other type, while a string literal can be promoted to any 
> simple data type.
> {code:sql}
> > SET spark.sql.ansi.enabled=true;
> -- implicitly cast Int to String type
> > SELECT concat('total number: ', 1);
> total number: 1
> -- implicitly cast Timestamp to Date type
> > select datediff(now(), current_date);
> 0
> -- special rule: implicitly cast String literal to Double type
> > SELECT ceil('0.1');
> 1
> -- special rule: implicitly cast NULL to Date type
> > SELECT year(null);
> NULL
> > CREATE TABLE t(s string);
> -- Can't store String column as Numeric types.
> > SELECT ceil(s) from t;
> Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
> type mismatch
> -- Can't store String column as Date type.
> > select year(s) from t;
> Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
> type mismatch
> {code}






[jira] [Commented] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447785#comment-17447785
 ] 

Apache Spark commented on SPARK-32079:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34688

> PySpark <> Beam pickling issues for collections.namedtuple
> --
>
> Key: SPARK-32079
> URL: https://issues.apache.org/jira/browse/SPARK-32079
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Gerard Casas Saez
>Priority: Major
>
> PySpark's monkey-patching of namedtuple makes it difficult/impossible to unpickle 
> collections.namedtuple instances from outside of a PySpark environment.
>  
> When PySpark has been loaded into the environment, any time that you try to 
> pickle a namedtuple, you are only able to unpickle it from an environment 
> where the 
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
>  has been applied. 
> This conflicts directly with using Beam from a non-Spark environment 
> (namely Flink or Dataflow), making it impossible to use the pipeline if it 
> has a namedtuple loaded somewhere. 
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> Second pickled object can only be used from an environment with PySpark. 






[jira] [Assigned] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32079:


Assignee: (was: Apache Spark)

> PySpark <> Beam pickling issues for collections.namedtuple
> --
>
> Key: SPARK-32079
> URL: https://issues.apache.org/jira/browse/SPARK-32079
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Gerard Casas Saez
>Priority: Major
>
> PySpark's monkey-patching of namedtuple makes it difficult/impossible to unpickle 
> collections.namedtuple instances from outside of a PySpark environment.
>  
> When PySpark has been loaded into the environment, any time that you try to 
> pickle a namedtuple, you are only able to unpickle it from an environment 
> where the 
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
>  has been applied. 
> This conflicts directly with using Beam from a non-Spark environment 
> (namely Flink or Dataflow), making it impossible to use the pipeline if it 
> has a namedtuple loaded somewhere. 
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> Second pickled object can only be used from an environment with PySpark. 






[jira] [Commented] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447784#comment-17447784
 ] 

Apache Spark commented on SPARK-32079:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34688

> PySpark <> Beam pickling issues for collections.namedtuple
> --
>
> Key: SPARK-32079
> URL: https://issues.apache.org/jira/browse/SPARK-32079
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Gerard Casas Saez
>Priority: Major
>
> PySpark's monkey-patching of namedtuple makes it difficult/impossible to unpickle 
> collections.namedtuple instances from outside of a PySpark environment.
>  
> When PySpark has been loaded into the environment, any time that you try to 
> pickle a namedtuple, you are only able to unpickle it from an environment 
> where the 
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
>  has been applied. 
> This conflicts directly with using Beam from a non-Spark environment 
> (namely Flink or Dataflow), making it impossible to use the pipeline if it 
> has a namedtuple loaded somewhere. 
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> Second pickled object can only be used from an environment with PySpark. 






[jira] [Assigned] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32079:


Assignee: Apache Spark

> PySpark <> Beam pickling issues for collections.namedtuple
> --
>
> Key: SPARK-32079
> URL: https://issues.apache.org/jira/browse/SPARK-32079
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Gerard Casas Saez
>Assignee: Apache Spark
>Priority: Major
>
> PySpark's monkey-patching of namedtuple makes it difficult/impossible to unpickle 
> collections.namedtuple instances from outside of a PySpark environment.
>  
> When PySpark has been loaded into the environment, any time that you try to 
> pickle a namedtuple, you are only able to unpickle it from an environment 
> where the 
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
>  has been applied. 
> This conflicts directly with using Beam from a non-Spark environment 
> (namely Flink or Dataflow), making it impossible to use the pipeline if it 
> has a namedtuple loaded somewhere. 
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> Second pickled object can only be used from an environment with PySpark. 






[jira] [Created] (SPARK-37445) Update hadoop-profile

2021-11-22 Thread angerszhu (Jira)
angerszhu created SPARK-37445:
-

 Summary: Update hadoop-profile
 Key: SPARK-37445
 URL: https://issues.apache.org/jira/browse/SPARK-37445
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 3.2.0
Reporter: angerszhu


The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.






[jira] [Commented] (SPARK-37445) Update hadoop-profile

2021-11-22 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447752#comment-17447752
 ] 

angerszhu commented on SPARK-37445:
---

Will raise a PR soon.

> Update hadoop-profile
> -
>
> Key: SPARK-37445
> URL: https://issues.apache.org/jira/browse/SPARK-37445
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.






[jira] [Resolved] (SPARK-27810) PySpark breaks Cloudpickle serialization of collections.namedtuple objects

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27810.
--
Resolution: Duplicate

> PySpark breaks Cloudpickle serialization of collections.namedtuple objects
> --
>
> Key: SPARK-27810
> URL: https://issues.apache.org/jira/browse/SPARK-27810
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Travis Addair
>Priority: Major
>
> After importing pyspark, cloudpickle is no longer able to properly serialize 
> objects inheriting from collections.namedtuple, and drops all other class 
> data such that calls to isinstance will fail.
> Here's a minimal reproduction of the issue:
> {{import collections}}
>  {{import cloudpickle}}
>  {{import pyspark}}
> {{class A(object):}}
>  {{    pass}}
> {{class B(object):}}
>  {{    pass}}
> {{class C(A, B, collections.namedtuple('C', ['field'])):}}
>  {{    pass}}
> {{c = C(1)}}
> {{def print_bases(obj):}}
>  {{    bases = obj.__class__.__bases__}}
>  {{    for base in bases:}}
>  {{        print(base)}}
> {{print('original objects:')}}
>  {{print_bases(c)}}
> {{print('\ncloudpickled objects:')}}
>  {{c2 = cloudpickle.loads(cloudpickle.dumps(c))}}
>  {{print_bases(c2)}}
> This prints:
> {{original objects:}}
> {{}}
> {{}}
> {{}}
> {{cloudpickled objects:}}
> {{}}
> Effectively dropping all other types.  It appears this issue is being caused 
> by the 
> [_hijack_namedtuple|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L600]
>  function, which replaces the namedtuple class with another one.
> Note that I can work around this issue by setting 
> {{collections.namedtuple.__hijack = 1}} before importing pyspark, so I feel 
> pretty confident this is what's causing the issue.
> This issue comes up when working with [TensorFlow feature 
> columns|https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/feature_column.py],
>  which derive from collections.namedtuple among other classes.
> Cloudpickle also 
> [supports|https://github.com/cloudpipe/cloudpickle/blob/3f4d9da8c567c8e0363880b760b789b40563f5c3/cloudpickle/cloudpickle.py#L900]
>  collections.namedtuple serialization, but doesn't appear to need to replace 
> the class.  Possibly PySpark can do something similar?
>  
>  
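As a note on the workaround mentioned in the description above, here is a minimal sketch of what "setting {{collections.namedtuple.__hijack = 1}} before importing pyspark" looks like in practice. The {{__hijack}} flag is a private PySpark implementation detail, so this is illustrative only, not a supported API.

{code:python}
import collections

# PySpark's serializers module skips its namedtuple monkey-patch when this
# attribute is already present, so setting it up front leaves namedtuple alone.
# NOTE: this relies on a private implementation detail and may break across versions.
collections.namedtuple.__hijack = 1

import pyspark  # namedtuple is now left untouched by the import

Point = collections.namedtuple("Point", ["x", "y"])
# Instances of Point now pickle with the stock namedtuple machinery and can be
# unpickled in environments that do not have PySpark installed.
{code}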






[jira] [Resolved] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22674.
--
Resolution: Duplicate

> PySpark breaks serialization of namedtuple subclasses
> -
>
> Key: SPARK-22674
> URL: https://issues.apache.org/jira/browse/SPARK-22674
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.3.0, 3.1.1
>Reporter: Jonas Amrich
>Priority: Major
>
> PySpark monkey-patches the namedtuple class to make it serializable; however, 
> this breaks serialization of its subclasses. With the current implementation, any 
> subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
> def sum(self):
> return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code which is 
> not related to spark / distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit the serialization hack only to 
> direct namedtuple subclasses, as in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204






[jira] [Commented] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2021-11-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447747#comment-17447747
 ] 

Hyukjin Kwon commented on SPARK-32079:
--

I'm working on this.

> PySpark <> Beam pickling issues for collections.namedtuple
> --
>
> Key: SPARK-32079
> URL: https://issues.apache.org/jira/browse/SPARK-32079
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Gerard Casas Saez
>Priority: Major
>
> PySpark's monkey-patching of namedtuple makes it difficult/impossible to unpickle 
> collections.namedtuple instances from outside of a PySpark environment.
>  
> When PySpark has been loaded into the environment, any time that you try to 
> pickle a namedtuple, you are only able to unpickle it from an environment 
> where the 
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
>  has been applied. 
> This conflicts directly with using Beam from a non-Spark environment 
> (namely Flink or Dataflow), making it impossible to use the pipeline if it 
> has a namedtuple loaded somewhere. 
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> Second pickled object can only be used from an environment with PySpark. 






[jira] [Updated] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32079:
-
Affects Version/s: 3.3.0
   (was: 3.0.0)

> PySpark <> Beam pickling issues for collections.namedtuple
> --
>
> Key: SPARK-32079
> URL: https://issues.apache.org/jira/browse/SPARK-32079
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Gerard Casas Saez
>Priority: Major
>
> PySpark's monkey-patching of namedtuple makes it difficult/impossible to unpickle 
> collections.namedtuple instances from outside of a PySpark environment.
>  
> When PySpark has been loaded into the environment, any time that you try to 
> pickle a namedtuple, you are only able to unpickle it from an environment 
> where the 
> [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
>  has been applied. 
> This conflicts directly with using Beam from a non-Spark environment 
> (namely Flink or Dataflow), making it impossible to use the pipeline if it 
> has a namedtuple loaded somewhere. 
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
> "ColumnInfo",
> [
> "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
> "type",  # type: Optional[ColumnType]  # pytype: 
> disable=ignored-type-comment
> ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> Second pickled object can only be used from an environment with PySpark. 






[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447744#comment-17447744
 ] 

Apache Spark commented on SPARK-36231:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34687

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
>
> Arithmetic operations of Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> 

[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447743#comment-17447743
 ] 

Apache Spark commented on SPARK-36231:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34687

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
>
> Arithmetic operations of Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> 

[jira] [Commented] (SPARK-37444) ALTER NAMESPACE ... SET LOCATION should handle empty location consistently across v1 and v2 command

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447727#comment-17447727
 ] 

Apache Spark commented on SPARK-37444:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34686

> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command
> ---
>
> Key: SPARK-37444
> URL: https://issues.apache.org/jira/browse/SPARK-37444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command






[jira] [Assigned] (SPARK-37444) ALTER NAMESPACE ... SET LOCATION should handle empty location consistently across v1 and v2 command

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37444:


Assignee: (was: Apache Spark)

> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command
> ---
>
> Key: SPARK-37444
> URL: https://issues.apache.org/jira/browse/SPARK-37444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command






[jira] [Assigned] (SPARK-37444) ALTER NAMESPACE ... SET LOCATION should handle empty location consistently across v1 and v2 command

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37444:


Assignee: Apache Spark

> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command
> ---
>
> Key: SPARK-37444
> URL: https://issues.apache.org/jira/browse/SPARK-37444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command






[jira] [Updated] (SPARK-37444) ALTER NAMESPACE ... SET LOCATION should handle empty location consistently across v1 and v2 command

2021-11-22 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-37444:
--
Description: ALTER NAMESPACE ... SET LOCATION should handle empty location 
consistently across v1 and v2 command  (was: ALTER NAMESPACE ... SET LOCATION 
handles empty location consistently across v1 and v2 command)

> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command
> ---
>
> Key: SPARK-37444
> URL: https://issues.apache.org/jira/browse/SPARK-37444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command






[jira] [Updated] (SPARK-37444) ALTER NAMESPACE ... SET LOCATION should handle empty location consistently across v1 and v2 command

2021-11-22 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-37444:
--
Summary: ALTER NAMESPACE ... SET LOCATION should handle empty location 
consistently across v1 and v2 command  (was: ALTER NAMESPACE ... SET LOCATION 
handles empty location consistently across v1 and v2 command)

> ALTER NAMESPACE ... SET LOCATION should handle empty location consistently 
> across v1 and v2 command
> ---
>
> Key: SPARK-37444
> URL: https://issues.apache.org/jira/browse/SPARK-37444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER NAMESPACE ... SET LOCATION handles empty location consistently across 
> v1 and v2 command






[jira] [Created] (SPARK-37444) ALTER NAMESPACE ... SET LOCATION handles empty location consistently across v1 and v2 command

2021-11-22 Thread Terry Kim (Jira)
Terry Kim created SPARK-37444:
-

 Summary: ALTER NAMESPACE ... SET LOCATION handles empty location 
consistently across v1 and v2 command
 Key: SPARK-37444
 URL: https://issues.apache.org/jira/browse/SPARK-37444
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Terry Kim


ALTER NAMESPACE ... SET LOCATION handles empty location consistently across v1 
and v2 command
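
For illustration, a hypothetical sketch of the statement whose empty-location handling should be made consistent. The namespace name is made up, and the expected behavior is deliberately left open here, since before this change the v1 and v2 code paths could react differently to an empty string:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo_ns")

# An empty location should be handled (e.g. rejected) the same way regardless
# of whether the command is routed to the v1 or the v2 implementation.
try:
    spark.sql("ALTER NAMESPACE demo_ns SET LOCATION ''")
except Exception as e:
    print(type(e).__name__, e)
{code}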






[jira] [Resolved] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30537.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34401
[https://github.com/apache/spark/pull/34401]

> toPandas gets wrong dtypes when applied on empty DF when Arrow enabled
> --
>
> Key: SPARK-30537
> URL: https://issues.apache.org/jira/browse/SPARK-30537
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: pralabhkumar
>Priority: Major
> Fix For: 3.3.0
>
>
> Same issue with SPARK-29188 persists when Arrow optimization is enabled.
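
A hypothetical repro sketch of the symptom, assuming a local SparkSession; the exact dtypes observed on the empty result may vary by version, so this only illustrates the shape of the problem:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame([(1, "a")], "id int, name string")

# Non-empty result: pandas dtypes are derived from the Spark schema.
print(df.toPandas().dtypes)

# Empty result: before the fix, toPandas() could fall back to default
# 'object' dtypes instead of dtypes derived from the schema.
print(df.filter("id < 0").toPandas().dtypes)
{code}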






[jira] [Assigned] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30537:


Assignee: pralabhkumar

> toPandas gets wrong dtypes when applied on empty DF when Arrow enabled
> --
>
> Key: SPARK-30537
> URL: https://issues.apache.org/jira/browse/SPARK-30537
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: pralabhkumar
>Priority: Major
>
> Same issue with SPARK-29188 persists when Arrow optimization is enabled.






[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447702#comment-17447702
 ] 

Hyukjin Kwon commented on SPARK-37391:
--

cc [~gaborgsomogyi] FYI

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user-impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!






[jira] [Resolved] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37337.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34608
[https://github.com/apache/spark/pull/34608]

> Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
> --
>
> Key: SPARK-37337
> URL: https://issues.apache.org/jira/browse/SPARK-37337
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> - Undeprecate (Spark)DataFrame.to_koalas 
> - Deprecate (Spark)DataFrame.to_pandas_like and introduce 
> (Spark)DataFrame.pandas_api instead.
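
A short usage sketch of the conversion API referred to in the description above, assuming a Spark build where pandas-on-Spark and {{DataFrame.pandas_api}} are available (e.g. 3.3.0):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.range(3)

# Convert a Spark DataFrame to a pandas-on-Spark DataFrame.
psdf = sdf.pandas_api()

# And convert it back to a plain Spark DataFrame.
sdf2 = psdf.to_spark()
print(type(psdf), type(sdf2))
{code}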






[jira] [Assigned] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion

2021-11-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37337:


Assignee: Xinrong Meng

> Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
> --
>
> Key: SPARK-37337
> URL: https://issues.apache.org/jira/browse/SPARK-37337
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> - Undeprecate (Spark)DataFrame.to_koalas 
> - Deprecate (Spark)DataFrame.to_pandas_like and introduce 
> (Spark)DataFrame.pandas_api instead.






[jira] [Commented] (SPARK-37443) Provide a profiler for Python/Pandas UDFs

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447682#comment-17447682
 ] 

Apache Spark commented on SPARK-37443:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34685

> Provide a profiler for Python/Pandas UDFs
> -
>
> Key: SPARK-37443
> URL: https://issues.apache.org/jira/browse/SPARK-37443
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently a profiler is provided for only {{RDD}} operations, but providing a 
> profiler for Python/Pandas UDFs would be great.






[jira] [Assigned] (SPARK-37443) Provide a profiler for Python/Pandas UDFs

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37443:


Assignee: (was: Apache Spark)

> Provide a profiler for Python/Pandas UDFs
> -
>
> Key: SPARK-37443
> URL: https://issues.apache.org/jira/browse/SPARK-37443
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently a profiler is provided for only {{RDD}} operations, but providing a 
> profiler for Python/Pandas UDFs would be great.






[jira] [Assigned] (SPARK-37443) Provide a profiler for Python/Pandas UDFs

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37443:


Assignee: Apache Spark

> Provide a profiler for Python/Pandas UDFs
> -
>
> Key: SPARK-37443
> URL: https://issues.apache.org/jira/browse/SPARK-37443
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Currently a profiler is provided for only {{RDD}} operations, but providing a 
> profiler for Python/Pandas UDFs would be great.






[jira] [Created] (SPARK-37443) Provide a profiler for Python/Pandas UDFs

2021-11-22 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-37443:
-

 Summary: Provide a profiler for Python/Pandas UDFs
 Key: SPARK-37443
 URL: https://issues.apache.org/jira/browse/SPARK-37443
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Currently a profiler is provided for only {{RDD}} operations, but providing a 
profiler for Python/Pandas UDFs would be great.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37439) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-37439.
--
Resolution: Not A Problem

> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;; despite of time-based window
> 
>
> Key: SPARK-37439
> URL: https://issues.apache.org/jira/browse/SPARK-37439
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Ilya
>Priority: Major
>
> Initially posted here: 
> [https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]
>  
> I'm doing window-based sorting for Spark Structured Streaming:
>  
> {{val filterWindow: WindowSpec = Window  .partitionBy("key")
>   .orderBy($"time")
> controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
>   withColumn("datetime", date_trunc("second", current_timestamp())).
>   withColumn("time", current_timestamp()).
>   withColumn("temp_rank", rank().over(filterWindow))
>   .filter(col("temp_rank") === 1)
>   .drop("temp_rank").
>   withColumn("digitalTwinId", lit(digitalTwinId)).
>   withWatermark("datetime", "10 seconds")}}
> I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its 
> type as {{StructField(time,TimestampType,true)}}
> Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
> with the following exception, as the field is clearly time-based?
>  
> {{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing 
> SparkSession; some spark core configurations may not take effect.
> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
> windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
> +- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37439) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447616#comment-17447616
 ] 

Jungtaek Lim commented on SPARK-37439:
--

Hi,

By "time window" we mean the time windows that Structured Streaming supports natively:

[http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows]

A window spec is not supported: it defines the window boundary in a non-time-based 
manner, by row offsets, which is hard to track in a streaming context.
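
For illustration, a minimal sketch of the natively supported pattern, a time-window 
aggregation over the event-time column (assuming the {{key}}, {{value}} and 
{{datetime}} columns from the report; this is not a drop-in replacement for the 
rank-based dedup):

{code:java}
import org.apache.spark.sql.functions._

// Aggregate per key over a 10-second event-time window instead of ranking
// over a WindowSpec; this shape is supported on streaming DataFrames.
val windowed = controlDataFrame
  .withWatermark("datetime", "10 seconds")
  .groupBy(window(col("datetime"), "10 seconds"), col("key"))
  .agg(max(col("value")).as("value"))
{code}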

We have a user mailing list; please post there if you have further questions.

[http://spark.apache.org/community.html]

Thanks!

> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;; despite of time-based window
> 
>
> Key: SPARK-37439
> URL: https://issues.apache.org/jira/browse/SPARK-37439
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Ilya
>Priority: Major
>
> Initially posted here: 
> [https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]
>  
> I'm doing window-based sorting for Spark Structured Streaming:
>  
> {{val filterWindow: WindowSpec = Window  .partitionBy("key")
>   .orderBy($"time")
> controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
>   withColumn("datetime", date_trunc("second", current_timestamp())).
>   withColumn("time", current_timestamp()).
>   withColumn("temp_rank", rank().over(filterWindow))
>   .filter(col("temp_rank") === 1)
>   .drop("temp_rank").
>   withColumn("digitalTwinId", lit(digitalTwinId)).
>   withWatermark("datetime", "10 seconds")}}
> I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its 
> type as {{StructField(time,TimestampType,true)}}
> Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
> with the following exception, as the field is clearly time-based?
>  
> {{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing 
> SparkSession; some spark core configurations may not take effect.
> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
> windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
> +- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37441) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-37441.
--
Resolution: Duplicate

> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;; despite of time-based window
> 
>
> Key: SPARK-37441
> URL: https://issues.apache.org/jira/browse/SPARK-37441
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Ilya
>Priority: Major
>
> Initially posted here: 
> [https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]
>  
> I'm doing window-based sorting for Spark Structured Streaming:
>  
> {{val filterWindow: WindowSpec = Window  .partitionBy("key")
>   .orderBy($"time")
> controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
>   withColumn("datetime", date_trunc("second", current_timestamp())).
>   withColumn("time", current_timestamp()).
>   withColumn("temp_rank", rank().over(filterWindow))
>   .filter(col("temp_rank") === 1)
>   .drop("temp_rank").
>   withColumn("digitalTwinId", lit(digitalTwinId)).
>   withWatermark("datetime", "10 seconds")}}
> I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its 
> type as {{StructField(time,TimestampType,true)}}
> Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
> with the following exception, as the field is clearly time-based?
>  
> {{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing 
> SparkSession; some spark core configurations may not take effect.
> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
> windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
> +- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37440) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-37440.
--
Resolution: Duplicate

> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;; despite of time-based window
> 
>
> Key: SPARK-37440
> URL: https://issues.apache.org/jira/browse/SPARK-37440
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Ilya
>Priority: Major
>
> Initially posted here: 
> [https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]
>  
> I'm doing window-based sorting for Spark Structured Streaming:
>  
> {{val filterWindow: WindowSpec = Window  .partitionBy("key")
>   .orderBy($"time")
> controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
>   withColumn("datetime", date_trunc("second", current_timestamp())).
>   withColumn("time", current_timestamp()).
>   withColumn("temp_rank", rank().over(filterWindow))
>   .filter(col("temp_rank") === 1)
>   .drop("temp_rank").
>   withColumn("digitalTwinId", lit(digitalTwinId)).
>   withWatermark("datetime", "10 seconds")}}
> I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its 
> type as {{StructField(time,TimestampType,true)}}
> Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
> with the following exception, as the field is clearly time-based?
>  
> {{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing 
> SparkSession; some spark core configurations may not take effect.
> org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
> supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
> windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
> +- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement

2021-11-22 Thread Kevin Appel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447578#comment-17447578
 ] 

Kevin Appel commented on SPARK-37259:
-

It would be difficult to actually split the query into its parts and align one 
of the SELECTs with the one hard-coded in the wrapper; there is also the issue 
of needing to patch the dialect to change how it passes the query to get the 
schema, and of finding a way to get that schema without running the query twice.

The query that uses temp tables (in SQL Server, either #temptable or 
##temptable) is also still an issue because of how it gets wrapped in the 
SELECT; similarly, if the schema check runs the query, it actually creates the 
temp tables, and the real query then fails because the tables already exist.

The other item is that the query option always rewrites the text you pass in, 
so any fix would need to be based on the dbtable option, which only trims the 
value; the query option wraps it as:
s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
 

 

> JDBC read is always going to wrap the query in a select statement
> -
>
> Key: SPARK-37259
> URL: https://issues.apache.org/jira/browse/SPARK-37259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kevin Appel
>Priority: Major
>
> The read jdbc is wrapping the query it sends to the database server inside a 
> select statement and there is no way to override this currently.
> Initially I ran into this issue when trying to run a CTE query against SQL 
> server and it fails, the details of the failure is in these cases:
> [https://github.com/microsoft/mssql-jdbc/issues/1340]
> [https://github.com/microsoft/mssql-jdbc/issues/1657]
> [https://github.com/microsoft/sql-spark-connector/issues/147]
> https://issues.apache.org/jira/browse/SPARK-32825
> https://issues.apache.org/jira/browse/SPARK-34928
> I started to patch the code to get the query to run and ran into a few 
> different items; if there is a way to add these features to allow this code 
> path to run, it would be extremely helpful for running these types of edge-case 
> queries. These are basic examples; the actual queries are much more complex 
> and would require significant time to rewrite.
> Inside JDBCOptions.scala the query is set to one of the following; using the 
> dbtable option allows the query to be passed without modification:
>  
> {code:java}
> name.trim
> or
> s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
> {code}
>  
> Inside JDBCRelation.scala this is going to try to get the schema for this 
> query, and this ends up running dialect.getSchemaQuery which is doing:
> {code:java}
> s"SELECT * FROM $table WHERE 1=0"{code}
> Overriding the dialect here and initially just passing back the $table gets 
> past this check and leads to the next issue, which is in the compute function 
> in JDBCRDD.scala:
>  
> {code:java}
> val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} 
> $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause"
>  
> {code}
>  
> For these two cases, a CTE query and a query using temp tables, finding out 
> the schema is difficult without actually running the query; for the temp-table 
> case, running the query during the schema check creates the table, so the 
> actual query then fails because the table already exists.
>  
> The way I patched these is by doing these two items:
> JDBCRDD.scala (compute)
>  
> {code:java}
>     val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", 
> "false").toBoolean
>     val sqlText = if (runQueryAsIs) {
>       s"${options.tableOrQuery}"
>     } else {
>       s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause"
>     }
> {code}
> JDBCRelation.scala (getSchema)
> {code:java}
> val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", 
> "false").toBoolean
>     if (useCustomSchema) {
>       val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", 
> "").toString
>       val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema)
>       logInfo(s"Going to return the new $newSchema because useCustomSchema is 
> $useCustomSchema and passed in $myCustomSchema")
>       newSchema
>     } else {
>       val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
>       jdbcOptions.customSchema match {
>       case Some(customSchema) => JdbcUtils.getCustomSchema(
>         tableSchema, customSchema, resolver)
>       case None => tableSchema
>       }
>     }{code}
>  
> This allows the query to run as-is, by using the dbtable option and then 
> providing a custom schema that bypasses the dialect schema check.
>  
> Test queries
>  
> {code:java}
> query1 = """ 
> SELECT 1 as DummyCOL
> """
> 

[jira] [Assigned] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37442:


Assignee: (was: Apache Spark)

> In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the 
> table that is larger than 8GB: 8 GB" failure
> 
>
> Key: SPARK-37442
> URL: https://issues.apache.org/jira/browse/SPARK-37442
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Michael Chen
>Priority: Major
>
> There is a period in time where an InMemoryRelation will have the cached 
> buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> 
> size in bytes reported by accumulators). When AQE is enabled, it is possible 
> that join planning strategies will happen in this window. In this scenario, 
> join children sizes including InMemoryRelation are greatly underestimated and 
> a broadcast join can be planned when it shouldn't be. We have seen scenarios 
> where a broadcast join is planned with the builder size greater than 8GB 
> because at planning time, the optimizer believes the InMemoryRelation is 0 
> bytes.
> Here is an example test case where the broadcast threshold is being ignored. 
> It can mimic the 8GB error by increasing the size of the tables.
> {code:java}
> withSQLConf(
>   SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") {
>   // Spark estimates a string column as 20 bytes so with 60k rows, these 
> relations should be
>   // estimated at ~120m bytes which is greater than the broadcast join 
> threshold
>   Seq.fill(6)("a").toDF("key")
> .createOrReplaceTempView("temp")
>   Seq.fill(6)("b").toDF("key")
> .createOrReplaceTempView("temp2")
>   Seq("a").toDF("key").createOrReplaceTempView("smallTemp")
>   spark.sql("SELECT key as newKey FROM temp").persist()
>   val query =
>   s"""
>  |SELECT t3.newKey
>  |FROM
>  |  (SELECT t1.newKey
>  |  FROM (SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM smallTemp) as t2
>  |ON t1.newKey = t2.key
>  |  ) as t3
>  |  JOIN
>  |  (SELECT key FROM temp2) as t4
>  |  ON t3.newKey = t4.key
>  |UNION
>  |SELECT t1.newKey
>  |FROM
>  |(SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM temp2) as t2
>  |ON t1.newKey = t2.key
>  |""".stripMargin
>   val df = spark.sql(query)
>   df.collect()
>   val adaptivePlan = df.queryExecution.executedPlan
>   val bhj = findTopLevelBroadcastHashJoin(adaptivePlan)
>   assert(bhj.length == 1) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37442:


Assignee: Apache Spark

> In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the 
> table that is larger than 8GB: 8 GB" failure
> 
>
> Key: SPARK-37442
> URL: https://issues.apache.org/jira/browse/SPARK-37442
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Michael Chen
>Assignee: Apache Spark
>Priority: Major
>
> There is a period in time where an InMemoryRelation will have the cached 
> buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> 
> size in bytes reported by accumulators). When AQE is enabled, it is possible 
> that join planning strategies will happen in this window. In this scenario, 
> join children sizes including InMemoryRelation are greatly underestimated and 
> a broadcast join can be planned when it shouldn't be. We have seen scenarios 
> where a broadcast join is planned with the builder size greater than 8GB 
> because at planning time, the optimizer believes the InMemoryRelation is 0 
> bytes.
> Here is an example test case where the broadcast threshold is being ignored. 
> It can mimic the 8GB error by increasing the size of the tables.
> {code:java}
> withSQLConf(
>   SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") {
>   // Spark estimates a string column as 20 bytes so with 60k rows, these 
> relations should be
>   // estimated at ~120m bytes which is greater than the broadcast join 
> threshold
>   Seq.fill(6)("a").toDF("key")
> .createOrReplaceTempView("temp")
>   Seq.fill(6)("b").toDF("key")
> .createOrReplaceTempView("temp2")
>   Seq("a").toDF("key").createOrReplaceTempView("smallTemp")
>   spark.sql("SELECT key as newKey FROM temp").persist()
>   val query =
>   s"""
>  |SELECT t3.newKey
>  |FROM
>  |  (SELECT t1.newKey
>  |  FROM (SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM smallTemp) as t2
>  |ON t1.newKey = t2.key
>  |  ) as t3
>  |  JOIN
>  |  (SELECT key FROM temp2) as t4
>  |  ON t3.newKey = t4.key
>  |UNION
>  |SELECT t1.newKey
>  |FROM
>  |(SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM temp2) as t2
>  |ON t1.newKey = t2.key
>  |""".stripMargin
>   val df = spark.sql(query)
>   df.collect()
>   val adaptivePlan = df.queryExecution.executedPlan
>   val bhj = findTopLevelBroadcastHashJoin(adaptivePlan)
>   assert(bhj.length == 1) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447577#comment-17447577
 ] 

Apache Spark commented on SPARK-37442:
--

User 'ChenMichael' has created a pull request for this issue:
https://github.com/apache/spark/pull/34684

> In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the 
> table that is larger than 8GB: 8 GB" failure
> 
>
> Key: SPARK-37442
> URL: https://issues.apache.org/jira/browse/SPARK-37442
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Michael Chen
>Priority: Major
>
> There is a period in time where an InMemoryRelation will have the cached 
> buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> 
> size in bytes reported by accumulators). When AQE is enabled, it is possible 
> that join planning strategies will happen in this window. In this scenario, 
> join children sizes including InMemoryRelation are greatly underestimated and 
> a broadcast join can be planned when it shouldn't be. We have seen scenarios 
> where a broadcast join is planned with the builder size greater than 8GB 
> because at planning time, the optimizer believes the InMemoryRelation is 0 
> bytes.
> Here is an example test case where the broadcast threshold is being ignored. 
> It can mimic the 8GB error by increasing the size of the tables.
> {code:java}
> withSQLConf(
>   SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") {
>   // Spark estimates a string column as 20 bytes so with 60k rows, these 
> relations should be
>   // estimated at ~120m bytes which is greater than the broadcast join 
> threshold
>   Seq.fill(6)("a").toDF("key")
> .createOrReplaceTempView("temp")
>   Seq.fill(6)("b").toDF("key")
> .createOrReplaceTempView("temp2")
>   Seq("a").toDF("key").createOrReplaceTempView("smallTemp")
>   spark.sql("SELECT key as newKey FROM temp").persist()
>   val query =
>   s"""
>  |SELECT t3.newKey
>  |FROM
>  |  (SELECT t1.newKey
>  |  FROM (SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM smallTemp) as t2
>  |ON t1.newKey = t2.key
>  |  ) as t3
>  |  JOIN
>  |  (SELECT key FROM temp2) as t4
>  |  ON t3.newKey = t4.key
>  |UNION
>  |SELECT t1.newKey
>  |FROM
>  |(SELECT key as newKey FROM temp) as t1
>  |JOIN
>  |(SELECT key FROM temp2) as t2
>  |ON t1.newKey = t2.key
>  |""".stripMargin
>   val df = spark.sql(query)
>   df.collect()
>   val adaptivePlan = df.queryExecution.executedPlan
>   val bhj = findTopLevelBroadcastHashJoin(adaptivePlan)
>   assert(bhj.length == 1) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure

2021-11-22 Thread Michael Chen (Jira)
Michael Chen created SPARK-37442:


 Summary: In AQE, wrong InMemoryRelation size estimation causes 
"Cannot broadcast the table that is larger than 8GB: 8 GB" failure
 Key: SPARK-37442
 URL: https://issues.apache.org/jira/browse/SPARK-37442
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 3.2.0, 3.1.1
Reporter: Michael Chen


There is a period in time where an InMemoryRelation will have the cached 
buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> 
size in bytes reported by accumulators). When AQE is enabled, it is possible 
that join planning strategies will happen in this window. In this scenario, 
join children sizes including InMemoryRelation are greatly underestimated and a 
broadcast join can be planned when it shouldn't be. We have seen scenarios 
where a broadcast join is planned with the builder size greater than 8GB 
because at planning time, the optimizer believes the InMemoryRelation is 0 
bytes.

Here is an example test case where the broadcast threshold is being ignored. It 
can mimic the 8GB error by increasing the size of the tables.
{code:java}
withSQLConf(
  SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") {
  // Spark estimates a string column as 20 bytes so with 60k rows, these 
relations should be
  // estimated at ~120m bytes which is greater than the broadcast join threshold
  Seq.fill(6)("a").toDF("key")
.createOrReplaceTempView("temp")
  Seq.fill(6)("b").toDF("key")
.createOrReplaceTempView("temp2")

  Seq("a").toDF("key").createOrReplaceTempView("smallTemp")
  spark.sql("SELECT key as newKey FROM temp").persist()

  val query =
  s"""
 |SELECT t3.newKey
 |FROM
 |  (SELECT t1.newKey
 |  FROM (SELECT key as newKey FROM temp) as t1
 |JOIN
 |(SELECT key FROM smallTemp) as t2
 |ON t1.newKey = t2.key
 |  ) as t3
 |  JOIN
 |  (SELECT key FROM temp2) as t4
 |  ON t3.newKey = t4.key
 |UNION
 |SELECT t1.newKey
 |FROM
 |(SELECT key as newKey FROM temp) as t1
 |JOIN
 |(SELECT key FROM temp2) as t2
 |ON t1.newKey = t2.key
 |""".stripMargin
  val df = spark.sql(query)
  df.collect()
  val adaptivePlan = df.queryExecution.executedPlan
  val bhj = findTopLevelBroadcastHashJoin(adaptivePlan)
  assert(bhj.length == 1) {code}
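
A hedged workaround sketch (not part of the report): materializing the cache 
before the join query is planned, so the InMemoryRelation statistics come from 
real cached buffers rather than an empty placeholder. Whether this avoids the 
mis-planning depends on timing, so it is a mitigation rather than a fix.

{code:java}
// Sketch only: force the cached relation to materialize up front so the size
// accumulators that back its statistics are populated before AQE plans joins.
val cached = spark.sql("SELECT key as newKey FROM temp").persist()
cached.count()              // fills the cached buffers and size accumulators

val df = spark.sql(query)   // `query` as defined above
df.collect()
{code}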
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-22 Thread Danny Guinther (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447529#comment-17447529
 ] 

Danny Guinther commented on SPARK-37391:


Here's an example stacktrace for one of the blocked threads:

{{org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:92)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:63)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$$Lambda$6294/1994845663.apply(Unknown
 Source)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)}}
{{org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)}}
{{org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:444)}}
{{org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:400)}}
{{org.apache.spark.sql.DataFrameReader$$Lambda$6224/1118373872.apply(Unknown 
Source)}}
{{scala.Option.getOrElse(Option.scala:189)}}
{{org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:400)}}
{{org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:273)}}
{{}}
{{scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)}}
{{scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)}}
{{scala.concurrent.Future$$$Lambda$442/341778327.apply(Unknown Source)}}
{{scala.util.Success.$anonfun$map$1(Try.scala:255)}}
{{scala.util.Success.map(Try.scala:213)}}
{{scala.concurrent.Future.$anonfun$map$1(Future.scala:292)}}
{{scala.concurrent.Future$$Lambda$443/424848797.apply(Unknown Source)}}
{{scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)}}
{{scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)}}
{{scala.concurrent.impl.Promise$$Lambda$444/1710905079.apply(Unknown Source)}}
{{scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)}}
{{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
{{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
{{java.lang.Thread.run(Thread.java:748)}}

 

 

The stacktrace from the thread that is holding the lock looks like so:

{{java.net.SocketInputStream.socketRead0(Native Method)}}
{{java.net.SocketInputStream.socketRead(SocketInputStream.java:116)}}
{{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
{{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
{{com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1019)}}
{{com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1009)}}
{{sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)}}
{{sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:470)}}
{{sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70)}}
{{sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1364)}}
{{sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)}}
{{sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:973)}}
{{com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2058)}}
{{com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6617) => 
holding Monitor(com.microsoft.sqlserver.jdbc.TDSReader@1035497922})}}
{{com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7805)}}
{{com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7768)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:5332)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:4066)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:85)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:4004)}}
{{com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3272)
 => holding Monitor(java.lang.Object@564746804})}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:2768)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2418)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2265)}}
{{com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1291)}}
{{com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:881)}}

[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-22 Thread Danny Guinther (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447512#comment-17447512
 ] 

Danny Guinther commented on SPARK-37391:


[~hyukjin.kwon] , sorry, I seem to have gotten confused when identifying the 
source of the regression. I have updated the title and description to reflect 
the true source of the issue. I'm inclined to blame this change: 
https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58

 

I'm sorry, but I don't have the capacity to provide a self-contained 
reproduction of the issue. Hopefully the problem is obvious enough that you 
will be able to see what is going on from the anecdotal evidence I can provide.

The introduction of SecurityConfigurationLock.synchronized prevents a given 
JDBC Driver from establishing more than one connection at a time (or at least 
severely limits the concurrency). This is a significant bottleneck for 
applications that use a single JDBC driver to establish many database 
connections.
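
As an illustration only (a simplified sketch, not Spark's actual code), a single 
process-wide lock around connection creation serializes every JDBC login, which 
matches the blocked-thread pattern shown in the attached thread dump:

{code:java}
object ConnectionFactory {
  // Simplified stand-in for the lock object referenced above.
  private object SecurityConfigurationLock

  def createConnection(url: String, props: java.util.Properties): java.sql.Connection =
    SecurityConfigurationLock.synchronized {
      // While one thread is inside this block (e.g. waiting on a slow SQL
      // Server login), every other thread that wants a connection is parked.
      java.sql.DriverManager.getConnection(url, props)
    }
}
{code}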

The anecdotal evidence I can offer to support this claim:

1. I've attached a screenshot of some dashboards we use to monitor the QA 
deployment of the application in question. These graphs come from a 4.5 hour 
window where I had spark 3.1.2 deployed to QA. On the left side of the graph we 
were running Spark 2.4.5; in the middle we were running spark 3.1.2; and on the 
right side of the graph we are running spark 3.0.1.
 # The "Success Rate", "CountActiveTasks", "CountActiveJobs", 
"CountTableTenantJobStart", "CountTableTenantJobEnd" graphs all aim to 
demonstrate that with the deployment of spark 3.1.2 the throughput of the 
application was significantly reduced across the board.
 # The "Overall Active Thread Count", "Count Active Executors", and 
"CountDeadExecutors" graphs all aim to evidence that there was no change in the 
number of resources allocated to do work.
 # The "Max MinsSinceLastAttempt" graph should normally be a flat line unless 
the application is falling behind on the work that it is scheduled to do. It 
can be seen during the period of the spark 3.1.2 deployment the application is 
falling behind at a linear rate and begins to recover once spark 3.0.1 is 
deployed.

!spark-regression-dashes.jpg!

 

2. I've attached a screenshot of the thread dump from the spark driver process. 
It can be seen that many, many threads are blocked waiting for 
SecurityConfigurationLock. The screenshot only shows a handful of threads but 
there are 98 threads in total blocked waiting for the SecurityConfigurationLock.

!so-much-blocking.jpg!

 

It's worth noting that our QA deployment does significantly less work than our 
production deployment; if the QA deployment can't keep up then the production 
deployment has no chance. On the bright side, I had success updating the 
production deployment to spark 3.0.1 and that seems to be stable. 
Unfortunately, we use Databricks for our spark vendor and the LTS release they 
have that supports spark 3.0.1 is only scheduled to be maintained until 
September 2022, so we can't avoid this regression forever.

 

If I can answer any questions or provide any more info, please let me know. 
Thanks in advance!

 

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-22 Thread Danny Guinther (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Guinther updated SPARK-37391:
---
Attachment: so-much-blocking.jpg

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-22 Thread Danny Guinther (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Guinther updated SPARK-37391:
---
Attachment: spark-regression-dashes.jpg

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37441) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Ilya (Jira)
Ilya created SPARK-37441:


 Summary: org.apache.spark.sql.AnalysisException: Non-time-based 
windows are not supported on streaming DataFrames/Datasets;; despite of 
time-based window
 Key: SPARK-37441
 URL: https://issues.apache.org/jira/browse/SPARK-37441
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Ilya


Initially posted here: 
[https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]

 

I'm doing window-based sorting for Spark Structured Streaming:
 
{{val filterWindow: WindowSpec = Window  .partitionBy("key")
  .orderBy($"time")

controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
  withColumn("datetime", date_trunc("second", current_timestamp())).
  withColumn("time", current_timestamp()).
  withColumn("temp_rank", rank().over(filterWindow))
  .filter(col("temp_rank") === 1)
  .drop("temp_rank").
  withColumn("digitalTwinId", lit(digitalTwinId)).
  withWatermark("datetime", "10 seconds")}}

I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its type 
as {{StructField(time,TimestampType,true)}}

Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
with the following exception, as the field is clearly time-based?
 
{{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing SparkSession; 
some spark core configurations may not take effect.

org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
+- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37440) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Ilya (Jira)
Ilya created SPARK-37440:


 Summary: org.apache.spark.sql.AnalysisException: Non-time-based 
windows are not supported on streaming DataFrames/Datasets;; despite of 
time-based window
 Key: SPARK-37440
 URL: https://issues.apache.org/jira/browse/SPARK-37440
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Ilya


Initially posted here: 
[https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]

 

I'm doing window-based sorting for Spark Structured Streaming:
 
{{val filterWindow: WindowSpec = Window  .partitionBy("key")
  .orderBy($"time")

controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
  withColumn("datetime", date_trunc("second", current_timestamp())).
  withColumn("time", current_timestamp()).
  withColumn("temp_rank", rank().over(filterWindow))
  .filter(col("temp_rank") === 1)
  .drop("temp_rank").
  withColumn("digitalTwinId", lit(digitalTwinId)).
  withWatermark("datetime", "10 seconds")}}

I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its type 
as {{StructField(time,TimestampType,true)}}

Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
with the following exception, as the field is clearly time-based?
 
{{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing SparkSession; 
some spark core configurations may not take effect.

org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
+- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37439) org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; despite of time-based window

2021-11-22 Thread Ilya (Jira)
Ilya created SPARK-37439:


 Summary: org.apache.spark.sql.AnalysisException: Non-time-based 
windows are not supported on streaming DataFrames/Datasets;; despite of 
time-based window
 Key: SPARK-37439
 URL: https://issues.apache.org/jira/browse/SPARK-37439
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Ilya


Initially posted here: 
[https://stackoverflow.com/questions/70062355/org-apache-spark-sql-analysisexception-non-time-based-windows-are-not-supported]

 

I'm doing window-based sorting for Spark Structured Streaming:
 
{{val filterWindow: WindowSpec = Window  .partitionBy("key")
  .orderBy($"time")

controlDataFrame=controlDataFrame.withColumn("Make Coffee", $"value").
  withColumn("datetime", date_trunc("second", current_timestamp())).
  withColumn("time", current_timestamp()).
  withColumn("temp_rank", rank().over(filterWindow))
  .filter(col("temp_rank") === 1)
  .drop("temp_rank").
  withColumn("digitalTwinId", lit(digitalTwinId)).
  withWatermark("datetime", "10 seconds")}}

I'm obtaining {{time}} as {{current_timestamp()}} and in the schema I see its type 
as {{StructField(time,TimestampType,true)}}

Why doesn't Spark 3.0 allow me to do the window operation based on it, failing 
with the following exception, as the field is clearly time-based?
 
{{21/11/22 10:34:03 WARN SparkSession$Builder: Using an existing SparkSession; 
some spark core configurations may not take effect.

org.apache.spark.sql.AnalysisException: Non-time-based windows are not 
supported on streaming DataFrames/Datasets;;Window [rank(time#163) 
windowspecdefinition(key#150, time#163 ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
temp_rank#171], [key#150], [time#163 ASC NULLS FIRST]
+- Project [key#150, value#151, Make Coffee#154, datetime#158, time#163]}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default

2021-11-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37192:
---

Assignee: PengLei

> Migrate SHOW TBLPROPERTIES to use V2 command by default
> ---
>
> Key: SPARK-37192
> URL: https://issues.apache.org/jira/browse/SPARK-37192
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate SHOW TBLPROPERTIES to use V2 command by default



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default

2021-11-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37192.
-
Resolution: Fixed

Issue resolved by pull request 34666
[https://github.com/apache/spark/pull/34666]

> Migrate SHOW TBLPROPERTIES to use V2 command by default
> ---
>
> Key: SPARK-37192
> URL: https://issues.apache.org/jira/browse/SPARK-37192
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate SHOW TBLPROPERTIES to use V2 command by default



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-22 Thread Danny Guinther (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Guinther updated SPARK-37391:
---
Summary: SIGNIFICANT bottleneck introduced by fix for SPARK-32001  (was: 
SIGNIFICANT bottleneck introduced by fix for SPARK-34497)

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-34497

2021-11-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447162#comment-17447162
 ] 

Hyukjin Kwon edited comment on SPARK-37391 at 11/22/21, 3:21 PM:
-

[~danny-seismic], it would be great to assess this issue further with a better 
problem description and, preferably, a self-contained reproducer.


was (Author: hyukjin.kwon):
[~danny-seismic], it would be great to assess this issue futher with problem 
description and preferably self-contained reproducer

> SIGNIFICANT bottleneck introduced by fix for SPARK-34497
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-34497

2021-11-22 Thread Danny Guinther (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Guinther updated SPARK-37391:
---
Description: 
The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
[https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
 ) does not seem to have considered the reality that some apps may rely on being 
able to establish many JDBC connections simultaneously for performance reasons.

The fix forces concurrency to 1 when establishing database connections and that 
strikes me as a *significant* user impacting change and a *significant* 
bottleneck.

Can anyone propose a workaround for this? I have an app that makes connections 
to thousands of databases and I can't upgrade to any version >3.1.x because of 
this significant bottleneck.

 

Thanks in advance for your help!

  was:
The fix for SPARK-34497 ( 
[https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58|https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
 ) does not seem to have consider the reality that some apps may rely on being 
able to establish many JDBC connections simultaneously for performance reasons.

The fix forces concurrency to 1 when establishing database connections and that 
strikes me as a *significant* user impacting change and a *significant* 
bottleneck.

Can anyone propose a workaround for this? I have an app that makes connections 
to thousands of databases and I can't upgrade to any version >3.1.x because of 
this significant bottleneck.

 

Thanks in advance for your help!


> SIGNIFICANT bottleneck introduced by fix for SPARK-34497
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-34497

2021-11-22 Thread Danny Guinther (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Guinther updated SPARK-37391:
---
Description: 
The fix for SPARK-34497 ( 
[https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58|https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
 ) does not seem to have considered the reality that some apps may rely on being 
able to establish many JDBC connections simultaneously for performance reasons.

The fix forces concurrency to 1 when establishing database connections and that 
strikes me as a *significant* user impacting change and a *significant* 
bottleneck.

Can anyone propose a workaround for this? I have an app that makes connections 
to thousands of databases and I can't upgrade to any version >3.1.x because of 
this significant bottleneck.

 

Thanks in advance for your help!

  was:
The fix for SPARK-34497 ( [https://github.com/apache/spark/pull/31622] ) does 
not seem to have considered the reality that some apps may rely on being able to 
establish many JDBC connections simultaneously for performance reasons.

The fix forces concurrency to 1 when establishing database connections and that 
strikes me as a *significant* user impacting change and a *significant* 
bottleneck.

Can anyone propose a workaround for this? I have an app that makes connections 
to thousands of databases and I can't upgrade to any version >3.1.x because of 
this significant bottleneck.

 

Thanks in advance for your help!


> SIGNIFICANT bottleneck introduced by fix for SPARK-34497
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
>
> The fix for SPARK-34497 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58|https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-22 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-37393.
--
Resolution: Duplicate

> Inline annotations for {ml, mllib}/common.py
> 
>
> Key: SPARK-37393
> URL: https://issues.apache.org/jira/browse/SPARK-37393
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> This will allow us to run type checks against those files themselves.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447436#comment-17447436
 ] 

Apache Spark commented on SPARK-37283:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34683

> Don't try to store a V1 table which contains ANSI intervals in Hive 
> compatible format
> -
>
> Key: SPARK-37283
> URL: https://issues.apache.org/jira/browse/SPARK-37283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> If a table being created contains a column of ANSI interval types and the 
> underlying file format has a corresponding Hive SerDe (e.g. Parquet),
> `HiveExternalCatalog` tries to store the table in a Hive-compatible format.
> But, as ANSI interval types in Spark and interval type in Hive are not 
> compatible (Hive only supports interval_year_month and interval_day_time), 
> the following warning with stack trace will be logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist 
> `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore 
> in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found.
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
>   at 
> 

[jira] [Commented] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447435#comment-17447435
 ] 

Apache Spark commented on SPARK-37283:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34683

> Don't try to store a V1 table which contains ANSI intervals in Hive 
> compatible format
> -
>
> Key: SPARK-37283
> URL: https://issues.apache.org/jira/browse/SPARK-37283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> If a table being created contains a column of ANSI interval types and the 
> underlying file format has a corresponding Hive SerDe (e.g. Parquet),
> `HiveExternalCatalog` tries to store the table in a Hive-compatible format.
> But, as ANSI interval types in Spark and interval type in Hive are not 
> compatible (Hive only supports interval_year_month and interval_day_time), 
> the following warning with stack trace will be logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist 
> `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore 
> in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found.
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
>   at 
> 

[jira] [Commented] (SPARK-35885) Use keyserver.ubuntu.com as a keyserver for CRAN

2021-11-22 Thread Marek Novotny (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447386#comment-17447386
 ] 

Marek Novotny commented on SPARK-35885:
---

[~dongjoon] FYI, buster-cran35 is signed by a different key (fingerprint 
'95C0FAF38DB3CCAD0C080A7BDC78B2DDEABC47B7') since 17th November (see 
[http://cloud.r-project.org/bin/linux/debian/]). According to the current 
repository state, this change might affect the branch 
[branch-3.0|https://github.com/apache/spark/blob/branch-3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile#L32].

> Use keyserver.ubuntu.com as a keyserver for CRAN
> 
>
> Key: SPARK-35885
> URL: https://issues.apache.org/jira/browse/SPARK-35885
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, R
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> This issue aims to use `keyserver.ubuntu.com` as a keyserver for CRAN.
> K8s SparkR docker image build fails because neither of the keyservers tried 
> below (keys.gnupg.net and keys.openpgp.org) works correctly.
> {code}
> $ docker run -it --rm openjdk:11 /bin/bash
> root@3e89a8d05378:/# echo "deb http://cloud.r-project.org/bin/linux/debian 
> buster-cran35/" >> /etc/apt/sources.list
> root@3e89a8d05378:/# (apt-key adv --keyserver keys.gnupg.net --recv-key 
> 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' || apt-key adv --keyserver 
> keys.openpgp.org --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF')
> Executing: /tmp/apt-key-gpghome.8lNIiUuhoE/gpg.1.sh --keyserver 
> keys.gnupg.net --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
> gpg: keyserver receive failed: No name
> Executing: /tmp/apt-key-gpghome.stxb8XUlx8/gpg.1.sh --keyserver 
> keys.openpgp.org --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
> gpg: key AD5F960A256A04AF: new key but contains no user ID - skipped
> gpg: Total number processed: 1
> gpg:   w/o user IDs: 1
> root@3e89a8d05378:/# apt-get update
> ...
> Err:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease
>   The following signatures couldn't be verified because the public key is not 
> available: NO_PUBKEY FCAE2A0E115C3D8A
> ...
> W: GPG error: http://cloud.r-project.org/bin/linux/debian buster-cran35/ 
> InRelease: The following signatures couldn't be verified because the public 
> key is not available: NO_PUBKEY FCAE2A0E115C3D8A
> E: The repository 'http://cloud.r-project.org/bin/linux/debian buster-cran35/ 
> InRelease' is not signed.
> N: Updating from such a repository can't be done securely, and is therefore 
> disabled by default.
> N: See apt-secure(8) manpage for repository creation and user configuration 
> details.
> {code}
> `keyserver.ubuntu.com` is a recommended backup server in CRAN document.
> - http://cloud.r-project.org/bin/linux/debian/
> {code}
> $ docker run -it --rm openjdk:11 /bin/bash
> root@c9b183e45ffe:/# echo "deb http://cloud.r-project.org/bin/linux/debian 
> buster-cran35/" >> /etc/apt/sources.list
> root@c9b183e45ffe:/# apt-key adv --keyserver keyserver.ubuntu.com --recv-key 
> 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
> Executing: /tmp/apt-key-gpghome.P6cxYkOge7/gpg.1.sh --keyserver 
> keyserver.ubuntu.com --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
> gpg: key AD5F960A256A04AF: public key "Johannes Ranke (Wissenschaftlicher 
> Berater) " imported
> gpg: Total number processed: 1
> gpg:   imported: 1
> root@c9b183e45ffe:/# apt-get update
> Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
> Get:2 http://security.debian.org/debian-security buster/updates InRelease 
> [65.4 kB]
> Get:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease 
> [4375 B]
> Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
> Get:5 http://cloud.r-project.org/bin/linux/debian buster-cran35/ Packages 
> [53.3 kB]
> Get:6 http://security.debian.org/debian-security buster/updates/main arm64 
> Packages [287 kB]
> Get:7 http://deb.debian.org/debian buster/main arm64 Packages [7735 kB]
> Get:8 http://deb.debian.org/debian buster-updates/main arm64 Packages [14.5 
> kB]
> Fetched 8334 kB in 2s (4537 kB/s)
> Reading package lists... Done
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36357) Support pushdown Timestamp with local time zone for orc

2021-11-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36357.
-
Fix Version/s: 3.3.0
 Assignee: jiaan.geng
   Resolution: Fixed

> Support pushdown Timestamp with local time zone for orc
> ---
>
> Key: SPARK-36357
> URL: https://issues.apache.org/jira/browse/SPARK-36357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Now that ORC datasources support timestampNTZ, it's great to be able to push 
> down filters with timestampNTZ values.
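
For illustration, a minimal Scala sketch of what this enables (hypothetical path and column name; it assumes a spark-shell style spark session and that the string literal is implicitly cast to the TIMESTAMP_NTZ column type):

{code}
// Sketch only: with pushdown support, a comparison on a TIMESTAMP_NTZ column
// read from ORC should appear under PushedFilters in the scan node instead of
// being evaluated only after the rows are read.
import org.apache.spark.sql.functions.col

val df = spark.read.orc("/data/events_ntz")            // hypothetical path
  .filter(col("event_time") > "2021-11-22 00:00:00")   // event_time: TIMESTAMP_NTZ
df.explain()  // check the ORC scan's PushedFilters for the event_time predicate
{code}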



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37382) `with as` clause got inconsistent results

2021-11-22 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447280#comment-17447280
 ] 

caican commented on SPARK-37382:


[~victor-wong] Do the images display normally now?

> `with as` clause got inconsistent results
> -
>
> Key: SPARK-37382
> URL: https://issues.apache.org/jira/browse/SPARK-37382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
> Attachments: spark2.3.png, spark3.1.png
>
>
> In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
> and returns different results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !spark3.1.png!
> But in Spark 2.3, it returns consistent results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !spark2.3.png!
> Why does Spark3.1.2 return different results?
> Has anyone encountered this problem?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results

2021-11-22 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-37382:
---
Description: 
In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
and returns different results

`

with tab as (
 select 'Withas' as name, rand() as rand_number
)
select name, rand_number
from tab
union all
select name, rand_number
from tab

`

!spark3.1.png!

But in Spark 2.3, it returns consistent results

`

with tab as (
 select 'Withas' as name, rand() as rand_number
)
select name, rand_number
from tab
union all
select name, rand_number
from tab

`

!spark2.3.png!

Why does Spark3.1.2 return different results?

Has anyone encountered this problem?
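
One possible workaround, sketched below in Scala (it assumes caching the CTE result is acceptable for the data size, and uses a hypothetical view name): evaluate the non-deterministic part once, cache it, and reference the cached view on both sides of the union so rand() is not recomputed per reference.

{code}
// Workaround sketch only, not a fix for the underlying behaviour change:
// the cached result is materialized once, so both branches of the union see
// the same rand_number value.
val tab = spark.sql("select 'Withas' as name, rand() as rand_number").cache()
tab.createOrReplaceTempView("tab_cached")

spark.sql("""
  select name, rand_number from tab_cached
  union all
  select name, rand_number from tab_cached
""").show()
{code}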

  was:
In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
and returns different results

`

with tab as (
 select 'Withas' as name, rand() as rand_number
)
select name, rand_number
from tab
union all
select name, rand_number
from tab

`

!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!

But in Spark 2.3, it returns consistent results

`

with tab as (
 select 'Withas' as name, rand() as rand_number
)
select name, rand_number
from tab
union all
select name, rand_number
from tab

`

!https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!

Why does Spark3.1.2 return different results?

Has anyone encountered this problem?


> `with as` clause got inconsistent results
> -
>
> Key: SPARK-37382
> URL: https://issues.apache.org/jira/browse/SPARK-37382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
> Attachments: spark2.3.png, spark3.1.png
>
>
> In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
> and returns different results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !spark3.1.png!
> But in Spark 2.3, it returns consistent results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !spark2.3.png!
> Why does Spark3.1.2 return different results?
> Has anyone encountered this problem?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results

2021-11-22 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-37382:
---
Attachment: spark2.3.png

> `with as` clause got inconsistent results
> -
>
> Key: SPARK-37382
> URL: https://issues.apache.org/jira/browse/SPARK-37382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
> Attachments: spark2.3.png, spark3.1.png
>
>
> In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
> and returns different results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
> But in Spark 2.3, it returns consistent results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
> Why does Spark3.1.2 return different results?
> Has anyone encountered this problem?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37382) `with as` clause got inconsistent results

2021-11-22 Thread caican (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caican updated SPARK-37382:
---
Attachment: spark3.1.png

> `with as` clause got inconsistent results
> -
>
> Key: SPARK-37382
> URL: https://issues.apache.org/jira/browse/SPARK-37382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
> Attachments: spark3.1.png
>
>
> In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
> and returns different results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
> But in Spark 2.3, it returns consistent results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
> Why does Spark3.1.2 return different results?
> Has anyone encountered this problem?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37382) `with as` clause got inconsistent results

2021-11-22 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447277#comment-17447277
 ] 

caican commented on SPARK-37382:


[~zhenw] Thank you for your reply, I will test it out.

> `with as` clause got inconsistent results
> -
>
> Key: SPARK-37382
> URL: https://issues.apache.org/jira/browse/SPARK-37382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: caican
>Priority: Major
>
> In Spark 3.1, the `with as` clause in the same SQL is executed multiple times 
> and returns different results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_bcf6f867-6aee-4afe-bc43-30bf4f2dbdel?message_id=7032102765711097965!
> But in Spark 2.3, it returns consistent results
> `
> with tab as (
>  select 'Withas' as name, rand() as rand_number
> )
> select name, rand_number
> from tab
> union all
> select name, rand_number
> from tab
> `
> !https://internal-api-lark-file.f.mioffice.cn/api/image/keys/img_6dc6e44b-d4a5-4b0d-bd2c-00859ec80a1l?message_id=7032104202756751468!
> Why does Spark3.1.2 return different results?
> Has anyone encountered this problem?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37438) ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37438:


Assignee: Apache Spark  (was: Gengliang Wang)

> ANSI mode: Use store assignment rules for resolving function invocation
> ---
>
> Key: SPARK-37438
> URL: https://issues.apache.org/jira/browse/SPARK-37438
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Under ANSI mode(spark.sql.ansi.enabled=true), the function invocation of 
> Spark SQL:
> - In general, it follows the `Store assignment` rules as storing the input 
> values as the declared parameter type of the SQL functions
> - Special rules apply for string literals and untyped NULL. A NULL can be 
> promoted to any other type, while a string literal can be promoted to any 
> simple data type.
> {code:sql}
> > SET spark.sql.ansi.enabled=true;
> -- implicitly cast Int to String type
> > SELECT concat('total number: ', 1);
> total number: 1
> -- implicitly cast Timestamp to Date type
> > select datediff(now(), current_date);
> 0
> -- specialrule: implicitly cast String literal to Double type
> > SELECT ceil('0.1');
> 1
> -- specialrule: implicitly cast NULL to Date type
> > SELECT year(null);
> NULL
> > CREATE TABLE t(s string);
> -- Can't store String column as Numeric types.
> > SELECT ceil(s) from t;
> Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
> type mismatch
> -- Can't store String column as Date type.
> > select year(s) from t;
> Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
> type mismatch
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37438) ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447267#comment-17447267
 ] 

Apache Spark commented on SPARK-37438:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34681

> ANSI mode: Use store assignment rules for resolving function invocation
> ---
>
> Key: SPARK-37438
> URL: https://issues.apache.org/jira/browse/SPARK-37438
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Under ANSI mode(spark.sql.ansi.enabled=true), the function invocation of 
> Spark SQL:
> - In general, it follows the `Store assignment` rules as storing the input 
> values as the declared parameter type of the SQL functions
> - Special rules apply for string literals and untyped NULL. A NULL can be 
> promoted to any other type, while a string literal can be promoted to any 
> simple data type.
> {code:sql}
> > SET spark.sql.ansi.enabled=true;
> -- implicitly cast Int to String type
> > SELECT concat('total number: ', 1);
> total number: 1
> -- implicitly cast Timestamp to Date type
> > select datediff(now(), current_date);
> 0
> -- specialrule: implicitly cast String literal to Double type
> > SELECT ceil('0.1');
> 1
> -- specialrule: implicitly cast NULL to Date type
> > SELECT year(null);
> NULL
> > CREATE TABLE t(s string);
> -- Can't store String column as Numeric types.
> > SELECT ceil(s) from t;
> Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
> type mismatch
> -- Can't store String column as Date type.
> > select year(s) from t;
> Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
> type mismatch
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37438) ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37438:


Assignee: Gengliang Wang  (was: Apache Spark)

> ANSI mode: Use store assignment rules for resolving function invocation
> ---
>
> Key: SPARK-37438
> URL: https://issues.apache.org/jira/browse/SPARK-37438
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Under ANSI mode(spark.sql.ansi.enabled=true), the function invocation of 
> Spark SQL:
> - In general, it follows the `Store assignment` rules as storing the input 
> values as the declared parameter type of the SQL functions
> - Special rules apply for string literals and untyped NULL. A NULL can be 
> promoted to any other type, while a string literal can be promoted to any 
> simple data type.
> {code:sql}
> > SET spark.sql.ansi.enabled=true;
> -- implicitly cast Int to String type
> > SELECT concat('total number: ', 1);
> total number: 1
> -- implicitly cast Timestamp to Date type
> > select datediff(now(), current_date);
> 0
> -- specialrule: implicitly cast String literal to Double type
> > SELECT ceil('0.1');
> 1
> -- specialrule: implicitly cast NULL to Date type
> > SELECT year(null);
> NULL
> > CREATE TABLE t(s string);
> -- Can't store String column as Numeric types.
> > SELECT ceil(s) from t;
> Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
> type mismatch
> -- Can't store String column as Date type.
> > select year(s) from t;
> Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
> type mismatch
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37438) ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-37438:
---
Description: 
Under ANSI mode (spark.sql.ansi.enabled=true), function invocation in Spark SQL 
works as follows:
- In general, it follows the `Store assignment` rules, storing the input values 
as the declared parameter types of the SQL functions.
- Special rules apply to string literals and untyped NULL: a NULL can be 
promoted to any other type, while a string literal can be promoted to any 
simple data type.


{code:sql}
> SET spark.sql.ansi.enabled=true;
-- implicitly cast Int to String type
> SELECT concat('total number: ', 1);
total number: 1
-- implicitly cast Timestamp to Date type
> select datediff(now(), current_date);
0

-- specialrule: implicitly cast String literal to Double type
> SELECT ceil('0.1');
1
-- specialrule: implicitly cast NULL to Date type
> SELECT year(null);
NULL

> CREATE TABLE t(s string);
-- Can't store String column as Numeric types.
> SELECT ceil(s) from t;
Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
type mismatch
-- Can't store String column as Date type.
> select year(s) from t;
Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
type mismatch
{code}
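
The SQL above can also be driven programmatically; a minimal Scala sketch (assuming a spark-shell style spark session):

{code}
// Sketch: the same checks driven from code by toggling the ANSI flag.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT concat('total number: ', 1)").show()  // Int stored as the String parameter type
spark.sql("SELECT ceil('0.1')").show()                  // string literal promoted to Double
{code}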



  was:
Under ANSI mode(spark.sql.ansi.enabled=true), the function invocation of Spark 
SQL:
- In general, it follows the `Store assignment` rules as storing the input 
values as the declared parameter type of the SQL functions
- Special rules apply for string literals and untyped NULL. A NULL can be 
promoted to any other type, while a string literal can be promoted to any 
simple data type.

```sql
> SET spark.sql.ansi.enabled=true;
-- implicitly cast Int to String type
> SELECT concat('total number: ', 1);
total number: 1
-- implicitly cast Timestamp to Date type
> select datediff(now(), current_date);
0

-- specialrule: implicitly cast String literal to Double type
> SELECT ceil('0.1');
1
-- specialrule: implicitly cast NULL to Date type
> SELECT year(null);
NULL

> CREATE TABLE t(s string);
-- Can't store assign String column as Numeric types.
> SELECT ceil(s) from t;
Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
type mismatch
-- Can't store assign String column as Date type.
> select year(s) from t;
Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
type mismatch
```


> ANSI mode: Use store assignment rules for resolving function invocation
> ---
>
> Key: SPARK-37438
> URL: https://issues.apache.org/jira/browse/SPARK-37438
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Under ANSI mode(spark.sql.ansi.enabled=true), the function invocation of 
> Spark SQL:
> - In general, it follows the `Store assignment` rules as storing the input 
> values as the declared parameter type of the SQL functions
> - Special rules apply for string literals and untyped NULL. A NULL can be 
> promoted to any other type, while a string literal can be promoted to any 
> simple data type.
> {code:sql}
> > SET spark.sql.ansi.enabled=true;
> -- implicitly cast Int to String type
> > SELECT concat('total number: ', 1);
> total number: 1
> -- implicitly cast Timestamp to Date type
> > select datediff(now(), current_date);
> 0
> -- specialrule: implicitly cast String literal to Double type
> > SELECT ceil('0.1');
> 1
> -- specialrule: implicitly cast NULL to Date type
> > SELECT year(null);
> NULL
> > CREATE TABLE t(s string);
> -- Can't store String column as Numeric types.
> > SELECT ceil(s) from t;
> Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
> type mismatch
> -- Can't store String column as Date type.
> > select year(s) from t;
> Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
> type mismatch
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37438) ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-37438:
--

 Summary: ANSI mode: Use store assignment rules for resolving 
function invocation
 Key: SPARK-37438
 URL: https://issues.apache.org/jira/browse/SPARK-37438
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Under ANSI mode (spark.sql.ansi.enabled=true), function invocation in Spark SQL 
works as follows:
- In general, it follows the `Store assignment` rules, storing the input values 
as the declared parameter types of the SQL functions.
- Special rules apply to string literals and untyped NULL: a NULL can be 
promoted to any other type, while a string literal can be promoted to any 
simple data type.

```sql
> SET spark.sql.ansi.enabled=true;
-- implicitly cast Int to String type
> SELECT concat('total number: ', 1);
total number: 1
-- implicitly cast Timestamp to Date type
> select datediff(now(), current_date);
0

-- specialrule: implicitly cast String literal to Double type
> SELECT ceil('0.1');
1
-- specialrule: implicitly cast NULL to Date type
> SELECT year(null);
NULL

> CREATE TABLE t(s string);
-- Can't store assign String column as Numeric types.
> SELECT ceil(s) from t;
Error in query: cannot resolve 'CEIL(spark_catalog.default.t.s)' due to data 
type mismatch
-- Can't store assign String column as Date type.
> select year(s) from t;
Error in query: cannot resolve 'year(spark_catalog.default.t.s)' due to data 
type mismatch
```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37388) WidthBucket throws NullPointerException in WholeStageCodegenExec

2021-11-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37388.
-
Fix Version/s: 3.1.3
   3.2.1
   3.3.0
 Assignee: Tom van Bussel  (was: Apache Spark)
   Resolution: Fixed

> WidthBucket throws NullPointerException in WholeStageCodegenExec
> 
>
> Key: SPARK-37388
> URL: https://issues.apache.org/jira/browse/SPARK-37388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> Repro: Disable ConstantFolding and run
> {code:java}
> SELECT width_bucket(3.5, 3.0, 3.0, 888) {code}
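
For completeness, a Scala sketch of the repro steps (assuming a spark-shell style spark session, and that ConstantFolding can be excluded via spark.sql.optimizer.excludedRules):

{code}
// Excluding ConstantFolding keeps the literal-only width_bucket call from
// being folded away at optimization time, so it reaches whole-stage codegen
// and triggers the NullPointerException described above.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")

spark.sql("SELECT width_bucket(3.5, 3.0, 3.0, 888)").show()
{code}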



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org