[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269964#comment-17269964
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31290

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  
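For reference, a minimal sketch of the mapping the description proposes, assuming the standard JdbcDialect.getCatalystType hook; the -157/-158 codes come from the description, while the object name and the choice of BinaryType as the Catalyst-side equivalent of VARBINARY are illustrative:

{code:java}
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{BinaryType, DataType, MetadataBuilder}

// Illustrative only: translate the MS SQL spatial type codes
// (geometry = -157, geography = -158) into Spark's BinaryType so that
// JdbcUtils.getCatalystType no longer throws "Unrecognized SQL type".
object MsSqlSpatialMappingSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase(java.util.Locale.ROOT).startsWith("jdbc:sqlserver")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case -157 | -158 => Some(BinaryType) // geometry / geography
      case _ => None // defer to the default mappings
    }
}
{code}

In practice such a mapping would live inside the built-in MsSqlServerDialect rather than a separately registered dialect; the sketch only shows the type-code translation.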






[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269963#comment-17269963
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31290

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  






[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269962#comment-17269962
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31289

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  






[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269961#comment-17269961
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31289

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  






[jira] [Resolved] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33933.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31269
[https://github.com/apache/spark/pull/31269]

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Yu Zhong
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as shown below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works, but 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, 
> getFinalPhysicalPlan traverses the physical plan bottom-up, creates query 
> stages for the materialized parts via createQueryStages, and materializes the 
> newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job 
> and the broadcast job are submitted at almost the same time, but the map job 
> holds all the computing resources. If the map job runs slowly (when there is a 
> lot of data to process and resources are limited), the broadcast job cannot be 
> started (and finished) before spark.sql.broadcastTimeout, which causes the 
> whole job to fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  
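For reference, a hedged sketch of the workaround mentioned in the description (raising spark.sql.broadcastTimeout, or skipping broadcast joins entirely); the session settings and the 1200-second value are illustrative:

{code:java}
import org.apache.spark.sql.SparkSession

// Keep AQE enabled but give the broadcast job more time than the 300-second
// default, or fall back to sort-merge joins by disabling auto-broadcast.
// The values here are illustrative.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("broadcast-timeout-workaround")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.broadcastTimeout", "1200")
  // .config("spark.sql.autoBroadcastJoinThreshold", "-1")  // alternative: no broadcast joins
  .getOrCreate()
{code}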






[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269834#comment-17269834
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31288

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  






[jira] [Commented] (SPARK-34200) ambiguous column reference should consider attribute availability

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269802#comment-17269802
 ] 

Apache Spark commented on SPARK-34200:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31287

> ambiguous column reference should consider attribute availability
> -
>
> Key: SPARK-34200
> URL: https://issues.apache.org/jira/browse/SPARK-34200
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-34200) ambiguous column reference should consider attribute availability

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269800#comment-17269800
 ] 

Apache Spark commented on SPARK-34200:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31287

> ambiguous column reference should consider attribute availability
> -
>
> Key: SPARK-34200
> URL: https://issues.apache.org/jira/browse/SPARK-34200
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-34200) ambiguous column reference should consider attribute availability

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34200:


Assignee: Apache Spark  (was: Wenchen Fan)

> ambiguous column reference should consider attribute availability
> -
>
> Key: SPARK-34200
> URL: https://issues.apache.org/jira/browse/SPARK-34200
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-34200) ambiguous column reference should consider attribute availability

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34200:


Assignee: Wenchen Fan  (was: Apache Spark)

> ambiguous column reference should consider attribute availability
> -
>
> Key: SPARK-34200
> URL: https://issues.apache.org/jira/browse/SPARK-34200
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-34200) ambiguous column reference should consider attribute availability

2021-01-21 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-34200:
---

 Summary: ambiguous column reference should consider attribute 
availability
 Key: SPARK-34200
 URL: https://issues.apache.org/jira/browse/SPARK-34200
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Resolved] (SPARK-33245) Add built-in UDF - GETBIT

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33245.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

> Add built-in UDF - GETBIT 
> --
>
> Key: SPARK-33245
> URL: https://issues.apache.org/jira/browse/SPARK-33245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> Teradata, Impala, Snowflake and Yellowbrick support this function:
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w
> https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit
> https://docs.snowflake.com/en/sql-reference/functions/getbit.html
> https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html
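For context, a hedged sketch of the semantics those engines document, assuming Spark's getbit(value, position) follows the same convention of a 0-based position counted from the least-significant bit:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("getbit-sketch").getOrCreate()

// 11 is 1011 in binary, so bit 0 = 1, bit 2 = 0, bit 3 = 1.
spark.sql("SELECT getbit(11, 0), getbit(11, 2), getbit(11, 3)").show()
// Expected output: 1, 0, 1
{code}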






[jira] [Reopened] (SPARK-33245) Add built-in UDF - GETBIT

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-33245:
-

> Add built-in UDF - GETBIT 
> --
>
> Key: SPARK-33245
> URL: https://issues.apache.org/jira/browse/SPARK-33245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
>
> Teradata, Impala, Snowflake and Yellowbrick support this function:
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w
> https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit
> https://docs.snowflake.com/en/sql-reference/functions/getbit.html
> https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html






[jira] [Commented] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269791#comment-17269791
 ] 

Apache Spark commented on SPARK-34199:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/31286

> Block `count(table.*)` to follow ANSI standard and other SQL engines
> 
>
> Key: SPARK-34199
> URL: https://issues.apache.org/jira/browse/SPARK-34199
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> In Spark, count(table.*) may produce a very weird result, for example:
> select count(*) from (select 1 as a, null as b) t;
> output: 1
> select count(t.*) from (select 1 as a, null as b) t;
> output: 0
>  
> After checking the ANSI standard, count(*) is always treated as count(1), 
> while count(t.*) is not allowed. Moreover, this is also not allowed by 
> common databases, e.g. MySQL and Oracle.
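To spell out why the second query returns 0: star expansion turns count(t.*) into a multi-argument count, and a multi-argument count only counts rows where every argument is non-null. A hedged sketch of the equivalent queries, reusing the columns from the example above:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("count-star-sketch").getOrCreate()

// count(*) counts rows regardless of nulls -> 1
spark.sql("SELECT count(*) FROM (SELECT 1 AS a, NULL AS b) t").show()

// count(t.*) is expanded to count(a, b); a multi-argument count skips rows
// where any argument is NULL -> 0
spark.sql("SELECT count(t.a, t.b) FROM (SELECT 1 AS a, NULL AS b) t").show()
{code}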






[jira] [Assigned] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34199:


Assignee: Apache Spark

> Block `count(table.*)` to follow ANSI standard and other SQL engines
> 
>
> Key: SPARK-34199
> URL: https://issues.apache.org/jira/browse/SPARK-34199
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> In Spark, count(table.*) may produce a very weird result, for example:
> select count(*) from (select 1 as a, null as b) t;
> output: 1
> select count(t.*) from (select 1 as a, null as b) t;
> output: 0
>  
> After checking the ANSI standard, count(*) is always treated as count(1), 
> while count(t.*) is not allowed. Moreover, this is also not allowed by 
> common databases, e.g. MySQL and Oracle.






[jira] [Assigned] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34199:


Assignee: (was: Apache Spark)

> Block `count(table.*)` to follow ANSI standard and other SQL engines
> 
>
> Key: SPARK-34199
> URL: https://issues.apache.org/jira/browse/SPARK-34199
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> In Spark, count(table.*) may produce a very weird result, for example:
> select count(*) from (select 1 as a, null as b) t;
> output: 1
> select count(t.*) from (select 1 as a, null as b) t;
> output: 0
>  
> After checking the ANSI standard, count(*) is always treated as count(1), 
> while count(t.*) is not allowed. Moreover, this is also not allowed by 
> common databases, e.g. MySQL and Oracle.






[jira] [Assigned] (SPARK-33245) Add built-in UDF - GETBIT

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33245:
---

Assignee: jiaan.geng

> Add built-in UDF - GETBIT 
> --
>
> Key: SPARK-33245
> URL: https://issues.apache.org/jira/browse/SPARK-33245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
>
> Teradata, Impala, Snowflake and Yellowbrick support this function:
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w
> https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit
> https://docs.snowflake.com/en/sql-reference/functions/getbit.html
> https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html






[jira] [Assigned] (SPARK-33541) Group exception messages in catalyst/expressions

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33541:
---

Assignee: jiaan.geng

> Group exception messages in catalyst/expressions
> 
>
> Key: SPARK-33541
> URL: https://issues.apache.org/jira/browse/SPARK-33541
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: jiaan.geng
>Priority: Major
>
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions'
> || Filename  ||   Count ||
> | Cast.scala|  18 |
> | ExprUtils.scala   |   2 |
> | Expression.scala  |   8 |
> | InterpretedUnsafeProjection.scala |   1 |
> | ScalaUDF.scala|   2 |
> | SelectedField.scala   |   3 |
> | SubExprEvaluationRuntime.scala|   1 |
> | arithmetic.scala  |   8 |
> | collectionOperations.scala|   4 |
> | complexTypeExtractors.scala   |   3 |
> | csvExpressions.scala  |   3 |
> | datetimeExpressions.scala |   4 |
> | decimalExpressions.scala  |   2 |
> | generators.scala  |   2 |
> | higherOrderFunctions.scala|   6 |
> | jsonExpressions.scala |   2 |
> | literals.scala|   3 |
> | misc.scala|   2 |
> | namedExpressions.scala|   1 |
> | ordering.scala|   1 |
> | package.scala |   1 |
> | regexpExpressions.scala   |   1 |
> | stringExpressions.scala   |   1 |
> | windowExpressions.scala   |   5 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate'
> || Filename||   Count ||
> | ApproximatePercentile.scala |   2 |
> | HyperLogLogPlusPlus.scala   |   1 |
> | Percentile.scala|   1 |
> | interfaces.scala|   2 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen'
> || Filename||   Count ||
> | CodeGenerator.scala |   5 |
> | javaCode.scala  |   1 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects'
> || Filename  ||   Count ||
> | objects.scala |  12 |






[jira] [Resolved] (SPARK-33541) Group exception messages in catalyst/expressions

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33541.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31228
[https://github.com/apache/spark/pull/31228]

> Group exception messages in catalyst/expressions
> 
>
> Key: SPARK-33541
> URL: https://issues.apache.org/jira/browse/SPARK-33541
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions'
> || Filename  ||   Count ||
> | Cast.scala|  18 |
> | ExprUtils.scala   |   2 |
> | Expression.scala  |   8 |
> | InterpretedUnsafeProjection.scala |   1 |
> | ScalaUDF.scala|   2 |
> | SelectedField.scala   |   3 |
> | SubExprEvaluationRuntime.scala|   1 |
> | arithmetic.scala  |   8 |
> | collectionOperations.scala|   4 |
> | complexTypeExtractors.scala   |   3 |
> | csvExpressions.scala  |   3 |
> | datetimeExpressions.scala |   4 |
> | decimalExpressions.scala  |   2 |
> | generators.scala  |   2 |
> | higherOrderFunctions.scala|   6 |
> | jsonExpressions.scala |   2 |
> | literals.scala|   3 |
> | misc.scala|   2 |
> | namedExpressions.scala|   1 |
> | ordering.scala|   1 |
> | package.scala |   1 |
> | regexpExpressions.scala   |   1 |
> | stringExpressions.scala   |   1 |
> | windowExpressions.scala   |   5 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate'
> || Filename||   Count ||
> | ApproximatePercentile.scala |   2 |
> | HyperLogLogPlusPlus.scala   |   1 |
> | Percentile.scala|   1 |
> | interfaces.scala|   2 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen'
> || Filename||   Count ||
> | CodeGenerator.scala |   5 |
> | javaCode.scala  |   1 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects'
> || Filename  ||   Count ||
> | objects.scala |  12 |






[jira] [Resolved] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33813.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31283
[https://github.com/apache/spark/pull/31283]

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  






[jira] [Assigned] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33813:
---

Assignee: Kousuke Saruta

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Kousuke Saruta
>Priority: Major
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  






[jira] [Resolved] (SPARK-34180) Fix the regression brought by SPARK-33888 for PostgresDialect

2021-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34180.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31262
[https://github.com/apache/spark/pull/31262]

> Fix the regression brought by SPARK-33888 for PostgresDialect
> -
>
> Key: SPARK-34180
> URL: https://issues.apache.org/jira/browse/SPARK-34180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 3.2.0
>
>
> A regression introduced by SPARK-33888 affects PostgresDialect, causing 
> `PostgreSQLIntegrationSuite` to fail.






[jira] [Created] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines

2021-01-21 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-34199:
---

 Summary: Block `count(table.*)` to follow ANSI standard and other 
SQL engines
 Key: SPARK-34199
 URL: https://issues.apache.org/jira/browse/SPARK-34199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Linhong Liu


In Spark, count(table.*) may produce a very weird result, for example:

select count(*) from (select 1 as a, null as b) t;

output: 1

select count(t.*) from (select 1 as a, null as b) t;

output: 0

After checking the ANSI standard, count(*) is always treated as count(1), while 
count(t.*) is not allowed. Moreover, this is also not allowed by common 
databases, e.g. MySQL and Oracle.






[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269744#comment-17269744
 ] 

Apache Spark commented on SPARK-33489:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/31285

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got the error below when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed that NullType exists under pyspark.sql.types, so it seems possible to 
> convert from the pyarrow null type to the PySpark null type and vice versa.






[jira] [Updated] (SPARK-34148) Move general StateStore tests to StateStoreSuiteBase

2021-01-21 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34148:

Parent: SPARK-34198
Issue Type: Sub-task  (was: Test)

> Move general StateStore tests to StateStoreSuiteBase
> 
>
> Key: SPARK-34148
> URL: https://issues.apache.org/jira/browse/SPARK-34148
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> There are some general StateStore tests in StateStoreSuite, which is an 
> HDFSBackedStateStoreProvider-specific test suite. We should move the general 
> tests into StateStoreSuiteBase.






[jira] [Resolved] (SPARK-34148) Move general StateStore tests to StateStoreSuiteBase

2021-01-21 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34148.
-
Resolution: Resolved

> Move general StateStore tests to StateStoreSuiteBase
> 
>
> Key: SPARK-34148
> URL: https://issues.apache.org/jira/browse/SPARK-34148
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> There are some general StateStore tests in StateStoreSuite, which is an 
> HDFSBackedStateStoreProvider-specific test suite. We should move the general 
> tests into StateStoreSuiteBase.






[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-01-21 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269737#comment-17269737
 ] 

L. C. Hsieh commented on SPARK-34198:
-

cc [~dbtsai][~dongjoon][~hyukjin.kwon]

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently, Spark SS has only one built-in StateStore implementation, 
> HDFSBackedStateStore, which actually uses an in-memory map to store state 
> rows. As there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large state. But Spark SS still 
> lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, 
> our plan is to add this StateStore as an external module first.






[jira] [Updated] (SPARK-34198) Add RocksDB StateStore as external module

2021-01-21 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34198:

Issue Type: New Feature  (was: Bug)

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently, Spark SS has only one built-in StateStore implementation, 
> HDFSBackedStateStore, which actually uses an in-memory map to store state 
> rows. As there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large state. But Spark SS still 
> lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, 
> our plan is to add this StateStore as an external module first.






[jira] [Created] (SPARK-34198) Add RocksDB StateStore as external module

2021-01-21 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-34198:
---

 Summary: Add RocksDB StateStore as external module
 Key: SPARK-34198
 URL: https://issues.apache.org/jira/browse/SPARK-34198
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh


Currently, Spark SS has only one built-in StateStore implementation, 
HDFSBackedStateStore, which actually uses an in-memory map to store state rows. 
As there are more and more streaming applications, some of them require large 
state in stateful operations such as streaming aggregation and join.

Several other major streaming frameworks already use RocksDB for state 
management, so it is a proven choice for large state. But Spark SS still lacks 
a built-in state store for this requirement.

We would like to explore the possibility of adding a RocksDB-based StateStore 
to Spark SS. Given the concern about adding RocksDB as a direct dependency, our 
plan is to add this StateStore as an external module first.
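For reference, a minimal sketch of how an external state store module plugs in, using the existing spark.sql.streaming.stateStore.providerClass setting; the RocksDB provider class name below is hypothetical:

{code:java}
import org.apache.spark.sql.SparkSession

// Structured Streaming selects its state store via this config, so an external
// module only needs to put a StateStoreProvider implementation on the classpath
// and point the config at it. The class name below is hypothetical.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("rocksdb-statestore-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "com.example.state.RocksDBStateStoreProvider")
  .getOrCreate()
{code}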






[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269726#comment-17269726
 ] 

Apache Spark commented on SPARK-34167:
--

User 'razajafri' has created a pull request for this issue:
https://github.com/apache/spark/pull/31284

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file in which decimals with precision < 10 were written 
> in a 64-bit representation, Spark tries to read them as an INT and fails.
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
> parquet file correctly because it's basing the read on the 
> [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150]
>  which is a MessageType and has the underlying data stored correctly as 
> {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
> [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151]
>  which is coming from 

[jira] [Assigned] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34167:


Assignee: (was: Apache Spark)

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file in which decimals with precision < 10 were written 
> in a 64-bit representation, Spark tries to read them as an INT and fails.
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the 
> parquet file correctly because it's basing the read on the 
> [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150]
>  which is a MessageType and has the underlying data stored correctly as 
> {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
> [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151]
>  which is coming from {{org.apache.spark.sql.parquet.row.requested_schema}} 
> that is set by the reader which is a {{StructType}} and only 

[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269725#comment-17269725
 ] 

Apache Spark commented on SPARK-34167:
--

User 'razajafri' has created a pull request for this issue:
https://github.com/apache/spark/pull/31284

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file in which decimals with precision < 10 were 
> written using a 64-bit representation, Spark tries to read them as an INT 
> and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads the 
> parquet file correctly because it bases the read on the 
> [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
> which is a MessageType and has the underlying data stored correctly as 
> {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
> [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
> which comes from 

[jira] [Assigned] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34167:


Assignee: Apache Spark

> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Assignee: Apache Spark
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file in which decimals with precision < 10 were 
> written using a 64-bit representation, Spark tries to read them as an INT 
> and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ...
> {code}
>  
>  
> Here are my findings. The *{{VectorizedParquetRecordReader}}* reads the 
> parquet file correctly because it bases the read on the 
> [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150], 
> which is a MessageType and has the underlying data stored correctly as 
> {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the 
> [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151], 
> which comes from {{org.apache.spark.sql.parquet.row.requested_schema}}, 
> set by the reader, which is a 

[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public

2021-01-21 Thread Patrick Grandjean (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269707#comment-17269707
 ] 

Patrick Grandjean edited comment on SPARK-7768 at 1/21/21, 11:44 PM:
-

Sorry, but it is frustrating to see this ticket getting postponed again. I 
started to use frameless, only to discover it is not compatible with the Spark 
fork used in Databricks :/ Anyone tried quill-spark on Databricks?

Why keep UDTRegistration private?


was (Author: pgrandjean):
Sorry, but it is frustrating to see this ticket getting postponed again. I 
started to use frameless, only to discover it is not compatible with the Spark 
fork used in Databricks :/ Anyone tried quill-spark?

Why keep UDTRegistration private?

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2021-01-21 Thread Patrick Grandjean (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269707#comment-17269707
 ] 

Patrick Grandjean commented on SPARK-7768:
--

Sorry, but it is frustrating to see this ticket getting postponed again. I 
started to use frameless, only to discover it is not compatible with the Spark 
fork used in Databricks :/ Anyone tried quill-spark?

Why keep UDTRegistration private?

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34195) Base value for parsing two-digit year should be made configurable

2021-01-21 Thread Anthony Smart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Smart updated SPARK-34195:
--
Description: 
The base value for parsing a two-digit year date string is set to 2000 within 
Spark. If we try to parse "10-JAN-97", it will be interpreted as 2097 instead 
of 1997.

I'm not clear why this base value was changed in Spark, given that the 
standard Python datetime module uses a more sensible cut-off of 69 for 
determining which century to apply.

Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]

Other libraries, e.g. .NET Core, use 29 as the boundary cut-off, but a base 
value of 2000 is of little practical use: most dates encountered in the real 
world span both centuries. I therefore propose that this behaviour be reverted 
to match the standard Python datetime module, and/or that the base value be 
exposed as an option on the various date functions. This would ensure 
consistent behaviour across both Python and PySpark.

 

Python:
{code:python}
import datetime

datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}

PySpark:
{code:python}
import pyspark.sql.functions as F

df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate",
              F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"),
                              "dd-MM-")).collect()
# Out[117]: [Row(dt='10-JAN-69', newdate='10-01-2069')]
{code}

As a workaround I had to write my own solution. The code below is specific to 
my data pipeline, but it shows what was needed just to change the boundary 
cut-off and handle two-digit years better.
{code:python}
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    # Two-digit years >= "40" are mapped to 19xx, the rest to 20xx.
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"),
                                            F.lit("19"), yy)).otherwise(
                                F.concat(dd, F.lit("-"), MMM, F.lit("-"),
                                         F.lit("20"), yy)))
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
  

  was:
The base value is set as 2000 within spark for parsing a two-digit year date 
string. If we try to parse "10-JAN-97" then this will be interpreted as 2097 
instead of 1997.

I'm unclear as to why this base value has been changed within spark as the 
standard python datetime module instead uses a more sensible value of 69 as the 
boundary cut-off for determining the correct century to apply.

Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]

Other libraries e.g. .NET Core will use 29 as the boundary cut-off. But a base 
value of 2000 is rather non-functional indeed. Most dates encountered in the 
real world will pertain to both centuries and therefore I propose this 
functionality is reverted to match the existing python datetime module and / or 
allow the base value to be set as an option to the various date functions. This 
would ensure there's consistent behaviour across both python and pyspark.

 

Python:

{{import datetime}}
 {{datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()}}
 Out[118]: datetime.date(1969, 1, 10)
  
 {{Pyspark:}}{{import spark.sql.functions as F}}
 {{df = spark.createDataFrame([('10-JAN-69',)], ['dt'])}}
 {{df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", 
"dd-MMM-yy"), "dd-MM-")).collect()}}
  
 Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
  
  
  
 As a work around I had to write my own solution to deal with this. The code 
below is specific to my data pipeline but you get the idea of the issue I had 
to deal with just to change the boundary cut-off to better handle two-digit 
years.
  
 {{from pyspark.sql.functions import to_date, col, trim}}{{def 
convert_dtypes(entity, schema, boundary="40"):}}
 \{{ cols = []}}
 {{ for x in schema[entity]:}}
 \{{ for c in std_df.columns:}}
 {{ if x['name'] == c:}}
 {{ if x['dtype'] == 'date':}}
 \{{ dd = F.substring(c, 1, 2)}}
 \{{ MMM = F.substring(c, 4, 3)}}
 \{{ yy = F.substring(c, 8, 2)}}
 \{{ n = (}}
 \{{ F.when(trim(col(c)) == "", None).otherwise(}}
 \{{ F.when(yy >= ("40"), }}
 {{ F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)).otherwise(}}
 {{ F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)))}}
 \{{ )}}
 \{{ cols.append(to_date(n, 'dd-MMM-').alias(c))}}
 \{{ else:}}
 {{ cols.append(col(c).cast(x['dtype']))}}
 {{ #cols[-1].nullable = x['nullable']}}
 \{{ return 

[jira] [Commented] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269617#comment-17269617
 ] 

Apache Spark commented on SPARK-34197:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31265

> refreshTable() should not invalidate the relation cache for temporary views
> ---
>
> Key: SPARK-34197
> URL: https://issues.apache.org/jira/browse/SPARK-34197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The SessionCatalog.refreshTable() should not invalidate the entry in the 
> relation cache for a table when refreshTable() refreshes a temp view.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34197:


Assignee: (was: Apache Spark)

> refreshTable() should not invalidate the relation cache for temporary views
> ---
>
> Key: SPARK-34197
> URL: https://issues.apache.org/jira/browse/SPARK-34197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The SessionCatalog.refreshTable() should not invalidate the entry in the 
> relation cache for a table when refreshTable() refreshes a temp view.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269616#comment-17269616
 ] 

Apache Spark commented on SPARK-34197:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31265

> refreshTable() should not invalidate the relation cache for temporary views
> ---
>
> Key: SPARK-34197
> URL: https://issues.apache.org/jira/browse/SPARK-34197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The SessionCatalog.refreshTable() should not invalidate the entry in the 
> relation cache for a table when refreshTable() refreshes a temp view.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34197:


Assignee: Apache Spark

> refreshTable() should not invalidate the relation cache for temporary views
> ---
>
> Key: SPARK-34197
> URL: https://issues.apache.org/jira/browse/SPARK-34197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The SessionCatalog.refreshTable() should not invalidate the entry in the 
> relation cache for a table when refreshTable() refreshes a temp view.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views

2021-01-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34197:
--

 Summary: refreshTable() should not invalidate the relation cache 
for temporary views
 Key: SPARK-34197
 URL: https://issues.apache.org/jira/browse/SPARK-34197
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


The SessionCatalog.refreshTable() should not invalidate the entry in the 
relation cache for a table when refreshTable() refreshes a temp view.
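
As an illustration only, a minimal PySpark sketch of one way to read the 
scenario; the shared table/view name and the local-mode session are 
assumptions of the sketch, not details stated above:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A permanent table; reading it populates the session's relation cache.
spark.sql("CREATE TABLE IF NOT EXISTS t (id INT) USING parquet")
spark.table("t").count()

# A temporary view that happens to share the table's name.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW t AS SELECT 1 AS id")

# This call resolves to the temp view. Per the ticket, it should refresh only
# the view and leave the relation cache entry for the table 'default.t' alone.
spark.catalog.refreshTable("t")
{code}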



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34196) Improve error message when folks try and install in Python 2

2021-01-21 Thread Holden Karau (Jira)
Holden Karau created SPARK-34196:


 Summary: Improve error message when folks try and install in 
Python 2
 Key: SPARK-34196
 URL: https://issues.apache.org/jira/browse/SPARK-34196
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
Reporter: Holden Karau


Current error message:

 
{code:java}
Processing ./pyspark-3.1.1.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /tmp/py3.1/bin/python2 -c 'import sys, setuptools, tokenize; 
sys.argv[0] = '"'"'/tmp/pip-req-build-lmlitE/setup.py'"'"'; 
__file__='"'"'/tmp/pip-req-build-lmlitE/setup.py'"'"';f=getattr(tokenize, 
'"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', 
'"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info 
--egg-base /tmp/pip-pip-egg-info-W1BsIL
         cwd: /tmp/pip-req-build-lmlitE/
    Complete output (6 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-lmlitE/setup.py", line 31
        file=sys.stderr)
            ^
    SyntaxError: invalid syntax
    
ERROR: Command errored out with exit status 1: python setup.py egg_info Check 
the logs for full command output.{code}
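
As a sketch of one possible shape for a clearer failure (the minimum version, 
wording and exit code below are assumptions, not necessarily what the fix will 
do), the guard at the top of setup.py could be kept in syntax that Python 2 can 
still parse, so a message is printed instead of a SyntaxError:
{code:python}
# Hypothetical guard near the top of setup.py, before any Python-3-only code.
from __future__ import print_function
import sys

if sys.version_info < (3, 6):
    print("PySpark requires Python 3.6+; Python %d.%d is no longer supported."
          % sys.version_info[:2], file=sys.stderr)
    sys.exit(-1)
{code}
Declaring {{python_requires='>=3.6'}} in {{setup()}} should additionally let 
recent versions of pip refuse the install before setup.py runs at all.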



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34193) Potential race condition during decommissioning with TorrentBroadcast

2021-01-21 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34193:
-
Issue Type: Bug  (was: Improvement)

> Potential race condition during decommissioning with TorrentBroadcast
> -
>
> Key: SPARK-34193
> URL: https://issues.apache.org/jira/browse/SPARK-34193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Holden Karau
>Priority: Major
>
> I found this while back porting so the line numbers should be ignored, but 
> the core of the issue is that we shouldn't be failing the job on this (I 
> don't think). We could fix this by allowing broadcast blocks to be put or 
> having the torrent broadcast ignore this exception.
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: 
> org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: 
> Block broadcast_2_piece0 cannot be saved on decommissioned executor[info]   
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 
> (TID 8, 192.168.1.57, executor 1): java.io.IOException: 
> org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: 
> Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)[info] at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:215)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)[info]
>  at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)[info] at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)[info] at 
> org.apache.spark.scheduler.Task.run(Task.scala:123)[info] at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:448)[info]
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)[info] 
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)[info] 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[info]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[info]
>  at java.lang.Thread.run(Thread.java:748)[info] Caused by: 
> org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: 
> Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1105)[info] at 
> org.apache.spark.storage.BlockManager.doPutBytes(BlockManager.scala:1010)[info]
>  at 
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:986)[info] 
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:181)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info]
>  at scala.collection.immutable.List.foreach(List.scala:392)[info] at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:159)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:239)[info]
>  at scala.Option.getOrElse(Option.scala:121)[info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:219)[info]
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)[info] ... 
> 13 more[info][info] Driver stacktrace:[info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1928)[info]
>    at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1916)[info]
>    at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1915)[info]
>    at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)[info]
>    at 
> 

[jira] [Updated] (SPARK-34195) Base value for parsing two-digit year should be made configurable

2021-01-21 Thread Anthony Smart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Smart updated SPARK-34195:
--
Description: 
The base value is set as 2000 within spark for parsing a two-digit year date 
string. If we try to parse "10-JAN-97" then this will be interpreted as 2097 
instead of 1997.

I'm unclear as to why this base value has been changed within spark as the 
standard python datetime module instead uses a more sensible value of 69 as the 
boundary cut-off for determining the correct century to apply.

Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]

Other libraries e.g. .NET Core will use 29 as the boundary cut-off. But a base 
value of 2000 is rather non-functional indeed. Most dates encountered in the 
real world will pertain to both centuries and therefore I propose this 
functionality is reverted to match the existing python datetime module and / or 
allow the base value to be set as an option to the various date functions. This 
would ensure there's consistent behaviour across both python and pyspark.

 

Python:

{{import datetime}}
 {{datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()}}
 Out[118]: datetime.date(1969, 1, 10)
  
 {{Pyspark:}}{{import spark.sql.functions as F}}
 {{df = spark.createDataFrame([('10-JAN-69',)], ['dt'])}}
 {{df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", 
"dd-MMM-yy"), "dd-MM-")).collect()}}
  
 Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
  
  
  
 As a work around I had to write my own solution to deal with this. The code 
below is specific to my data pipeline but you get the idea of the issue I had 
to deal with just to change the boundary cut-off to better handle two-digit 
years.
  
 {{from pyspark.sql.functions import to_date, col, trim}}{{def 
convert_dtypes(entity, schema, boundary="40"):}}
 \{{ cols = []}}
 {{ for x in schema[entity]:}}
 \{{ for c in std_df.columns:}}
 {{ if x['name'] == c:}}
 {{ if x['dtype'] == 'date':}}
 \{{ dd = F.substring(c, 1, 2)}}
 \{{ MMM = F.substring(c, 4, 3)}}
 \{{ yy = F.substring(c, 8, 2)}}
 \{{ n = (}}
 \{{ F.when(trim(col(c)) == "", None).otherwise(}}
 \{{ F.when(yy >= ("40"), }}
 {{ F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)).otherwise(}}
 {{ F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)))}}
 \{{ )}}
 \{{ cols.append(to_date(n, 'dd-MMM-').alias(c))}}
 \{{ else:}}
 {{ cols.append(col(c).cast(x['dtype']))}}
 {{ #cols[-1].nullable = x['nullable']}}
 \{{ return std_df.select(*cols)}}
  

  was:
The base value is set as 2000 within spark for parsing a two-digit year date 
string. If we try to parse "10-JAN-97" then this will be interpreted as 2097 
instead of 1997.

I'm unclear as to why this base value has been changed within spark as the 
standard python datetime module instead uses a more sensible value of 69 as the 
boundary cut-off for determining the correct century to apply.

Reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Other libraries e.g. .NET Core will use 29 as the boundary cut-off. But a base 
value of 2000 is rather non-functional indeed. Most dates encountered in the 
real world will pertain to both centuries and therefore I propose this 
functionality is reverted to match the existing python datetime module and / or 
allow the base value to be set as an option to the various date functions. This 
would ensure there's consistent behaviour across both python and pyspark.

 

Python:

{{import datetime}}
{{datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()}}
Out[118]: datetime.date(1969, 1, 10)
 
{{Pyspark:}}{{import spark.sql.functions as F}}
{{df = spark.createDataFrame([('10-JAN-69',)], ['dt'])}}
{{df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), 
"dd-MM-yy")).collect()}}
 
Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
 
 
 
As a work around I had to write my own solution to deal with this. The code 
below is specific to my data pipeline but you get the idea of the issue I had 
to deal with just to change the boundary cut-off to better handle two-digit 
years.
 
{{from pyspark.sql.functions import to_date, col, trim}}{{def 
convert_dtypes(entity, schema, boundary="40"):}}
{{ cols = []}}
{{ for x in schema[entity]:}}
{{ for c in std_df.columns:}}
{{ if x['name'] == c:}}
{{ if x['dtype'] == 'date':}}
{{ dd = F.substring(c, 1, 2)}}
{{ MMM = F.substring(c, 4, 3)}}
{{ yy = F.substring(c, 8, 2)}}
{{ n = (}}
{{ F.when(trim(col(c)) == "", None).otherwise(}}
{{ F.when(yy >= ("40"), }}
{{ F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)).otherwise(}}
{{ F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)))}}
{{ )}}
{{ cols.append(to_date(n, 'dd-MMM-').alias(c))}}
{{ else:}}
{{ cols.append(col(c).cast(x['dtype']))}}
{{ #cols[-1].nullable = x['nullable']}}
{{ return std_df.select(*cols)}}
 


> Base value for parsing two-digit 

[jira] [Created] (SPARK-34195) Base value for parsing two-digit year should be made configurable

2021-01-21 Thread Anthony Smart (Jira)
Anthony Smart created SPARK-34195:
-

 Summary: Base value for parsing two-digit year should be made 
configurable
 Key: SPARK-34195
 URL: https://issues.apache.org/jira/browse/SPARK-34195
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Anthony Smart


The base value for parsing a two-digit year date string is set to 2000 within 
Spark. If we try to parse "10-JAN-97", it will be interpreted as 2097 instead 
of 1997.

I'm not clear why this base value was changed in Spark, given that the 
standard Python datetime module uses a more sensible cut-off of 69 for 
determining which century to apply.

Reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Other libraries, e.g. .NET Core, use 29 as the boundary cut-off, but a base 
value of 2000 is of little practical use: most dates encountered in the real 
world span both centuries. I therefore propose that this behaviour be reverted 
to match the standard Python datetime module, and/or that the base value be 
exposed as an option on the various date functions. This would ensure 
consistent behaviour across both Python and PySpark.

 

Python:
{code:python}
import datetime

datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}

PySpark:
{code:python}
import pyspark.sql.functions as F

df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate",
              F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"),
                              "dd-MM-yy")).collect()
# Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
{code}

As a workaround I had to write my own solution. The code below is specific to 
my data pipeline, but it shows what was needed just to change the boundary 
cut-off and handle two-digit years better.
{code:python}
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    # Two-digit years >= "40" are mapped to 19xx, the rest to 20xx.
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"),
                                            F.lit("19"), yy)).otherwise(
                                F.concat(dd, F.lit("-"), MMM, F.lit("-"),
                                         F.lit("20"), yy)))
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
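
For comparison, a rough sketch of the same pivot idea applied directly to a 
single column; the column name, sample data, pivot of 40 and the four-digit 
output pattern are all assumptions of the sketch:
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("10-JAN-69",), ("10-JAN-39",)], ["dt"])

pivot = 40  # two-digit years >= 40 become 19xx, the rest become 20xx
yy = F.substring("dt", 8, 2).cast("int")
century = F.when(yy >= pivot, F.lit("19")).otherwise(F.lit("20"))
four_digit = F.concat(F.substring("dt", 1, 7), century, F.substring("dt", 8, 2))

df.select("dt", F.to_date(four_digit, "dd-MMM-yyyy").alias("parsed")).show()
# 10-JAN-69 -> 1969-01-10, 10-JAN-39 -> 2039-01-10
{code}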
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269571#comment-17269571
 ] 

Nicholas Chammas commented on SPARK-12890:
--

I've created SPARK-34194 and fleshed out the description of the problem a bit.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-01-21 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-34194:


 Summary: Queries that only touch partition columns shouldn't scan 
through all files
 Key: SPARK-34194
 URL: https://issues.apache.org/jira/browse/SPARK-34194
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


When querying only the partition columns of a partitioned table, it seems that 
Spark nonetheless scans through all files in the table, even though it doesn't 
need to.

Here's an example:
{code:python}
>>> data = spark.read.option('mergeSchema', 
>>> 'false').parquet('s3a://some/dataset')
[Stage 0:==>  (407 + 12) / 1158]
{code}
Note the 1158 tasks. This matches the number of partitions in the table, which 
is partitioned on a single field named {{file_date}}:
{code:sh}
$ aws s3 ls s3://some/dataset | head -n 3
   PRE file_date=2017-05-01/
   PRE file_date=2017-05-02/
   PRE file_date=2017-05-03/

$ aws s3 ls s3://some/dataset | wc -l
1158
{code}
The table itself has over 138K files, though:
{code:sh}
$ aws s3 ls --recursive --human --summarize s3://some/dataset
...
Total Objects: 138708
   Total Size: 3.7 TiB
{code}
Now let's try to query just the {{file_date}} field and see what Spark does.
{code:python}
>>> data.select('file_date').orderBy('file_date', 
>>> ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
output=[file_date#11])
+- *(1) ColumnarToRow
   +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<>

>>> data.select('file_date').orderBy('file_date', 
>>> ascending=False).limit(1).show()
[Stage 2:>   (179 + 12) / 41011]
{code}
Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
job progresses? I'm not sure.

What I do know is that this operation takes a long time (~20 min) running from 
my laptop, whereas listing the top-level {{file_date}} partitions via the AWS 
CLI takes a second or two.

Spark appears to be going through all the files in the table, when it just 
needs to list the partitions captured in the S3 "directory" structure. The 
query is only touching {{file_date}}, after all.

The current workaround for this performance problem / optimizer wastefulness 
is to [query the catalog directly|https://stackoverflow.com/a/65724151/877069]. 
It works, but it is a lot of extra work compared to the elegant query against 
{{file_date}} that users actually intend.

Spark should somehow know when it is only querying partition fields and skip 
iterating through all the individual files in a table.

Tested on Spark 3.0.1.
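
For reference, a sketch of the kind of partition-level listing that answers 
the example query by hand; the bucket and prefix are the placeholders from 
above, and boto3 is used only to illustrate the 'list the partition 
directories' idea, not as a proposed Spark API:
{code:python}
import boto3

s3 = boto3.client("s3")
pages = s3.get_paginator("list_objects_v2").paginate(
    Bucket="some", Prefix="dataset/", Delimiter="/")

# With Delimiter="/", each CommonPrefix is one Hive-style partition directory,
# e.g. "dataset/file_date=2017-05-01/"; pull the partition value out of it.
file_dates = [
    p["Prefix"].rstrip("/").split("file_date=")[-1]
    for page in pages
    for p in page.get("CommonPrefixes", [])
]
print(max(file_dates))  # the max(file_date) answer, without touching data files
{code}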



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34193) Potential race condition during decommissioning with TorrentBroadcast

2021-01-21 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269565#comment-17269565
 ] 

Holden Karau commented on SPARK-34193:
--

Note: so far I've only triggered this once, and only while back porting, so the 
"Affects Version" is currently a guess.

> Potential race condition during decommissioning with TorrentBroadcast
> -
>
> Key: SPARK-34193
> URL: https://issues.apache.org/jira/browse/SPARK-34193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Holden Karau
>Priority: Major
>
> I found this while back porting so the line numbers should be ignored, but 
> the core of the issue is that we shouldn't be failing the job on this (I 
> don't think). We could fix this by allowing broadcast blocks to be put or 
> having the torrent broadcast ignore this exception.
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: 
> org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: 
> Block broadcast_2_piece0 cannot be saved on decommissioned executor[info]   
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 
> (TID 8, 192.168.1.57, executor 1): java.io.IOException: 
> org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: 
> Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)[info] at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:215)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)[info]
>  at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)[info] at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)[info] at 
> org.apache.spark.scheduler.Task.run(Task.scala:123)[info] at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:448)[info]
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)[info] 
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)[info] 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[info]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[info]
>  at java.lang.Thread.run(Thread.java:748)[info] Caused by: 
> org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: 
> Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1105)[info] at 
> org.apache.spark.storage.BlockManager.doPutBytes(BlockManager.scala:1010)[info]
>  at 
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:986)[info] 
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:181)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info]
>  at scala.collection.immutable.List.foreach(List.scala:392)[info] at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:159)[info]
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:239)[info]
>  at scala.Option.getOrElse(Option.scala:121)[info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:219)[info]
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)[info] ... 
> 13 more[info][info] Driver stacktrace:[info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1928)[info]
>    at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1916)[info]
>    at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1915)[info]
>    at 
> 

[jira] [Created] (SPARK-34193) Potential race condition during decommissioning with TorrentBroadcast

2021-01-21 Thread Holden Karau (Jira)
Holden Karau created SPARK-34193:


 Summary: Potential race condition during decommissioning with 
TorrentBroadcast
 Key: SPARK-34193
 URL: https://issues.apache.org/jira/browse/SPARK-34193
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
Reporter: Holden Karau


I found this while back porting, so the line numbers should be ignored, but the 
core of the issue is that we shouldn't be failing the job on this (I don't 
think). We could fix this by allowing broadcast blocks to be put on 
decommissioning executors, or by having the torrent broadcast ignore this 
exception.

[info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 
3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: 
org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block 
broadcast_2_piece0 cannot be saved on decommissioned executor[info]   
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 
8, 192.168.1.57, executor 1): java.io.IOException: 
org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block 
broadcast_2_piece0 cannot be saved on decommissioned executor[info] at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)[info] at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:215)[info]
 at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)[info]
 at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)[info]
 at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)[info]
 at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)[info] at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)[info] at 
org.apache.spark.scheduler.Task.run(Task.scala:123)[info] at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:448)[info]
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)[info] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)[info] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[info]
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[info]
 at java.lang.Thread.run(Thread.java:748)[info] Caused by: 
org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block 
broadcast_2_piece0 cannot be saved on decommissioned executor[info] at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1105)[info] at 
org.apache.spark.storage.BlockManager.doPutBytes(BlockManager.scala:1010)[info] 
at org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:986)[info] 
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:181)[info]
 at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info]
 at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info]
 at scala.collection.immutable.List.foreach(List.scala:392)[info] at 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:159)[info]
 at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:239)[info]
 at scala.Option.getOrElse(Option.scala:121)[info] at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:219)[info]
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)[info] ... 
13 more[info][info] Driver stacktrace:[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1928)[info]
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1916)[info]
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1915)[info]
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)[info]
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)[info]  
 at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1915)[info]
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:951)[info]
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:951)[info]
   at 

[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269404#comment-17269404
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31283

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The MS SQL JDBC driver introduced support for spatial types since version 
> 7.0. The JDBC data source lacks mappings for these types which results in an 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33813:


Assignee: (was: Apache Spark)

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The MS SQL JDBC driver introduced support for spatial types since version 
> 7.0. The JDBC data source lacks mappings for these types which results in an 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33813:


Assignee: Apache Spark

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Assignee: Apache Spark
>Priority: Major
>
> The MS SQL JDBC driver introduced support for spatial types since version 
> 7.0. The JDBC data source lacks mappings for these types which results in an 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269403#comment-17269403
 ] 

Apache Spark commented on SPARK-33813:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31283

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34191) udf type hint should allow decorator with named returnType

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269364#comment-17269364
 ] 

Apache Spark commented on SPARK-34191:
--

User 'pgrz' has created a pull request for this issue:
https://github.com/apache/spark/pull/31282

> udf type hint should allow decorator with named returnType
> ---
>
> Key: SPARK-34191
> URL: https://issues.apache.org/jira/browse/SPARK-34191
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> At the moment, annotations allow the following decorator patterns:
>  
> {code:python}
> @udf
> def f(x): ...
> @udf("string")  # Or DataType instance
> def f(x): ...
> @udf(f="string")  # Awkward but technically valid
> def f(x): ...
> {code}
> We should also support 
> {code:python}
> @udf(returnType="string")
> def f(x): ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34191) udf type hint should allow decorator with named returnType

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34191:


Assignee: Apache Spark

> udf type hint should allow decorator with named returnType
> ---
>
> Key: SPARK-34191
> URL: https://issues.apache.org/jira/browse/SPARK-34191
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> At the moment, annotations allow the following decorator patterns:
>  
> {code:python}
> @udf
> def f(x): ...
> @udf("string")  # Or DataType instance
> def f(x): ...
> @udf(f="string")  # Awkward but technically valid
> def f(x): ...
> {code}
> We should also support 
> {code:python}
> @udf(returnType="string")
> def f(x): ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34191) udf type hint should allow decorator with named returnType

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34191:


Assignee: (was: Apache Spark)

> udf type hint should allow decorator with named returnType
> ---
>
> Key: SPARK-34191
> URL: https://issues.apache.org/jira/browse/SPARK-34191
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> At the moment, annotations allow the following decorator patterns:
>  
> {code:python}
> @udf
> def f(x): ...
> @udf("string")  # Or DataType instance
> def f(x): ...
> @udf(f="string")  # Awkward but technically valid
> def f(x): ...
> {code}
> We should also support 
> {code:python}
> @udf(returnType="string")
> def f(x): ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34192) Move char padding to write side

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34192:


Assignee: (was: Apache Spark)

> Move char padding to write side
> ---
>
> Key: SPARK-34192
> URL: https://issues.apache.org/jira/browse/SPARK-34192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> On the read side, the char length check and padding bring issues to CBO and 
> PPD, and other issues to Catalyst.
> It's more reasonable to do it on the write side, as Spark doesn't have full 
> control of the storage layer.
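
As a concrete illustration of what padding "on the write side" would mean (the table name is made up; this is not the proposed patch):

{code:scala}
// Illustrative only: CHAR(5) semantics require values shorter than 5 to be
// space-padded. Doing the padding when the row is written means readers see
// already-padded data, so no extra read-side projection is needed.
spark.sql("CREATE TABLE chars_demo (c CHAR(5)) USING parquet")
spark.sql("INSERT INTO chars_demo VALUES ('ab')")

// With write-side padding the stored value is 'ab   ' (length 5), so a plain
// scan returns the padded form; by hand the padding is just rpad(c, 5, ' ').
spark.table("chars_demo").show()
{code}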



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34192) Move char padding to write side

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269335#comment-17269335
 ] 

Apache Spark commented on SPARK-34192:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31281

> Move char padding to write side
> ---
>
> Key: SPARK-34192
> URL: https://issues.apache.org/jira/browse/SPARK-34192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> On the read side, the char length check and padding bring issues to CBO and 
> PPD, and other issues to Catalyst.
> It's more reasonable to do it on the write side, as Spark doesn't have full 
> control of the storage layer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34192) Move char padding to write side

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34192:


Assignee: Apache Spark

> Move char padding to write side
> ---
>
> Key: SPARK-34192
> URL: https://issues.apache.org/jira/browse/SPARK-34192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> On the read side, the char length check and padding bring issues to CBO and 
> PPD, and other issues to Catalyst.
> It's more reasonable to do it on the write side, as Spark doesn't have full 
> control of the storage layer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34192) Move char padding to write side

2021-01-21 Thread Kent Yao (Jira)
Kent Yao created SPARK-34192:


 Summary: Move char padding to write side
 Key: SPARK-34192
 URL: https://issues.apache.org/jira/browse/SPARK-34192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


On the read side, the char length check and padding bring issues to CBO and PPD, 
and other issues to Catalyst.

It's more reasonable to do it on the write side, as Spark doesn't have full 
control of the storage layer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34094) Extends StringTranslate to support unicode characters whose code point >= U+10000

2021-01-21 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34094.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31164
[https://github.com/apache/spark/pull/31164]

> Extends StringTranslate to support unicode characters whose code point >= 
> U+10000
> -
>
> Key: SPARK-34094
> URL: https://issues.apache.org/jira/browse/SPARK-34094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, StringTranslate works only with unicode characters whose code 
> point is < U+10000, so let's extend it to support code points >= U+10000.
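
For context on why the extension is needed (an illustrative snippet, not the patch): a JVM String stores code points at or above U+10000 as surrogate pairs, so per-char translation maps miss them.

{code:scala}
// A supplementary character (here U+1F600) occupies two UTF-16 code units,
// so char-based lookups see two halves of a surrogate pair, while
// code-point iteration sees it as a single unit.
val s = "a" + new String(Character.toChars(0x1F600)) + "b"
println(s.length)                      // 4 UTF-16 code units
println(s.codePointCount(0, s.length)) // 3 code points
s.codePoints().forEach(cp => println(Integer.toHexString(cp))) // 61, 1f600, 62
{code}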



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34191) udf type hint should allow decorator with named returnType

2021-01-21 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-34191:
--

 Summary: udf type hint should allow decorator with named 
returnType
 Key: SPARK-34191
 URL: https://issues.apache.org/jira/browse/SPARK-34191
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0, 3.2.0, 3.1.1
Reporter: Maciej Szymkiewicz


At the moment, annotations allow the following decorator patterns:
 
{code:python}
@udf
def f(x): ...

@udf("string")  # Or DataType instance
def f(x): ...

@udf(f="string")  # Awkward but technically valid
def f(x): ...
{code}

We should also support 

{code:python}
@udf(returnType="string")
def f(x): ...
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34190:
-
Parent: SPARK-31851
Issue Type: Sub-task  (was: Documentation)

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.1.2
>
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34190:


Assignee: Haejoon Lee

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34190.
--
Fix Version/s: 3.1.2
   Resolution: Fixed

Issue resolved by pull request 31280
[https://github.com/apache/spark/pull/31280]

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.1.2
>
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34190:
-
Affects Version/s: (was: 3.0.1)
   3.1.0

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34138) Keep dependants cached while refreshing v1 tables

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34138:
---

Assignee: Maxim Gekk

> Keep dependants cached while refreshing v1 tables
> -
>
> Key: SPARK-34138
> URL: https://issues.apache.org/jira/browse/SPARK-34138
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Keeping dependants cached while refreshing v1 tables should improve the user 
> experience with table/view caching. For example, imagine that a user has 
> cached a v1 table and a cached view based on that table, and then passes the 
> table to an external library which drops/renames/adds partitions in the v1 
> table. Unfortunately, the view gets uncached after that even though the user 
> hasn't uncached it explicitly.
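
A rough sketch of that scenario (illustrative only; the table and view names are made up):

{code:scala}
// Cache a partitioned v1 table and a view derived from it, then refresh the
// table after an external tool has changed its partitions. The point of this
// ticket is that the dependent view should stay cached across the refresh.
spark.sql("CREATE TABLE events (id INT, day STRING) USING parquet PARTITIONED BY (day)")
spark.sql("CACHE TABLE events")

spark.sql("CREATE TEMPORARY VIEW recent_events AS SELECT * FROM events WHERE day >= '2021-01-01'")
spark.sql("CACHE TABLE recent_events")

// ... an external library adds/drops/renames partitions of `events` here ...

spark.sql("REFRESH TABLE events")
// Desired behaviour: `recent_events` remains cached (recached with fresh data)
// instead of being silently dropped from the cache.
{code}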



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34138) Keep dependants cached while refreshing v1 tables

2021-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34138.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31206
[https://github.com/apache/spark/pull/31206]

> Keep dependants cached while refreshing v1 tables
> -
>
> Key: SPARK-34138
> URL: https://issues.apache.org/jira/browse/SPARK-34138
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Keeping dependants cached while refreshing v1 tables should improve the user 
> experience with table/view caching. For example, imagine that a user has 
> cached a v1 table and a cached view based on that table, and then passes the 
> table to an external library which drops/renames/adds partitions in the v1 
> table. Unfortunately, the view gets uncached after that even though the user 
> hasn't uncached it explicitly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269271#comment-17269271
 ] 

Kousuke Saruta commented on SPARK-33813:


[~cloud_fan] O.K, I'll try it.

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269190#comment-17269190
 ] 

Apache Spark commented on SPARK-34190:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/31280

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34190:


Assignee: Apache Spark

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34190:


Assignee: (was: Apache Spark)

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description for Python Package Management

2021-01-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34190:

Summary: Supplement the description for Python Package Management  (was: 
Supplement the description in the document)

> Supplement the description for Python Package Management
> 
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is a lack of explanation in the "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but this does 
> not work if there is no Python installed on the node: the Python in the 
> packed environment is a symbolic link to the local one, so Python must exist 
> at the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description in the document

2021-01-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34190:

Description: 
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not 
working if there is no Python installed on the node.

Because the Python in the packed environment has a symbolic link that connects 
Python to the local one, so Python must exist in the same path on all nodes.

 

  was:
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not 
working if there is no Python installed for the all nodes in cluster.

Because the Python in the packed environment has a symbolic link that connects 
Python to the local one, so Python must exist in the same path on all nodes.

 


> Supplement the description in the document
> --
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is lack of explanation for "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but it's not 
> working if there is no Python installed on the node.
> Because the Python in the packed environment has a symbolic link that 
> connects Python to the local one, so Python must exist in the same path on 
> all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description in the document

2021-01-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34190:

Description: 
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not work 
if there is no Python installed for the all nodes in cluster.

Because the Python in the packed environment has a symbolic link that connects 
Python to the local one, so Python must exist in the same path on all nodes.

 

  was:
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not work 
if there is no Python installed for the all nodes in cluster.

The python in the packed environment has a symbolic link that connects Python 
to the local one, so Python must exist in the same path on all nodes.

 


> Supplement the description in the document
> --
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is lack of explanation for "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but it's not 
> work if there is no Python installed for the all nodes in cluster.
> Because the Python in the packed environment has a symbolic link that 
> connects Python to the local one, so Python must exist in the same path on 
> all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description in the document

2021-01-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34190:

Description: 
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not work 
if there is no Python installed for the all nodes in cluster.

The python in the packed environment has a symbolic link that connects Python 
to the local one, so Python must exist in the same path on all nodes.

 

  was:
There is inconsistent explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not true.

The python in the packed environment has a symbolic link that connects Python 
to the local one, so Python must exist in the same path on all nodes.

 


> Supplement the description in the document
> --
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is lack of explanation for "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but it's not 
> work if there is no Python installed for the all nodes in cluster.
> The python in the packed environment has a symbolic link that connects Python 
> to the local one, so Python must exist in the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description in the document

2021-01-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34190:

Description: 
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not 
working if there is no Python installed for the all nodes in cluster.

Because the Python in the packed environment has a symbolic link that connects 
Python to the local one, so Python must exist in the same path on all nodes.

 

  was:
There is lack of explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not work 
if there is no Python installed for the all nodes in cluster.

Because the Python in the packed environment has a symbolic link that connects 
Python to the local one, so Python must exist in the same path on all nodes.

 


> Supplement the description in the document
> --
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is lack of explanation for "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but it's not 
> working if there is no Python installed for the all nodes in cluster.
> Because the Python in the packed environment has a symbolic link that 
> connects Python to the local one, so Python must exist in the same path on 
> all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34190) Supplement the description in the document

2021-01-21 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34190:

Summary: Supplement the description in the document  (was: Fix inconsistent 
docs in Python Package Management)

> Supplement the description in the document
> --
>
> Key: SPARK-34190
> URL: https://issues.apache.org/jira/browse/SPARK-34190
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Haejoon Lee
>Priority: Major
>
> There is inconsistent explanation for "Using Virtualenv" chapter.
> It says "It packs the current virtual environment to an archive file, and It 
> self-contains both Python interpreter and the dependencies", but it's not 
> true.
> The python in the packed environment has a symbolic link that connects Python 
> to the local one, so Python must exist in the same path on all nodes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269182#comment-17269182
 ] 

Attila Zsolt Piros edited comment on SPARK-34167 at 1/21/21, 9:49 AM:
--

[~razajafri] could you please share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at <console>:27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}




was (Author: attilapiros):
[~razajafri] could you share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at <console>:27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}



> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> 

[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269182#comment-17269182
 ] 

Attila Zsolt Piros commented on SPARK-34167:


[~razajafri] could you share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at <console>:27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}



> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> 

[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269177#comment-17269177
 ] 

Apache Spark commented on SPARK-33518:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31279

> Improve performance of ML ALS recommendForAll by GEMV
> -
>
> Key: SPARK-33518
> URL: https://issues.apache.org/jira/browse/SPARK-33518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.2.0
>
>
> There has been a lot of work on improving ALS's {{recommendForAll}}.
> For now, I found that it may be further optimized by:
> 1. using GEMV;
> 2. using guava Ordering instead of BoundedPriorityQueue;
>  
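
A tiny sketch of point 2 above (illustrative only; the data and variable names are made up):

{code:scala}
// Selecting the top-k scores with Guava's Ordering.greatestOf instead of
// maintaining a BoundedPriorityQueue by hand.
import com.google.common.collect.Ordering
import scala.collection.JavaConverters._

val k = 3
val scores: Seq[java.lang.Float] = Seq(0.9f, 0.4f, 0.7f, 0.8f, 0.1f).map(Float.box)

// greatestOf scans the input once and keeps only the k largest elements.
val topK = Ordering.natural[java.lang.Float]().greatestOf(scores.asJava, k).asScala
println(topK) // Buffer(0.9, 0.8, 0.7)
{code}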



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269176#comment-17269176
 ] 

Apache Spark commented on SPARK-33518:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31279

> Improve performance of ML ALS recommendForAll by GEMV
> -
>
> Key: SPARK-33518
> URL: https://issues.apache.org/jira/browse/SPARK-33518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.2.0
>
>
> There has been a lot of work on improving ALS's {{recommendForAll}}.
> For now, I found that it may be further optimized by:
> 1. using GEMV;
> 2. using guava Ordering instead of BoundedPriorityQueue;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34190) Fix inconsistent docs in Python Package Management

2021-01-21 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-34190:
---

 Summary: Fix inconsistent docs in Python Package Management
 Key: SPARK-34190
 URL: https://issues.apache.org/jira/browse/SPARK-34190
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.1
Reporter: Haejoon Lee


There is inconsistent explanation for "Using Virtualenv" chapter.

It says "It packs the current virtual environment to an archive file, and It 
self-contains both Python interpreter and the dependencies", but it's not true.

The python in the packed environment has a symbolic link that connects Python 
to the local one, so Python must exist in the same path on all nodes.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34188) Char varchar length check blocks CBO statistics

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269136#comment-17269136
 ] 

Apache Spark commented on SPARK-34188:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31278

> Char varchar length check blocks CBO statistics
> ---
>
> Key: SPARK-34188
> URL: https://issues.apache.org/jira/browse/SPARK-34188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 
> The char/varchar length check changes the output via a projection, and the 
> filter is pushed down through the projection with the new unaliased output, 
> which the CBO estimation cannot recognize.
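
A rough sketch of the kind of query this affects (illustrative only; the table name and column names are made up, not taken from the ticket):

{code:scala}
// With CBO enabled, column statistics are looked up for the attributes a
// filter references. If the read-side length check wraps the scan in a
// projection that re-creates the char column as a new attribute, a filter
// pushed through that projection no longer matches the attribute the
// statistics were collected for.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("CREATE TABLE char_stats_demo (c CHAR(5), v INT) USING parquet")
spark.sql("ANALYZE TABLE char_stats_demo COMPUTE STATISTICS FOR COLUMNS c, v")
spark.sql("SELECT * FROM char_stats_demo WHERE c = 'ab   '").explain("cost")
{code}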



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34188) Char varchar length check blocks CBO statistics

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34188:


Assignee: (was: Apache Spark)

> Char varchar length check blocks CBO statistics
> ---
>
> Key: SPARK-34188
> URL: https://issues.apache.org/jira/browse/SPARK-34188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 
> The char/varchar length check changes the output via a projection, and the 
> filter is pushed down through the projection with the new unaliased output, 
> which the CBO estimation cannot recognize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34188) Char varchar length check blocks CBO statistics

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269133#comment-17269133
 ] 

Apache Spark commented on SPARK-34188:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31278

> Char varchar length check blocks CBO statistics
> ---
>
> Key: SPARK-34188
> URL: https://issues.apache.org/jira/browse/SPARK-34188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 
> The char/varchar length check changes the output via a projection, and the 
> filter is pushed down through the projection with the new unaliased output, 
> which the CBO estimation cannot recognize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34188) Char varchar length check blocks CBO statistics

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34188:


Assignee: Apache Spark

> Char varchar length check blocks CBO statistics
> ---
>
> Key: SPARK-34188
> URL: https://issues.apache.org/jira/browse/SPARK-34188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 
> The char/varchar length check changes the output via a projection, and the 
> filter is pushed down through the projection with the new unaliased output, 
> which the CBO estimation cannot recognize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver

2021-01-21 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269119#comment-17269119
 ] 

Wenchen Fan commented on SPARK-33813:
-

[~sarutak] do you have time to look into this? thanks!

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The 
> JDBC data source lacks mappings for these types, which results in the 
> exception below. It seems that a mapping in 
> MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to 
> VARBINARY should address the issue.
>  
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>  at scala.Option.getOrElse(Option.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>  at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}
>  
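
A minimal sketch of the mapping idea suggested in the description (illustrative only,
not the merged patch): the object name below is made up, the typecode constants are the
ones quoted in the ticket, and BinaryType is assumed to be the Catalyst counterpart of
VARBINARY.

{noformat}
// Illustrative sketch of the suggested mapping (not the actual patch).
import org.apache.spark.sql.types.{BinaryType, DataType}

object MsSqlSpatialTypeMappingSketch {
  private val Geometry = -157   // typecode reported for geometry
  private val Geography = -158  // typecode reported for geography

  // Mirrors the Option-based contract of JdbcDialect.getCatalystType: return
  // Some(catalystType) for the spatial typecodes, None for everything else.
  def getCatalystType(sqlType: Int): Option[DataType] = sqlType match {
    case Geometry | Geography => Some(BinaryType)
    case _ => None
  }
}
{noformat}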



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33901) Char and Varchar display error after DDLs

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269117#comment-17269117
 ] 

Apache Spark commented on SPARK-33901:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31277

> Char and Varchar display error after DDLs
> -
>
> Key: SPARK-33901
> URL: https://issues.apache.org/jira/browse/SPARK-33901
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> CTAS / CREATE TABLE LIKE / CVAS / ALTER TABLE ADD COLUMNS
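
A small repro sketch implied by that DDL list (table and view names are made up, and it
assumes "CVAS" refers to CREATE VIEW AS SELECT); the reported symptom is that
CHAR(n)/VARCHAR(n) may be displayed as plain STRING on the derived objects:

{noformat}
// Hypothetical repro sketch; table/view names are made up, and the view step
// assumes "CVAS" means CREATE VIEW AS SELECT.
import org.apache.spark.sql.SparkSession

object CharVarcharDdlDisplayRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("char-varchar-ddl-display")
      .getOrCreate()

    spark.sql("CREATE TABLE src (c CHAR(5), v VARCHAR(5)) USING parquet")

    // DDLs from the list above, each deriving a new object from src.
    spark.sql("CREATE TABLE ctas_tbl USING parquet AS SELECT * FROM src")
    spark.sql("CREATE TABLE like_tbl LIKE src")
    spark.sql("CREATE VIEW cvas_view AS SELECT * FROM src")
    spark.sql("ALTER TABLE src ADD COLUMNS (v2 VARCHAR(7))")

    // The reported symptom: CHAR(5)/VARCHAR(5) may show up as plain STRING here.
    Seq("ctas_tbl", "like_tbl", "cvas_view", "src").foreach { name =>
      spark.sql(s"DESCRIBE TABLE $name").show(truncate = false)
    }

    spark.stop()
  }
}
{noformat}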



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34189) w2v findSynonyms optimization

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34189:


Assignee: Apache Spark  (was: zhengruifeng)

> w2v findSynonyms optimization
> -
>
> Key: SPARK-34189
> URL: https://issues.apache.org/jira/browse/SPARK-34189
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering 
> instead of BoundedPriorityQueue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34189) w2v findSynonyms optimization

2021-01-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34189:


Assignee: zhengruifeng  (was: Apache Spark)

> w2v findSynonyms optimization
> -
>
> Key: SPARK-34189
> URL: https://issues.apache.org/jira/browse/SPARK-34189
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering 
> instead of BoundedPriorityQueue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34189) w2v findSynonyms optimization

2021-01-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269115#comment-17269115
 ] 

Apache Spark commented on SPARK-34189:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31276

> w2v findSynonyms optimization
> -
>
> Key: SPARK-34189
> URL: https://issues.apache.org/jira/browse/SPARK-34189
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering 
> instead of BoundedPriorityQueue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34189) w2v findSynonyms optimization

2021-01-21 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-34189:


Assignee: zhengruifeng

> w2v findSynonyms optimization
> -
>
> Key: SPARK-34189
> URL: https://issues.apache.org/jira/browse/SPARK-34189
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering 
> instead of BoundedPriorityQueue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34189) w2v findSynonyms optimization

2021-01-21 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-34189:


 Summary: w2v findSynonyms optimization
 Key: SPARK-34189
 URL: https://issues.apache.org/jira/browse/SPARK-34189
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng


{{findSynonyms}} in w2v could be further optimized by using Guava's Ordering instead 
of BoundedPriorityQueue
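
A rough sketch of the idea (illustrative only; the object and helper names below are
made up, and the real change may look different): Guava's Ordering.greatestOf keeps the
k best candidates in a single pass over the input, which is the same job the
hand-maintained BoundedPriorityQueue does today.

{noformat}
// Sketch of the proposed direction: pick the k most similar words with Guava's
// Ordering instead of a hand-maintained BoundedPriorityQueue.
import java.util.Comparator

import scala.collection.JavaConverters._

import com.google.common.collect.Ordering

object TopKSynonymsSketch {

  // Order (word, similarity) pairs by their similarity score.
  private val bySimilarity: Ordering[(String, Double)] =
    Ordering.from(new Comparator[(String, Double)] {
      override def compare(a: (String, Double), b: (String, Double)): Int =
        java.lang.Double.compare(a._2, b._2)
    })

  // greatestOf keeps only the k best candidates while scanning the input once.
  def topK(scored: Seq[(String, Double)], k: Int): Seq[(String, Double)] =
    bySimilarity.greatestOf(scored.asJava, k).asScala.toSeq

  def main(args: Array[String]): Unit = {
    val candidates = Seq(("cat", 0.91), ("dog", 0.87), ("car", 0.12), ("kitten", 0.95))
    println(topK(candidates, 2)) // List((kitten,0.95), (cat,0.91))
  }
}
{noformat}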



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org