[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269964#comment-17269964 ] Apache Spark commented on SPARK-33813: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31290

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
> ---
>
> Key: SPARK-33813
> URL: https://issues.apache.org/jira/browse/SPARK-33813
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Michał Świtakowski
> Assignee: Kousuke Saruta
> Priority: Major
> Fix For: 3.2.0
>
> The MS SQL JDBC driver has supported spatial types since version 7.0. The JDBC data source lacks mappings for these types, which results in the exception below. It seems that a mapping in MsSqlServerDialect.getCatalystType that maps the -157 and -158 type codes to VARBINARY should address the issue.
>
> {noformat}
> java.sql.SQLException: Unrecognized SQL type -157
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>   at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>   at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
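The fix sketched in the report (mapping the two spatial type codes to a binary type in MsSqlServerDialect.getCatalystType) can be illustrated with a minimal, self-contained sketch. This is not Spark's actual dialect code; the object and type names here are illustrative, and only the -157/-158 codes come from the issue itself.

```scala
// Illustrative sketch only -- not Spark's actual MsSqlServerDialect.
// Recognize the two spatial JDBC type codes and map them to a binary
// Catalyst type; return None for everything else so the default JDBC
// mapping (the path that currently throws SQLException) still applies.
object SpatialTypeMapping {
  sealed trait CatalystType
  case object BinaryType extends CatalystType // stand-in for Spark's BinaryType

  val GeometryTypeCode = -157  // spatial type codes quoted in the issue
  val GeographyTypeCode = -158

  def getCatalystType(sqlType: Int): Option[CatalystType] = sqlType match {
    case GeometryTypeCode | GeographyTypeCode => Some(BinaryType)
    case _                                    => None
  }
}
```

Returning an `Option` mirrors how JDBC dialects typically layer over the default mapping: `Some` overrides, `None` falls through.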
[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269962#comment-17269962 ] Apache Spark commented on SPARK-33813: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31289

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[jira] [Resolved] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
[ https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33933. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31269 [https://github.com/apache/spark/pull/31269]

> Broadcast timeout happened unexpectedly in AQE
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: Yu Zhong
> Assignee: Yu Zhong
> Priority: Major
> Fix For: 3.2.0
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in otherwise normal queries:
>
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>
> This usually happens when a broadcast join (with or without a hint) follows a long-running shuffle (more than 5 minutes). Disabling AQE makes the issue disappear. Increasing spark.sql.broadcastTimeout works around it, but since the data to broadcast is very small, that should not be necessary.
> After investigation, the root cause appears to be the following: when AQE is enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates query stages for the materialized parts via createQueryStages, and materializes the newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage materializes before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot start (and finish) before spark.sql.broadcastTimeout expires, which fails the whole job (introduced in SPARK-31475).
> Code to reproduce:
>
> {code:scala}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
>     yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect()
> {code}
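For completeness, the two workarounds the report mentions can be written as configuration settings on the `spark` session from the repro above. The timeout value is illustrative; note this only masks the scheduling problem, it does not fix it.

```scala
// Workarounds from the report, applied to the repro's `spark` session:
spark.conf.set("spark.sql.broadcastTimeout", "1200")          // raise the 300s default
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // or disable broadcast joins entirely
```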
[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269834#comment-17269834 ] Apache Spark commented on SPARK-33813: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31288

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[jira] [Commented] (SPARK-34200) ambiguous column reference should consider attribute availability
[ https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269802#comment-17269802 ] Apache Spark commented on SPARK-34200: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31287 > ambiguous column reference should consider attribute availability > - > > Key: SPARK-34200 > URL: https://issues.apache.org/jira/browse/SPARK-34200 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34200) ambiguous column reference should consider attribute availability
[ https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34200: Assignee: Apache Spark (was: Wenchen Fan)

> ambiguous column reference should consider attribute availability
[jira] [Assigned] (SPARK-34200) ambiguous column reference should consider attribute availability
[ https://issues.apache.org/jira/browse/SPARK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34200: Assignee: Wenchen Fan (was: Apache Spark)

> ambiguous column reference should consider attribute availability
[jira] [Created] (SPARK-34200) ambiguous column reference should consider attribute availability
Wenchen Fan created SPARK-34200:
---
Summary: ambiguous column reference should consider attribute availability
Key: SPARK-34200
URL: https://issues.apache.org/jira/browse/SPARK-34200
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-33245) Add built-in UDF - GETBIT
[ https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33245. - Fix Version/s: 3.2.0 Resolution: Fixed

> Add built-in UDF - GETBIT
> ---
>
> Key: SPARK-33245
> URL: https://issues.apache.org/jira/browse/SPARK-33245
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.2.0
>
> Teradata, Impala, Snowflake and Yellowbrick support this function:
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w
> https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit
> https://docs.snowflake.com/en/sql-reference/functions/getbit.html
> https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html
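As a reference point, the GETBIT semantics in the documentation linked above boil down to a one-line shift-and-mask, where position 0 is the least significant bit (the convention the Snowflake and Teradata docs describe). This is a sketch of those semantics, not Spark's implementation:

```scala
// getbit(value, pos): the bit of `value` at position `pos`, counting from
// the least significant bit. Sketch of the documented semantics only.
def getbit(value: Long, pos: Int): Int = ((value >> pos) & 1L).toInt
```

For example, `getbit(11L, 2)` inspects `0b1011` at position 2.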
[jira] [Reopened] (SPARK-33245) Add built-in UDF - GETBIT
[ https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-33245: -

> Add built-in UDF - GETBIT
[jira] [Commented] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines
[ https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269791#comment-17269791 ] Apache Spark commented on SPARK-34199: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/31286

> Block `count(table.*)` to follow ANSI standard and other SQL engines
> ---
>
> Key: SPARK-34199
> URL: https://issues.apache.org/jira/browse/SPARK-34199
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Linhong Liu
> Priority: Major
>
> In Spark, count(table.*) can produce a very surprising result, for example:
> select count(*) from (select 1 as a, null as b) t;
> output: 1
> select count(t.*) from (select 1 as a, null as b) t;
> output: 0
>
> Per the ANSI standard, count(*) is always treated as count(1), while count(t.*) is not allowed. Moreover, common databases such as MySQL and Oracle also disallow it.
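The 1-vs-0 discrepancy above comes from how count with explicit arguments handles NULLs: count(*) counts rows, while count(t.*) behaves like count(a, b, ...), which skips any row where at least one argument is NULL. A sketch of that rule over the single row (1, NULL), modeling NULL with Option:

```scala
// One row with a = 1, b = NULL.
val rows = Seq((Option(1), Option.empty[Int]))

// count(*): counts every row.
val countStar = rows.size
// count(a, b): counts only rows where no argument is NULL.
val countTStar = rows.count { case (a, b) => a.isDefined && b.isDefined }

assert(countStar == 1 && countTStar == 0) // matches the 1 vs 0 outputs above
```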
[jira] [Assigned] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines
[ https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34199: Assignee: Apache Spark

> Block `count(table.*)` to follow ANSI standard and other SQL engines
[jira] [Assigned] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines
[ https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34199: Assignee: (was: Apache Spark)

> Block `count(table.*)` to follow ANSI standard and other SQL engines
[jira] [Assigned] (SPARK-33245) Add built-in UDF - GETBIT
[ https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33245: --- Assignee: jiaan.geng

> Add built-in UDF - GETBIT
[jira] [Assigned] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33541: --- Assignee: jiaan.geng

> Group exception messages in catalyst/expressions
> ---
>
> Key: SPARK-33541
> URL: https://issues.apache.org/jira/browse/SPARK-33541
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Allison Wang
> Assignee: jiaan.geng
> Priority: Major
>
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions'
> || Filename || Count ||
> | Cast.scala | 18 |
> | ExprUtils.scala | 2 |
> | Expression.scala | 8 |
> | InterpretedUnsafeProjection.scala | 1 |
> | ScalaUDF.scala | 2 |
> | SelectedField.scala | 3 |
> | SubExprEvaluationRuntime.scala | 1 |
> | arithmetic.scala | 8 |
> | collectionOperations.scala | 4 |
> | complexTypeExtractors.scala | 3 |
> | csvExpressions.scala | 3 |
> | datetimeExpressions.scala | 4 |
> | decimalExpressions.scala | 2 |
> | generators.scala | 2 |
> | higherOrderFunctions.scala | 6 |
> | jsonExpressions.scala | 2 |
> | literals.scala | 3 |
> | misc.scala | 2 |
> | namedExpressions.scala | 1 |
> | ordering.scala | 1 |
> | package.scala | 1 |
> | regexpExpressions.scala | 1 |
> | stringExpressions.scala | 1 |
> | windowExpressions.scala | 5 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate'
> || Filename || Count ||
> | ApproximatePercentile.scala | 2 |
> | HyperLogLogPlusPlus.scala | 1 |
> | Percentile.scala | 1 |
> | interfaces.scala | 2 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen'
> || Filename || Count ||
> | CodeGenerator.scala | 5 |
> | javaCode.scala | 1 |
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects'
> || Filename || Count ||
> | objects.scala | 12 |
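The ticket above counts throw sites per file but does not spell out the grouping mechanism. As an assumption, the pattern for such a cleanup is to centralize each file's inline `throw new SomeException("...")` calls into one object of named error constructors, so message wording stays consistent and auditable. The object and method names below are illustrative, not Spark's actual API:

```scala
// Illustrative sketch of grouping exception messages: each message gets a
// named constructor in a single object instead of being inlined at the
// throw site. Constructors return the exception; callers `throw` it.
object ExpressionErrors {
  def cannotCastError(from: String, to: String): IllegalArgumentException =
    new IllegalArgumentException(s"cannot cast $from to $to")

  def divideByZeroError(): ArithmeticException =
    new ArithmeticException("division by zero")
}

// A call site in, e.g., Cast.scala would then read:
//   throw ExpressionErrors.cannotCastError("STRING", "MAP")
```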
[jira] [Resolved] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33541. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31228 [https://github.com/apache/spark/pull/31228]

> Group exception messages in catalyst/expressions
[jira] [Resolved] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33813. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31283 [https://github.com/apache/spark/pull/31283]

> JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[jira] [Assigned] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33813: --- Assignee: Kousuke Saruta > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Assignee: Kousuke Saruta >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
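The fix described above amounts to teaching the dialect's type mapper about the two driver-specific JDBC type codes (-157 and -158, which the MS SQL driver reports for its spatial columns). A hedged Python sketch of that mapping logic follows; the real change lives in Scala's MsSqlServerDialect.getCatalystType, and the function and return names here are purely illustrative:

```python
# Illustrative sketch only (the actual fix is Scala code in MsSqlServerDialect):
# map the MS SQL driver-specific JDBC type codes for spatial columns to a
# binary catalyst type, so the raw spatial payload is read as bytes.
def ms_sql_catalyst_type(jdbc_type_code: int) -> str:
    spatial_codes = {-157, -158}  # geometry / geography (per the ticket)
    if jdbc_type_code in spatial_codes:
        return "BinaryType"
    # mirror the behavior seen in the stack trace for unmapped codes
    raise ValueError(f"Unrecognized SQL type {jdbc_type_code}")
```

Mapping to binary leaves decoding of the raw spatial payload to the caller; it only stops schema resolution from failing.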
[jira] [Resolved] (SPARK-34180) Fix the regression brought by SPARK-33888 for PostgresDialect
[ https://issues.apache.org/jira/browse/SPARK-34180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34180. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31262 [https://github.com/apache/spark/pull/31262] > Fix the regression brought by SPARK-33888 for PostgresDialect > - > > Key: SPARK-34180 > URL: https://issues.apache.org/jira/browse/SPARK-34180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Blocker > Fix For: 3.2.0 > > > A regression brought by SPARK-33888 affects PostgresDialect causing > `PostgreSQLIntegrationSuite` to fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines
Linhong Liu created SPARK-34199: --- Summary: Block `count(table.*)` to follow ANSI standard and other SQL engines Key: SPARK-34199 URL: https://issues.apache.org/jira/browse/SPARK-34199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Linhong Liu In Spark, count(table.*) may produce a very surprising result, for example:
select count(*) from (select 1 as a, null as b) t; output: 1
select count(t.*) from (select 1 as a, null as b) t; output: 0
After checking the ANSI standard, count(*) is always treated as count(1), while count(t.*) is not allowed at all. What's more, count(t.*) is also rejected by common databases, e.g. MySQL and Oracle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
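The 1-versus-0 discrepancy above follows from the two COUNT semantics: COUNT(*) counts every row, while COUNT(expr, ...) counts only rows where no argument is NULL, which is effectively what count(t.*) expanded to before being blocked. A plain-Python sketch of the two semantics:

```python
# Plain-Python sketch of the two COUNT semantics described in the ticket.
rows = [{"a": 1, "b": None}]  # mirrors: select 1 as a, null as b

# COUNT(*): every row counts, NULLs included.
count_star = len(rows)

# COUNT(a, b)-style: a row counts only if no argument is NULL, which is
# what count(t.*) was effectively expanded to.
count_t_star = sum(1 for r in rows if all(v is not None for v in r.values()))

print(count_star, count_t_star)  # 1 0
```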
[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type
[ https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269744#comment-17269744 ] Apache Spark commented on SPARK-33489: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/31285 > Support null for conversion from and to Arrow type > -- > > Key: SPARK-33489 > URL: https://issues.apache.org/jira/browse/SPARK-33489 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: Yuya Kanai >Priority: Minor > > I got the error below when using from_arrow_type() in pyspark.sql.pandas.types: > {{Unsupported type in conversion from Arrow: null}} > I noticed NullType exists under pyspark.sql.types, so it seems possible to > convert from pyarrow's null type to PySpark's NullType and vice versa. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
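The requested improvement essentially adds a null entry to the Arrow-to-Spark type tables in pyspark.sql.pandas.types so the lookup no longer raises. A hypothetical dict-based sketch of that symmetric mapping (the table contents and function name here are illustrative, not the actual PySpark internals):

```python
# Hypothetical sketch of the symmetric type mapping the ticket asks for.
# The real tables live in pyspark.sql.pandas.types; entries here are
# illustrative stand-ins using type names as strings.
ARROW_TO_SPARK = {
    "null": "NullType",   # the entry this ticket proposes adding
    "int64": "LongType",
    "string": "StringType",
}
SPARK_TO_ARROW = {v: k for k, v in ARROW_TO_SPARK.items()}

def from_arrow_type_name(arrow_name: str) -> str:
    try:
        return ARROW_TO_SPARK[arrow_name]
    except KeyError:
        # mirrors the error message quoted in the ticket
        raise TypeError(f"Unsupported type in conversion from Arrow: {arrow_name}")
```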
[jira] [Updated] (SPARK-34148) Move general StateStore tests to StateStoreSuiteBase
[ https://issues.apache.org/jira/browse/SPARK-34148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-34148: Parent: SPARK-34198 Issue Type: Sub-task (was: Test) > Move general StateStore tests to StateStoreSuiteBase > > > Key: SPARK-34148 > URL: https://issues.apache.org/jira/browse/SPARK-34148 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > There are some general StateStore tests in StateStoreSuite, which is an > HDFSBackedStateStoreProvider-specific test suite. We should move the general > tests into StateStoreSuiteBase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34148) Move general StateStore tests to StateStoreSuiteBase
[ https://issues.apache.org/jira/browse/SPARK-34148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-34148. - Resolution: Resolved > Move general StateStore tests to StateStoreSuiteBase > > > Key: SPARK-34148 > URL: https://issues.apache.org/jira/browse/SPARK-34148 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > There are some general StateStore tests in StateStoreSuite, which is an > HDFSBackedStateStoreProvider-specific test suite. We should move the general > tests into StateStoreSuiteBase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269737#comment-17269737 ] L. C. Hsieh commented on SPARK-34198: - cc [~dbtsai][~dongjoon][~hyukjin.kwon] > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS has only one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state > rows. As there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage, but Spark SS > still lacks a built-in state store that meets this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore > to Spark SS. To address the concern about adding RocksDB as a direct > dependency, our plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-34198: Issue Type: New Feature (was: Bug) > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS has only one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state > rows. As there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage, but Spark SS > still lacks a built-in state store that meets this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore > to Spark SS. To address the concern about adding RocksDB as a direct > dependency, our plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34198) Add RocksDB StateStore as external module
L. C. Hsieh created SPARK-34198: --- Summary: Add RocksDB StateStore as external module Key: SPARK-34198 URL: https://issues.apache.org/jira/browse/SPARK-34198 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.0 Reporter: L. C. Hsieh Currently Spark SS has only one built-in StateStore implementation, HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As there are more and more streaming applications, some of them require large state in stateful operations such as streaming aggregation and join. Several other major streaming frameworks already use RocksDB for state management, so it is a proven choice for large-state usage, but Spark SS still lacks a built-in state store that meets this requirement. We would like to explore the possibility of adding a RocksDB-based StateStore to Spark SS. To address the concern about adding RocksDB as a direct dependency, our plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
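The proposal above swaps the storage backend behind Spark's StateStore contract: versioned key-value state with get/put and an atomic commit per micro-batch. A minimal in-memory sketch of that contract (names are illustrative; the real interface is Scala, and HDFS or RocksDB would replace the plain dict):

```python
# Minimal illustration of the versioned StateStore contract described above:
# each commit produces a new immutable version of the key-value state, so a
# failed batch can fall back to the last committed version.
class InMemoryStateStore:
    def __init__(self):
        self._snapshots = {0: {}}  # committed version -> state snapshot
        self._version = 0
        self._pending = {}         # uncommitted updates for the current batch

    def put(self, key, value):
        self._pending[key] = value

    def get(self, key):
        if key in self._pending:
            return self._pending[key]
        return self._snapshots[self._version].get(key)

    def commit(self) -> int:
        # fold pending updates into a fresh snapshot and bump the version
        new_state = dict(self._snapshots[self._version])
        new_state.update(self._pending)
        self._version += 1
        self._snapshots[self._version] = new_state
        self._pending = {}
        return self._version
```

A RocksDB-backed implementation keeps the same contract but spills state to disk, which is what makes large aggregation/join state practical.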
[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269726#comment-17269726 ] Apache Spark commented on SPARK-34167: -- User 'razajafri' has created a pull request for this issue: https://github.com/apache/spark/pull/31284 > Reading parquet with Decimal(8,2) written as a Decimal64 blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri >Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, > part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show > ... 
> Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ... > {code} > > > Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the > parquet file correctly because its basing the read on the > [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150] > which is a MessageType and has the underlying data stored correctly as > {{INT64}} where as the *{{OnHeapColumnVector}}* is initialized based on the > [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151] > which is coming from
[jira] [Assigned] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34167: Assignee: (was: Apache Spark) > Reading parquet with Decimal(8,2) written as a Decimal64 blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri >Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, > part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show > ... > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ... > {code} > > > Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads in the > parquet file correctly because it bases the read on the > [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150] > which is a MessageType and has the underlying data stored correctly as > {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the > [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151] > which is coming from {{org.apache.spark.sql.parquet.row.requested_schema}} > that is set by the reader, which is a {{StructType}} and only
[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269725#comment-17269725 ] Apache Spark commented on SPARK-34167: -- User 'razajafri' has created a pull request for this issue: https://github.com/apache/spark/pull/31284 > Reading parquet with Decimal(8,2) written as a Decimal64 blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri >Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, > part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show > ... 
> Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ... > {code} > > > Here are my findings. The *{{VectorizedParquetRecordReader}}* reads in the > parquet file correctly because its basing the read on the > [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150] > which is a MessageType and has the underlying data stored correctly as > {{INT64}} where as the *{{OnHeapColumnVector}}* is initialized based on the > [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151] > which is coming from
[jira] [Assigned] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34167: Assignee: Apache Spark > Reading parquet with Decimal(8,2) written as a Decimal64 blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri >Assignee: Apache Spark >Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, > part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show > ... > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ... > {code} > > > Here are my findings. 
The *{{VectorizedParquetRecordReader}}* reads in the > parquet file correctly because it bases the read on the > [requestedSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L150] > which is a MessageType and has the underlying data stored correctly as > {{INT64}}, whereas the *{{OnHeapColumnVector}}* is initialized based on the > [batchSchema|https://github.com/apache/spark/blob/e6f019836c099398542b443f7700f79de81da0d5/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L151] > which is coming from {{org.apache.spark.sql.parquet.row.requested_schema}} > that is set by the reader, which is a
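For background on the mismatch in this ticket: the Parquet format allows a DECIMAL logical type to be physically stored as INT32 when precision fits in 9 digits, INT64 up to 18 digits, or a byte array beyond that, and a writer may legally pick a wider representation than the minimum. That is how a Decimal(8,2) column ends up as INT64 even though precision < 10 would normally suggest INT32. A small sketch of the "smallest legal physical type" rule:

```python
# Parquet spec rule of thumb: the smallest physical type that can hold a
# DECIMAL of a given precision. A writer may still choose a wider type
# (e.g. Decimal(8,2) stored as INT64), which is the mismatch in this ticket.
def smallest_physical_type(precision: int) -> str:
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"
```

A robust reader therefore has to size its column vectors from the file's actual physical type, not from the precision implied by the logical type.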
[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269707#comment-17269707 ] Patrick Grandjean edited comment on SPARK-7768 at 1/21/21, 11:44 PM: - Sorry, but it is frustrating to see this ticket getting postponed again. I started to use frameless, only to discover it is not compatible with the Spark fork used in Databricks :/ Anyone tried quill-spark on Databricks? Why keep UDTRegistration private? was (Author: pgrandjean): Sorry, but it is frustrating to see this ticket getting postponed again. I started to use frameless, only to discover it is not compatible with the Spark fork used in Databricks :/ Anyone tried quill-spark? Why keep UDTRegistration private? > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269707#comment-17269707 ] Patrick Grandjean commented on SPARK-7768: -- Sorry, but it is frustrating to see this ticket getting postponed again. I started to use frameless, only to discover it is not compatible with the Spark fork used in Databricks :/ Anyone tried quill-spark? Why keep UDTRegistration private? > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34195) Base value for parsing two-digit year should be made configurable
[ https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Smart updated SPARK-34195: -- Description: The base value is set to 2000 within Spark for parsing a two-digit year in a date string. If we try to parse "10-JAN-97", it will be interpreted as 2097 instead of 1997. I'm unclear why this base value was changed within Spark, as the standard Python datetime module uses a more sensible value of 69 as the boundary cut-off for determining the correct century to apply. Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] Other libraries, e.g. .NET Core, use 29 as the boundary cut-off. A base value of 2000 is rather impractical: most dates encountered in the real world span both centuries, so I propose this behaviour be reverted to match the Python datetime module and/or that the base value be exposed as an option on the various date functions. This would ensure consistent behaviour across Python and PySpark.

Python:
{code:python}
import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}
PySpark:
{code:python}
import pyspark.sql.functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-")).collect()
# Out[117]: [Row(dt='10-JAN-69', newdate='10-01-2069')]
{code}
As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows the issue I had to deal with just to change the boundary cut-off to better handle two-digit years.
{code:python}
# NOTE: assumes `import pyspark.sql.functions as F` and a DataFrame `std_df` are in scope.
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)
                            ).otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)
                            )
                        )
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
was: The base value is set as 2000 within spark for parsing a two-digit year date string. If we try to parse "10-JAN-97" then this will be interpreted as 2097 instead of 1997. I'm unclear as to why this base value has been changed within spark as the standard python datetime module instead uses a more sensible value of 69 as the boundary cut-off for determining the correct century to apply. Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] Other libraries e.g. .NET Core will use 29 as the boundary cut-off. But a base value of 2000 is rather non-functional indeed. Most dates encountered in the real world will pertain to both centuries and therefore I propose this functionality is reverted to match the existing python datetime module and / or allow the base value to be set as an option to the various date functions. This would ensure there's consistent behaviour across both python and pyspark.
Python:
{code:python}
import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}
PySpark:
{code:python}
import pyspark.sql.functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-")).collect()
# Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
{code}
As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows the issue I had to deal with just to change the boundary cut-off to better handle two-digit years.
{code:python}
# NOTE: assumes `import pyspark.sql.functions as F` and a DataFrame `std_df` are in scope.
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)
                            ).otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)
                            )
                        )
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return
{code}
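For reference, the standard-library behaviour the reporter wants to match can be checked directly without Spark: Python's {{%y}} directive pivots at 69, so two-digit years 69–99 resolve to the 1900s and 00–68 to the 2000s. A minimal stdlib-only check (this demonstrates Python's behaviour, not Spark's):

```python
from datetime import datetime

# Python's %y pivot: two-digit years 69-99 resolve to 19xx, 00-68 to 20xx.
# %b is case-insensitive, so 'JAN' parses the same as 'Jan'.
assert datetime.strptime('10-JAN-69', '%d-%b-%y').date().year == 1969
assert datetime.strptime('10-JAN-68', '%d-%b-%y').date().year == 2068
assert datetime.strptime('10-JAN-97', '%d-%b-%y').date().year == 1997
```

By contrast, Spark's {{yy}} pattern resolves both '69' and '97' to the 2000s, which is the inconsistency this ticket describes.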
[jira] [Commented] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views
[ https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269617#comment-17269617 ] Apache Spark commented on SPARK-34197: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31265 > refreshTable() should not invalidate the relation cache for temporary views > --- > > Key: SPARK-34197 > URL: https://issues.apache.org/jira/browse/SPARK-34197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > The SessionCatalog.refreshTable() should not invalidate the entry in the > relation cache for a table when refreshTable() refreshes a temp view. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views
[ https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34197: Assignee: (was: Apache Spark) > refreshTable() should not invalidate the relation cache for temporary views > --- > > Key: SPARK-34197 > URL: https://issues.apache.org/jira/browse/SPARK-34197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > The SessionCatalog.refreshTable() should not invalidate the entry in the > relation cache for a table when refreshTable() refreshes a temp view. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views
[ https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269616#comment-17269616 ] Apache Spark commented on SPARK-34197: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31265 > refreshTable() should not invalidate the relation cache for temporary views > --- > > Key: SPARK-34197 > URL: https://issues.apache.org/jira/browse/SPARK-34197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > The SessionCatalog.refreshTable() should not invalidate the entry in the > relation cache for a table when refreshTable() refreshes a temp view. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views
[ https://issues.apache.org/jira/browse/SPARK-34197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34197: Assignee: Apache Spark > refreshTable() should not invalidate the relation cache for temporary views > --- > > Key: SPARK-34197 > URL: https://issues.apache.org/jira/browse/SPARK-34197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The SessionCatalog.refreshTable() should not invalidate the entry in the > relation cache for a table when refreshTable() refreshes a temp view. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34197) refreshTable() should not invalidate the relation cache for temporary views
Maxim Gekk created SPARK-34197: -- Summary: refreshTable() should not invalidate the relation cache for temporary views Key: SPARK-34197 URL: https://issues.apache.org/jira/browse/SPARK-34197 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk The SessionCatalog.refreshTable() should not invalidate the entry in the relation cache for a table when refreshTable() refreshes a temp view. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34196) Improve error message when folks try and install in Python 2
Holden Karau created SPARK-34196: Summary: Improve error message when folks try and install in Python 2 Key: SPARK-34196 URL: https://issues.apache.org/jira/browse/SPARK-34196 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2 Reporter: Holden Karau Current error message: {code:java} Processing ./pyspark-3.1.1.tar.gz ERROR: Command errored out with exit status 1: command: /tmp/py3.1/bin/python2 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-lmlitE/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-lmlitE/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-W1BsIL cwd: /tmp/pip-req-build-lmlitE/ Complete output (6 lines): Traceback (most recent call last): File "", line 1, in File "/tmp/pip-req-build-lmlitE/setup.py", line 31 file=sys.stderr) ^ SyntaxError: invalid syntax ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
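The {{SyntaxError}} above fires because {{setup.py}} uses Python-3-only syntax ({{print(..., file=sys.stderr)}}), which fails to even compile under Python 2, so no friendly message is ever shown. One common way to produce a clearer message (a sketch, not necessarily what the eventual fix looks like; the version threshold and wording are assumptions) is to check the interpreter version using only syntax valid in both Python 2 and 3, before any Python-3-only code runs:

```python
import sys

def check_python_version(version_info):
    """Return an error message if the interpreter is too old, else None.

    Using only Python-2-compatible syntax here is the point: a py3-only
    construct (like print(..., file=sys.stderr)) raises SyntaxError before
    any version check can run.
    """
    if version_info < (3, 6):
        return ("PySpark requires Python 3.6 or newer; "
                "you are running %d.%d." % version_info[:2])
    return None

# Example: simulate a Python 2.7 interpreter.
msg = check_python_version((2, 7))
if msg is not None:
    sys.stderr.write(msg + "\n")
```

The check would sit at the very top of {{setup.py}} (guarding on {{sys.version_info}}) so that {{pip install}} under Python 2 fails with a readable message instead of a syntax error.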
[jira] [Updated] (SPARK-34193) Potential race condition during decommissioning with TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-34193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-34193: - Issue Type: Bug (was: Improvement) > Potential race condition during decommissioning with TorrentBroadcast > - > > Key: SPARK-34193 > URL: https://issues.apache.org/jira/browse/SPARK-34193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2 >Reporter: Holden Karau >Priority: Major > > I found this while back porting so the line numbers should be ignored, but > the core of the issue is that we shouldn't be failing the job on this (I > don't think). We could fix this by allowing broadcast blocks to be put or > having the torrent broadcast ignore this exception. > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: > org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: > Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 > (TID 8, 192.168.1.57, executor 1): java.io.IOException: > org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: > Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at > org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)[info] at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:215)[info] > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)[info] > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)[info] > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)[info] > at 
org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)[info] at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)[info] at > org.apache.spark.scheduler.Task.run(Task.scala:123)[info] at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:448)[info] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)[info] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)[info] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[info] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[info] > at java.lang.Thread.run(Thread.java:748)[info] Caused by: > org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: > Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1105)[info] at > org.apache.spark.storage.BlockManager.doPutBytes(BlockManager.scala:1010)[info] > at > org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:986)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:181)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info] > at scala.collection.immutable.List.foreach(List.scala:392)[info] at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:159)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:239)[info] > at scala.Option.getOrElse(Option.scala:121)[info] at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:219)[info] > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)[info] ... > 13 more[info][info] Driver stacktrace:[info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1928)[info] > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1916)[info] > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1915)[info] > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)[info] > at >
[jira] [Updated] (SPARK-34195) Base value for parsing two-digit year should be made configurable
[ https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Smart updated SPARK-34195: -- Description: The base value is set to 2000 within Spark for parsing a two-digit year in a date string. If we try to parse "10-JAN-97", it will be interpreted as 2097 instead of 1997. I'm unclear why this base value was changed within Spark, as the standard Python datetime module uses a more sensible value of 69 as the boundary cut-off for determining the correct century to apply. Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] Other libraries, e.g. .NET Core, use 29 as the boundary cut-off. A base value of 2000 is rather impractical: most dates encountered in the real world span both centuries, so I propose this behaviour be reverted to match the Python datetime module and/or that the base value be exposed as an option on the various date functions. This would ensure consistent behaviour across Python and PySpark.

Python:
{code:python}
import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}
PySpark:
{code:python}
import pyspark.sql.functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-")).collect()
# Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
{code}
As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows the issue I had to deal with just to change the boundary cut-off to better handle two-digit years.
{code:python}
# NOTE: assumes `import pyspark.sql.functions as F` and a DataFrame `std_df` are in scope.
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)
                            ).otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)
                            )
                        )
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
was: The base value is set as 2000 within spark for parsing a two-digit year date string. If we try to parse "10-JAN-97" then this will be interpreted as 2097 instead of 1997. I'm unclear as to why this base value has been changed within spark as the standard python datetime module instead uses a more sensible value of 69 as the boundary cut-off for determining the correct century to apply. Reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html Other libraries e.g. .NET Core will use 29 as the boundary cut-off. But a base value of 2000 is rather non-functional indeed. Most dates encountered in the real world will pertain to both centuries and therefore I propose this functionality is reverted to match the existing python datetime module and / or allow the base value to be set as an option to the various date functions. This would ensure there's consistent behaviour across both python and pyspark.
Python:
{code:python}
import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}
PySpark:
{code:python}
import pyspark.sql.functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-yy")).collect()
# Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
{code}
As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows the issue I had to deal with just to change the boundary cut-off to better handle two-digit years.
{code:python}
# NOTE: assumes `import pyspark.sql.functions as F` and a DataFrame `std_df` are in scope.
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)
                            ).otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)
                            )
                        )
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
> Base value for parsing two-digit
[jira] [Created] (SPARK-34195) Base value for parsing two-digit year should be made configurable
Anthony Smart created SPARK-34195: - Summary: Base value for parsing two-digit year should be made configurable Key: SPARK-34195 URL: https://issues.apache.org/jira/browse/SPARK-34195 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.1 Reporter: Anthony Smart The base value is set to 2000 within Spark for parsing a two-digit year in a date string. If we try to parse "10-JAN-97", it will be interpreted as 2097 instead of 1997. I'm unclear why this base value was changed within Spark, as the standard Python datetime module uses a more sensible value of 69 as the boundary cut-off for determining the correct century to apply. Reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html Other libraries, e.g. .NET Core, use 29 as the boundary cut-off. A base value of 2000 is rather impractical: most dates encountered in the real world span both centuries, so I propose this behaviour be reverted to match the Python datetime module and/or that the base value be exposed as an option on the various date functions. This would ensure consistent behaviour across Python and PySpark.

Python:
{code:python}
import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}
PySpark:
{code:python}
import pyspark.sql.functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-yy")).collect()
# Out[117]: [Row(dt='10-JAN-70', newdate='10-01-2069')]
{code}
As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows the issue I had to deal with just to change the boundary cut-off to better handle two-digit years.
{code:python}
# NOTE: assumes `import pyspark.sql.functions as F` and a DataFrame `std_df` are in scope.
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= ("40"),
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)
                            ).otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)
                            )
                        )
                    )
                    cols.append(to_date(n, 'dd-MMM-').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
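The essence of the workaround above — choosing the century from a configurable two-digit boundary — is independent of Spark and can be sketched in plain Python (the helper name is illustrative, not from the reporter's pipeline):

```python
def expand_two_digit_year(yy, boundary=40):
    """Map a two-digit year to four digits.

    Two-digit years >= boundary are placed in the 1900s, the rest in the 2000s,
    mirroring the cut-off logic in the PySpark workaround above.
    """
    yy = int(yy)
    return 1900 + yy if yy >= boundary else 2000 + yy

# With the reporter's boundary of 40: '69' -> 1969, '21' -> 2021.
assert expand_two_digit_year('69') == 1969
assert expand_two_digit_year('21') == 2021
```

A boundary of 69 reproduces Python's {{%y}} behaviour; 29 reproduces the .NET Core default the reporter mentions.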
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269571#comment-17269571 ] Nicholas Chammas commented on SPARK-12890: -- I've created SPARK-34194 and fleshed out the description of the problem a bit. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Minor > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
Nicholas Chammas created SPARK-34194: Summary: Queries that only touch partition columns shouldn't scan through all files Key: SPARK-34194 URL: https://issues.apache.org/jira/browse/SPARK-34194 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Nicholas Chammas When querying only the partition columns of a partitioned table, it seems that Spark nonetheless scans through all files in the table, even though it doesn't need to. Here's an example:
{code:python}
>>> data = spark.read.option('mergeSchema', 'false').parquet('s3a://some/dataset')
[Stage 0:==>                (407 + 12) / 1158]
{code}
Note the 1158 tasks. This matches the number of partitions in the table, which is partitioned on a single field named {{file_date}}:
{code:sh}
$ aws s3 ls s3://some/dataset | head -n 3
    PRE file_date=2017-05-01/
    PRE file_date=2017-05-02/
    PRE file_date=2017-05-03/
$ aws s3 ls s3://some/dataset | wc -l
1158
{code}
The table itself has over 138K files, though:
{code:sh}
$ aws s3 ls --recursive --human --summarize s3://some/dataset
...
Total Objects: 138708
   Total Size: 3.7 TiB
{code}
Now let's try to query just the {{file_date}} field and see what Spark does.
{code:python}
>>> data.select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], output=[file_date#11])
+- *(1) ColumnarToRow
   +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> data.select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 2:>                  (179 + 12) / 41011]
{code}
Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the job progresses? I'm not sure.
What I do know is that this operation takes a long time (~20 min) running from my laptop, whereas listing the top-level {{file_date}} partitions via the AWS CLI takes a second or two. Spark appears to be going through all the files in the table, when it just needs to list the partitions captured in the S3 "directory" structure. The query is only touching {{file_date}}, after all. The current workaround for this performance problem / optimizer wastefulness is to [query the catalog directly|https://stackoverflow.com/a/65724151/877069]. It works, but it is a lot of extra work compared to the elegant query against {{file_date}} that users actually intend. Spark should somehow know when it is only querying partition fields and skip iterating through all the individual files in a table. Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
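The value the query needs is already encoded in the Hive-style partition directory names ({{file_date=2017-05-01/}}), so answering it only requires listing directories, not touching any of the 138K files. A plain-Python sketch of that idea (hypothetical helper, not Spark's actual partition-pruning code):

```python
def max_partition_value(partition_dirs, column):
    """Extract partition values from Hive-style directory names and return the max.

    partition_dirs: iterable of names like 'file_date=2017-05-01/'.
    column: the partition column name encoded in those directory names.
    """
    prefix = column + '='
    values = [d.rstrip('/')[len(prefix):]
              for d in partition_dirs
              if d.startswith(prefix)]
    return max(values) if values else None

# The three directory names from the `aws s3 ls` listing above:
dirs = ['file_date=2017-05-01/', 'file_date=2017-05-03/', 'file_date=2017-05-02/']
assert max_partition_value(dirs, 'file_date') == '2017-05-03'
```

This is essentially what the catalog-query workaround does; the ticket asks the optimizer to recognise partition-only queries and do it automatically.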
[jira] [Commented] (SPARK-34193) Potential race condition during decommissioning with TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-34193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269565#comment-17269565 ] Holden Karau commented on SPARK-34193: -- Note, so far I've only triggered this once and in the back porting, the "Affects Version" is currently a guess. > Potential race condition during decommissioning with TorrentBroadcast > - > > Key: SPARK-34193 > URL: https://issues.apache.org/jira/browse/SPARK-34193 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2 >Reporter: Holden Karau >Priority: Major > > I found this while back porting so the line numbers should be ignored, but > the core of the issue is that we shouldn't be failing the job on this (I > don't think). We could fix this by allowing broadcast blocks to be put or > having the torrent broadcast ignore this exception. > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: > org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: > Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 > (TID 8, 192.168.1.57, executor 1): java.io.IOException: > org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: > Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at > org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)[info] at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:215)[info] > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)[info] > at > 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)[info] > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)[info] > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)[info] at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)[info] at > org.apache.spark.scheduler.Task.run(Task.scala:123)[info] at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:448)[info] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)[info] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)[info] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[info] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[info] > at java.lang.Thread.run(Thread.java:748)[info] Caused by: > org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: > Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1105)[info] at > org.apache.spark.storage.BlockManager.doPutBytes(BlockManager.scala:1010)[info] > at > org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:986)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:181)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info] > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info] > at scala.collection.immutable.List.foreach(List.scala:392)[info] at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:159)[info] > at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:239)[info] > at scala.Option.getOrElse(Option.scala:121)[info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:219)[info] > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)[info] ... > 13 more[info][info] Driver stacktrace:[info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1928)[info] > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1916)[info] > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1915)[info] > at >
[jira] [Created] (SPARK-34193) Potential race condition during decommissioning with TorrentBroadcast
Holden Karau created SPARK-34193: Summary: Potential race condition during decommissioning with TorrentBroadcast Key: SPARK-34193 URL: https://issues.apache.org/jira/browse/SPARK-34193 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2 Reporter: Holden Karau I found this while back porting so the line numbers should be ignored, but the core of the issue is that we shouldn't be failing the job on this (I don't think). We could fix this by allowing broadcast blocks to be put or having the torrent broadcast ignore this exception. [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 192.168.1.57, executor 1): java.io.IOException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1333)[info] at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:215)[info] at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)[info] at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)[info] at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)[info] at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)[info] at org.apache.spark.scheduler.Task.run(Task.scala:123)[info] at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:448)[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:454)[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[info] at java.lang.Thread.run(Thread.java:748)[info] Caused by: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block broadcast_2_piece0 cannot be saved on decommissioned executor[info] at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1105)[info] at org.apache.spark.storage.BlockManager.doPutBytes(BlockManager.scala:1010)[info] at org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:986)[info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:181)[info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:159)[info] at scala.collection.immutable.List.foreach(List.scala:392)[info] at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:159)[info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:239)[info] at scala.Option.getOrElse(Option.scala:121)[info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:219)[info] at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)[info] ... 
13 more[info][info] Driver stacktrace:[info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1928)[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1916)[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1915)[info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)[info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1915)[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:951)[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:951)[info] at
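The report above suggests two remedies: allow broadcast-block puts on a decommissioning executor, or have TorrentBroadcast treat BlockSavedOnDecommissionedBlockManagerException as non-fatal. The following is a minimal Python sketch of the second option only — the names and structure are hypothetical illustrations, not Spark's actual Scala implementation:

```python
class BlockSavedOnDecommissionedError(Exception):
    """Stand-in for Spark's BlockSavedOnDecommissionedBlockManagerException."""


def read_blocks(block_ids, fetch, cache_locally):
    """Fetch broadcast pieces; treat a failed local re-cache on a
    decommissioning executor as non-fatal instead of failing the task."""
    pieces = []
    for block_id in block_ids:
        data = fetch(block_id)  # the piece is always needed to run the task
        try:
            # Best-effort: re-cache so peers can fetch from this executor.
            cache_locally(block_id, data)
        except BlockSavedOnDecommissionedError:
            # Executor is going away; skip caching but keep the task alive.
            pass
        pieces.append(data)
    return pieces
```

The key design point is that the local put is an optimization (torrent-style re-serving), so its failure on a decommissioning block manager need not abort the job.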
[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269404#comment-17269404 ] Apache Spark commented on SPARK-33813: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31283 > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
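The proposed fix lives in Scala (MsSqlServerDialect.getCatalystType returning a binary Catalyst type for the unrecognized codes). As a language-neutral illustration of that mapping logic only — not the actual dialect code — a minimal Python sketch, assuming -157/-158 are the driver's geometry/geography type codes as stated in the report:

```python
# JDBC type codes reported by the MS SQL driver for spatial columns
# (per the issue description; treated here as an assumption).
MSSQL_GEOMETRY, MSSQL_GEOGRAPHY = -157, -158


def get_catalyst_type(sql_type):
    """Sketch of a dialect-level override: spatial type codes that the
    generic JDBC mapping cannot resolve are mapped to a binary type."""
    if sql_type in (MSSQL_GEOMETRY, MSSQL_GEOGRAPHY):
        return "binary"  # VARBINARY payload -> Catalyst BinaryType
    return None          # defer to the generic JDBC type mapping
```

Returning `None` for everything else mirrors how a dialect override falls through to JdbcUtils' default mapping, so only the two spatial codes change behavior.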
[jira] [Assigned] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33813: Assignee: (was: Apache Spark) > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) 
> at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33813: Assignee: Apache Spark > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Assignee: Apache Spark >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269403#comment-17269403 ] Apache Spark commented on SPARK-33813: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31283 > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34191) udf type hint should allow decorator with named returnType
[ https://issues.apache.org/jira/browse/SPARK-34191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269364#comment-17269364 ] Apache Spark commented on SPARK-34191: -- User 'pgrz' has created a pull request for this issue: https://github.com/apache/spark/pull/31282 > udf type hint should allow decorator with named returnType > --- > > Key: SPARK-34191 > URL: https://issues.apache.org/jira/browse/SPARK-34191 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Maciej Szymkiewicz >Priority: Major > > At the moment annotations allow the following decorator patterns: > > {code:python} > @udf > def f(x): ... > @udf("string") # Or DataType instance > def f(x): ... > @udf(f="string") # Awkward but technically valid > def f(x): ... > {code} > We should also support > {code:python} > @udf(returnType="string") > def f(x): ... > {code}
[jira] [Assigned] (SPARK-34191) udf type hint should allow decorator with named returnType
[ https://issues.apache.org/jira/browse/SPARK-34191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34191: Assignee: Apache Spark > udf type hint should allow decorator with named returnType > --- > > Key: SPARK-34191 > URL: https://issues.apache.org/jira/browse/SPARK-34191 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > At the moment annotations allow the following decorator patterns: > > {code:python} > @udf > def f(x): ... > @udf("string") # Or DataType instance > def f(x): ... > @udf(f="string") # Awkward but technically valid > def f(x): ... > {code} > We should also support > {code:python} > @udf(returnType="string") > def f(x): ... > {code}
[jira] [Assigned] (SPARK-34191) udf type hint should allow decorator with named returnType
[ https://issues.apache.org/jira/browse/SPARK-34191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34191: Assignee: (was: Apache Spark) > udf type hint should allow decorator with named returnType > --- > > Key: SPARK-34191 > URL: https://issues.apache.org/jira/browse/SPARK-34191 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Maciej Szymkiewicz >Priority: Major > > At the moment annotations allow the following decorator patterns: > > {code:python} > @udf > def f(x): ... > @udf("string") # Or DataType instance > def f(x): ... > @udf(f="string") # Awkward but technically valid > def f(x): ... > {code} > We should also support > {code:python} > @udf(returnType="string") > def f(x): ... > {code}
[jira] [Assigned] (SPARK-34192) Move char padding to write side
[ https://issues.apache.org/jira/browse/SPARK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34192: Assignee: (was: Apache Spark) > Move char padding to write side > --- > > Key: SPARK-34192 > URL: https://issues.apache.org/jira/browse/SPARK-34192 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > > On the read side, the char length check and padding bring issues to CBO and > PPD and other issues to the catalyst. > It's more reasonable to do it on the write side, as Spark doesn't take fully > control of the storage layer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34192) Move char padding to write side
[ https://issues.apache.org/jira/browse/SPARK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269335#comment-17269335 ] Apache Spark commented on SPARK-34192: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31281 > Move char padding to write side > --- > > Key: SPARK-34192 > URL: https://issues.apache.org/jira/browse/SPARK-34192 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > > On the read side, the char length check and padding bring issues to CBO and > PPD and other issues to the catalyst. > It's more reasonable to do it on the write side, as Spark doesn't take fully > control of the storage layer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34192) Move char padding to write side
[ https://issues.apache.org/jira/browse/SPARK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34192: Assignee: Apache Spark > Move char padding to write side > --- > > Key: SPARK-34192 > URL: https://issues.apache.org/jira/browse/SPARK-34192 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > On the read side, the char length check and padding bring issues to CBO and > PPD and other issues to the catalyst. > It's more reasonable to do it on the write side, as Spark doesn't take fully > control of the storage layer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34192) Move char padding to write side
Kent Yao created SPARK-34192: Summary: Move char padding to write side Key: SPARK-34192 URL: https://issues.apache.org/jira/browse/SPARK-34192 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao On the read side, the char length check and padding bring issues to CBO and PPD and other issues to Catalyst. It's more reasonable to do it on the write side, as Spark doesn't take full control of the storage layer.
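Moving the padding to the write side means a CHAR(n) value is padded to its declared length once, before storage, so reads return it as-is with no per-row padding expression that interferes with CBO statistics or predicate pushdown. A minimal sketch of the write-side semantics (illustrative only, not Spark's implementation):

```python
def pad_char_on_write(value: str, n: int) -> str:
    """Enforce CHAR(n) semantics at write time: reject over-length
    values and right-pad short ones with spaces to exactly n chars."""
    if len(value) > n:
        raise ValueError(f"value exceeds CHAR({n}) length")
    return value.ljust(n)  # stored form already has the declared length
```

With this, a read-side filter like `col = 'ab   '` compares against the stored, already-padded bytes directly, which is what makes pushdown safe.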
[jira] [Resolved] (SPARK-34094) Extends StringTranslate to support unicode characters whose code point >= U+10000
[ https://issues.apache.org/jira/browse/SPARK-34094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34094. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31164 [https://github.com/apache/spark/pull/31164] > Extends StringTranslate to support unicode characters whose code point >= > U+10000 > - > > Key: SPARK-34094 > URL: https://issues.apache.org/jira/browse/SPARK-34094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > Currently, StringTranslate works with only unicode characters whose code > point < U+10000 so let's extend it to support code points >= U+10000.
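The reason U+10000 is the cutoff: code points at or above it lie outside the Basic Multilingual Plane and occupy two UTF-16 code units (a surrogate pair), so a translate routine that maps one 16-bit char at a time mishandles them. A short Python sketch of a code-point-aware translate (illustrative; Python strings are code-point indexed, unlike Java's UTF-16 chars):

```python
def translate(s: str, src: str, dst: str) -> str:
    """Build the mapping over full code points, so astral characters
    (code point >= U+10000) are matched and replaced as single units."""
    table = {ord(f): ord(t) for f, t in zip(src, dst)}
    return s.translate(table)

# '\U0001D518' (MATHEMATICAL FRAKTUR CAPITAL U) is one code point but
# two UTF-16 code units; a per-16-bit-char mapping would split it.
```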
[jira] [Created] (SPARK-34191) udf type hint should allow decorator with named returnType
Maciej Szymkiewicz created SPARK-34191: -- Summary: udf type hint should allow decorator with named returnType Key: SPARK-34191 URL: https://issues.apache.org/jira/browse/SPARK-34191 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0, 3.2.0, 3.1.1 Reporter: Maciej Szymkiewicz At the moment annotations allow the following decorator patterns: {code:python} @udf def f(x): ... @udf("string") # Or DataType instance def f(x): ... @udf(f="string") # Awkward but technically valid def f(x): ... {code} We should also support {code:python} @udf(returnType="string") def f(x): ... {code}
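All three accepted forms fall out of one decorator-factory shape. The sketch below is not PySpark's implementation (the real `udf` returns a `UserDefinedFunction`; here we merely tag the function), but it shows why `@udf("string")` and `@udf(f="string")` both work today and what accepting `returnType=` by keyword requires:

```python
def udf(f=None, returnType="string"):
    """Sketch of a udf-like decorator factory accepting:
    @udf, @udf("type"), @udf(f="type"), and @udf(returnType="type")."""
    if callable(f):
        # Bare @udf: f is the decorated function itself.
        f.returnType = returnType
        return f
    if f is not None:
        # @udf("type") / @udf(f="type"): the first argument is really
        # the return type, which is why the f= spelling is "awkward".
        returnType = f

    def wrap(func):
        func.returnType = returnType
        return func

    return wrap
```

The type-hint fix in the issue is about declaring this overload set so that the `returnType="string"` keyword form also type-checks.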
[jira] [Updated] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34190: - Parent: SPARK-31851 Issue Type: Sub-task (was: Documentation) > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.1.2 > > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34190: Assignee: Haejoon Lee > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34190. -- Fix Version/s: 3.1.2 Resolution: Fixed Issue resolved by pull request 31280 [https://github.com/apache/spark/pull/31280] > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.1.2 > > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34190: - Affects Version/s: (was: 3.0.1) 3.1.0 > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34138) Keep dependants cached while refreshing v1 tables
[ https://issues.apache.org/jira/browse/SPARK-34138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34138: --- Assignee: Maxim Gekk > Keep dependants cached while refreshing v1 tables > - > > Key: SPARK-34138 > URL: https://issues.apache.org/jira/browse/SPARK-34138 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Keeping dependants cached while refreshing v1 tables should allow to improve > user experience with table/view caching. For example, let's imagine that an > user has cached v1 table and cached view based on the table. And the user > passed the table to external library which drops/renames/adds partitions in > the v1 table. Unfortunately, the user gets the view uncached after that even > he/she hasn't uncached the view explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34138) Keep dependants cached while refreshing v1 tables
[ https://issues.apache.org/jira/browse/SPARK-34138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34138. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31206 [https://github.com/apache/spark/pull/31206] > Keep dependants cached while refreshing v1 tables > - > > Key: SPARK-34138 > URL: https://issues.apache.org/jira/browse/SPARK-34138 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Keeping dependants cached while refreshing v1 tables should allow to improve > user experience with table/view caching. For example, let's imagine that an > user has cached v1 table and cached view based on the table. And the user > passed the table to external library which drops/renames/adds partitions in > the v1 table. Unfortunately, the user gets the view uncached after that even > he/she hasn't uncached the view explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269271#comment-17269271 ] Kousuke Saruta commented on SPARK-33813: [~cloud_fan] O.K, I'll try it. > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
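The shape of the proposed fix can be sketched outside Spark. This is an illustrative pure-Python stand-in for the mapping that `MsSqlServerDialect.getCatalystType` would need, not the actual Scala implementation; the assignment of -157 to `geometry` and -158 to `geography` follows the type codes the MS SQL JDBC driver is reported to use, as described in the ticket.

```python
# Illustrative sketch (not Spark's Scala code) of the dialect-level mapping:
# the MS SQL JDBC driver reports spatial columns with vendor-specific type
# codes (-157 and -158 per the ticket), and the proposed fix maps both to a
# binary type so the values surface as raw bytes (VARBINARY).

GEOMETRY = -157   # type code reported for the T-SQL `geometry` type (assumed)
GEOGRAPHY = -158  # type code reported for the T-SQL `geography` type (assumed)

def get_catalyst_type(sql_type: int):
    """Return the Catalyst type name for an MS SQL-specific JDBC type code,
    or None to fall back to the generic JDBC mappings."""
    if sql_type in (GEOMETRY, GEOGRAPHY):
        return "BinaryType"
    return None  # unrecognized here: let the common JdbcUtils mapping decide

print(get_catalyst_type(-157))  # BinaryType
print(get_catalyst_type(12))    # None
```

Without such a dialect-level entry, the lookup falls through to `JdbcUtils.getCatalystType`, which throws the `Unrecognized SQL type -157` seen in the stack trace above.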
[jira] [Commented] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269190#comment-17269190 ] Apache Spark commented on SPARK-34190: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/31280 > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
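The symlink behavior the ticket describes can be demonstrated without Spark or virtualenv at all. The following minimal sketch uses a throwaway file as a stand-in for the local Python interpreter; the paths are invented for illustration.

```python
# Stand-alone illustration of the symlink problem described above: an
# archived symlink preserves only the path it points at, so that target must
# exist at the same path on whichever machine the archive is unpacked on --
# which is exactly why a packed virtualenv still needs a local Python
# installed at the same path on every node.
import os
import tempfile

workdir = os.path.realpath(tempfile.mkdtemp())
interpreter = os.path.join(workdir, "usr-bin-python3")  # stand-in for /usr/bin/python3
open(interpreter, "w").close()

link = os.path.join(workdir, "venv-python")  # stand-in for venv/bin/python
os.symlink(interpreter, link)

# The link itself carries no interpreter; it merely records a path.
print(os.path.islink(link))                   # True
print(os.path.realpath(link) == interpreter)  # True
```

If the target path does not exist on a worker node, following the unpacked link fails there, matching the behavior the documentation change is meant to explain.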
[jira] [Assigned] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34190: Assignee: Apache Spark > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34190: Assignee: (was: Apache Spark) > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description for Python Package Management
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34190: Summary: Supplement the description for Python Package Management (was: Supplement the description in the document) > Supplement the description for Python Package Management > > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description in the document
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34190: Description: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not working if there is no Python installed on the node. Because the Python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. was: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not working if there is no Python installed for the all nodes in cluster. Because the Python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. > Supplement the description in the document > -- > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed on the node. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description in the document
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34190: Description: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not work if there is no Python installed for the all nodes in cluster. Because the Python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. was: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not work if there is no Python installed for the all nodes in cluster. The python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. > Supplement the description in the document > -- > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > work if there is no Python installed for the all nodes in cluster. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description in the document
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34190: Description: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not work if there is no Python installed for the all nodes in cluster. The python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. was: There is inconsistent explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not true. The python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. > Supplement the description in the document > -- > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > work if there is no Python installed for the all nodes in cluster. > The python in the packed environment has a symbolic link that connects Python > to the local one, so Python must exist in the same path on all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description in the document
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34190: Description: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not working if there is no Python installed for the all nodes in cluster. Because the Python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. was: There is lack of explanation for "Using Virtualenv" chapter. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but it's not work if there is no Python installed for the all nodes in cluster. Because the Python in the packed environment has a symbolic link that connects Python to the local one, so Python must exist in the same path on all nodes. > Supplement the description in the document > -- > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is lack of explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > working if there is no Python installed for the all nodes in cluster. > Because the Python in the packed environment has a symbolic link that > connects Python to the local one, so Python must exist in the same path on > all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34190) Supplement the description in the document
[ https://issues.apache.org/jira/browse/SPARK-34190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-34190: Summary: Supplement the description in the document (was: Fix inconsistent docs in Python Package Management) > Supplement the description in the document > -- > > Key: SPARK-34190 > URL: https://issues.apache.org/jira/browse/SPARK-34190 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Haejoon Lee >Priority: Major > > There is inconsistent explanation for "Using Virtualenv" chapter. > It says "It packs the current virtual environment to an archive file, and It > self-contains both Python interpreter and the dependencies", but it's not > true. > The python in the packed environment has a symbolic link that connects Python > to the local one, so Python must exist in the same path on all nodes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269182#comment-17269182 ] Attila Zsolt Piros edited comment on SPARK-34167 at 1/21/21, 9:49 AM: -- [~razajafri] could you please share with us how the parquet files are created? I tried to reproduce this issue in the following way but I had no luck: {noformat} Spark context Web UI available at http://192.168.0.17:4045 Spark context available as 'sc' (master = local, app id = local-1611221568779). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> import java.math.BigDecimal import java.math.BigDecimal scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType} import org.apache.spark.sql.types.{DecimalType, StructField, StructType} scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true))) schema: org.apache.spark.sql.types.StructType = StructType(StructField(num,DecimalType(8,2),true)) scala> val rdd = sc.parallelize((0 to 9).map(v => new BigDecimal(s"123456.7$v"))) rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] at parallelize at :27 scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema) df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)] scala> df.show() +-+ | num| +-+ |123456.70| |123456.71| |123456.72| |123456.73| |123456.74| |123456.75| |123456.76| |123456.77| |123456.78| |123456.79| +-+ scala> df.write.parquet("num.parquet") scala> spark.read.parquet("num.parquet").show() +-+ | num| +-+ |123456.70| |123456.71| |123456.72| |123456.73| |123456.74| |123456.75| |123456.76| |123456.77| |123456.78| |123456.79| +-+ {noformat} was (Author: attilapiros): 
[~razajafri] could you share with us how the parquet files are created? I tried to reproduce this issue in the following way but I had no luck: {noformat} Spark context Web UI available at http://192.168.0.17:4045 Spark context available as 'sc' (master = local, app id = local-1611221568779). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> import java.math.BigDecimal import java.math.BigDecimal scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType} import org.apache.spark.sql.types.{DecimalType, StructField, StructType} scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true))) schema: org.apache.spark.sql.types.StructType = StructType(StructField(num,DecimalType(8,2),true)) scala> val rdd = sc.parallelize((0 to 9).map(v => new BigDecimal(s"123456.7$v"))) rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] at parallelize at :27 scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema) df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)] scala> df.show() +-+ | num| +-+ |123456.70| |123456.71| |123456.72| |123456.73| |123456.74| |123456.75| |123456.76| |123456.77| |123456.78| |123456.79| +-+ scala> df.write.parquet("num.parquet") scala> spark.read.parquet("num.parquet").show() +-+ | num| +-+ |123456.70| |123456.71| |123456.72| |123456.73| |123456.74| |123456.75| |123456.76| |123456.77| |123456.78| |123456.79| +-+ {noformat} > Reading parquet with Decimal(8,2) written as a Decimal64 blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri 
>Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, > part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala>
[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269182#comment-17269182 ] Attila Zsolt Piros commented on SPARK-34167: [~razajafri] could you share with us how the parquet files are created? I tried to reproduce this issue in the following way but I had no luck: {noformat} Spark context Web UI available at http://192.168.0.17:4045 Spark context available as 'sc' (master = local, app id = local-1611221568779). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> import java.math.BigDecimal import java.math.BigDecimal scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType} import org.apache.spark.sql.types.{DecimalType, StructField, StructType} scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true))) schema: org.apache.spark.sql.types.StructType = StructType(StructField(num,DecimalType(8,2),true)) scala> val rdd = sc.parallelize((0 to 9).map(v => new BigDecimal(s"123456.7$v"))) rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] at parallelize at :27 scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema) df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)] scala> df.show() +-+ | num| +-+ |123456.70| |123456.71| |123456.72| |123456.73| |123456.74| |123456.75| |123456.76| |123456.77| |123456.78| |123456.79| +-+ scala> df.write.parquet("num.parquet") scala> spark.read.parquet("num.parquet").show() +-+ | num| +-+ |123456.70| |123456.71| |123456.72| |123456.73| |123456.74| |123456.75| |123456.76| |123456.77| |123456.78| |123456.79| +-+ {noformat} > Reading parquet with Decimal(8,2) written as a Decimal64 
blows up > - > > Key: SPARK-34167 > URL: https://issues.apache.org/jira/browse/SPARK-34167 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.1 >Reporter: Raza Jafri >Priority: Major > Attachments: > part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, > part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet > > > When reading a parquet file written with Decimals with precision < 10 as a > 64-bit representation, Spark tries to read it as an INT and fails > > Steps to reproduce: > Read the attached file that has a single Decimal(8,2) column with 10 values > {code:java} > scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show > ... > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756) > at >
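The root cause described in the ticket is easier to see against the Parquet format's rules for DECIMAL. As I understand the Parquet format specification, a DECIMAL(p, s) column may be backed by INT32 when p <= 9, by INT64 when p <= 18, or by FIXED_LEN_BYTE_ARRAY/BINARY at any precision; the writer chooses. The sketch below encodes those rules in plain Python to show why a Decimal(8,2) column may legally arrive as INT64 ("Decimal64"), so a reader that infers INT32 from the precision alone breaks.

```python
# Sketch of which Parquet physical types may legally back a DECIMAL(p, s)
# column, per the Parquet format spec as I understand it. A writer may pick
# any of these, so a reader must dispatch on the file's actual physical type
# rather than inferring it from the declared precision.
def legal_physical_types(precision: int):
    types = ["FIXED_LEN_BYTE_ARRAY", "BINARY"]  # allowed at any precision
    if precision <= 18:
        types.append("INT64")
    if precision <= 9:
        types.append("INT32")
    return types

# Decimal(8, 2): INT64 is a legal encoding even though the precision would
# also fit INT32 -- assuming the opposite is the bug reported here.
print(legal_physical_types(8))
```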
[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV
[ https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269177#comment-17269177 ] Apache Spark commented on SPARK-33518: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31279 > Improve performance of ML ALS recommendForAll by GEMV > - > > Key: SPARK-33518 > URL: https://issues.apache.org/jira/browse/SPARK-33518 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.2.0 > > > There has been a lot of work on improving ALS's {{recommendForAll}}. > It may be further optimized by: > 1. using GEMV; > 2. using Guava's Ordering instead of BoundedPriorityQueue; > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
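The two ideas in this ticket can be sketched in pure Python (the actual change lives in Spark ML's Scala code and would use native BLAS): scoring every item for one user is a matrix-vector product (GEMV), and the top-k step can use a selection routine instead of maintaining a BoundedPriorityQueue. `heapq.nlargest` plays the role of Guava's Ordering here; all names below are illustrative.

```python
# Pure-Python sketch, not Spark's implementation: score all items for one
# user as a matrix-vector product (one GEMV row per item), then select the
# top-k indices with heapq.nlargest instead of a bounded priority queue.
import heapq

def recommend_for_user(user_factors, item_factors, k):
    # GEMV: scores[i] = item_factors[i] . user_factors
    scores = [sum(u * x for u, x in zip(user_factors, item)) for item in item_factors]
    # Top-k item indices by score, highest first.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(recommend_for_user([2.0, 1.0], items, 2))  # [2, 0]
```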
[jira] [Created] (SPARK-34190) Fix inconsistent docs in Python Package Management
Haejoon Lee created SPARK-34190: --- Summary: Fix inconsistent docs in Python Package Management Key: SPARK-34190 URL: https://issues.apache.org/jira/browse/SPARK-34190 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.0.1 Reporter: Haejoon Lee The explanation in the "Using Virtualenv" chapter is inconsistent. It says "It packs the current virtual environment to an archive file, and It self-contains both Python interpreter and the dependencies", but that is not true: the Python in the packed environment is a symbolic link to the local interpreter, so Python must exist at the same path on all nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34188) Char varchar length check blocks CBO statistics
[ https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269136#comment-17269136 ] Apache Spark commented on SPARK-34188: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31278 > Char varchar length check blocks CBO statistics > --- > > Key: SPARK-34188 > URL: https://issues.apache.org/jira/browse/SPARK-34188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! > the char varchar length check changes the output by projection and the filter > push down through the projection with the new unaliased output, the CBO > estimation can not recognize -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
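The failure mode described above can be modeled with a toy, which is not Spark code: cost-based-optimizer column statistics are keyed by the original attribute, so once the length check wraps the column in a derived expression, a pushed-down filter no longer matches any collected stats and estimation falls back to a default. The expression name in the sketch is hypothetical.

```python
# Toy model (not Spark's CBO) of why wrapping a column in an extra
# expression blocks statistics-based estimation: stats are looked up by the
# original attribute, so a filter over a derived, unaliased expression
# misses them and the estimator falls back to a no-information default.
column_stats = {"c": {"distinct": 4}}  # stats collected for column `c`

def estimated_selectivity(filter_expr: str):
    stats = column_stats.get(filter_expr)
    if stats is None:
        return 1.0  # no usable stats: assume the filter removes nothing
    return 1.0 / stats["distinct"]

print(estimated_selectivity("c"))                     # 0.25 -- stats found
print(estimated_selectivity("char_length_check(c)"))  # 1.0  -- stats missed
```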
[jira] [Assigned] (SPARK-34188) Char varchar length check blocks CBO statistics
[ https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34188: Assignee: (was: Apache Spark) > Char varchar length check blocks CBO statistics > --- > > Key: SPARK-34188 > URL: https://issues.apache.org/jira/browse/SPARK-34188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! > the char varchar length check changes the output by projection and the filter > push down through the projection with the new unaliased output, the CBO > estimation can not recognize -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34188) Char varchar length check blocks CBO statistics
[ https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269133#comment-17269133 ] Apache Spark commented on SPARK-34188: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31278 > Char varchar length check blocks CBO statistics > --- > > Key: SPARK-34188 > URL: https://issues.apache.org/jira/browse/SPARK-34188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! > the char varchar length check changes the output by projection and the filter > push down through the projection with the new unaliased output, the CBO > estimation can not recognize -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34188) Char varchar length check blocks CBO statistics
[ https://issues.apache.org/jira/browse/SPARK-34188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34188: Assignee: Apache Spark > Char varchar length check blocks CBO statistics > --- > > Key: SPARK-34188 > URL: https://issues.apache.org/jira/browse/SPARK-34188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! > the char varchar length check changes the output by projection and the filter > push down through the projection with the new unaliased output, the CBO > estimation can not recognize -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33813) JDBC datasource fails when reading spatial datatypes with the MS SQL driver
[ https://issues.apache.org/jira/browse/SPARK-33813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269119#comment-17269119 ] Wenchen Fan commented on SPARK-33813: - [~sarutak] do you have time to look into this? thanks! > JDBC datasource fails when reading spatial datatypes with the MS SQL driver > --- > > Key: SPARK-33813 > URL: https://issues.apache.org/jira/browse/SPARK-33813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Michał Świtakowski >Priority: Major > > The MS SQL JDBC driver introduced support for spatial types since version > 7.0. The JDBC data source lacks mappings for these types which results in an > exception below. It seems that a mapping in > MsSqlServerDialect.getCatalystType that maps -157 and -158 typecode to > VARBINARY should address the issue. > > {noformat} > java.sql.SQLException: Unrecognized SQL type -157 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) > at > org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381){noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
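The proposed fix can be modeled as a standalone sketch. The real change belongs in MsSqlServerDialect.getCatalystType, whose Spark signature also takes the type name, column size, and a MetadataBuilder; the object and method below are simplified, hypothetical illustrations of just the typecode dispatch.

```scala
// Hypothetical standalone model of the proposed MsSqlServerDialect change:
// map the MS SQL spatial typecodes reported by the driver (geometry = -157,
// geography = -158) to Spark's BinaryType, i.e. treat them as VARBINARY.
// Names here are illustrative, not Spark's actual API.
object MsSqlSpatialMappingSketch {
  val GeometryTypeCode: Int = -157
  val GeographyTypeCode: Int = -158

  // Returns Some(catalyst type name) when the dialect handles the code,
  // and None to fall back to the generic JdbcUtils mapping -- which is
  // what currently throws "Unrecognized SQL type -157".
  def catalystTypeFor(sqlType: Int): Option[String] = sqlType match {
    case GeometryTypeCode | GeographyTypeCode => Some("BinaryType")
    case _ => None
  }
}
```

With a mapping like this in the dialect, getSchema would resolve geometry/geography columns as binary data instead of failing in JdbcUtils.getCatalystType.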
[jira] [Commented] (SPARK-33901) Char and Varchar display error after DDLs
[ https://issues.apache.org/jira/browse/SPARK-33901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269117#comment-17269117 ] Apache Spark commented on SPARK-33901: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/31277 > Char and Varchar display error after DDLs > - > > Key: SPARK-33901 > URL: https://issues.apache.org/jira/browse/SPARK-33901 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > CTAS / CREATE TABLE LIKE / CVAS / ALTER TABLE ADD COLUMNS -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34189) w2v findSynonyms optimization
[ https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34189: Assignee: Apache Spark (was: zhengruifeng) > w2v findSynonyms optimization > - > > Key: SPARK-34189 > URL: https://issues.apache.org/jira/browse/SPARK-34189 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering > instead of BoundedPriorityQueue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34189) w2v findSynonyms optimization
[ https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34189: Assignee: zhengruifeng (was: Apache Spark) > w2v findSynonyms optimization > - > > Key: SPARK-34189 > URL: https://issues.apache.org/jira/browse/SPARK-34189 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering > instead of BoundedPriorityQueue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34189) w2v findSynonyms optimization
[ https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269115#comment-17269115 ] Apache Spark commented on SPARK-34189: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31276 > w2v findSynonyms optimization > - > > Key: SPARK-34189 > URL: https://issues.apache.org/jira/browse/SPARK-34189 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering > instead of BoundedPriorityQueue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34189) w2v findSynonyms optimization
[ https://issues.apache.org/jira/browse/SPARK-34189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-34189: Assignee: zhengruifeng > w2v findSynonyms optimization > - > > Key: SPARK-34189 > URL: https://issues.apache.org/jira/browse/SPARK-34189 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering > instead of BoundedPriorityQueue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34189) w2v findSynonyms optimization
zhengruifeng created SPARK-34189: Summary: w2v findSynonyms optimization Key: SPARK-34189 URL: https://issues.apache.org/jira/browse/SPARK-34189 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.2.0 Reporter: zhengruifeng {{findSynonyms}} in w2v could be further optimized by using Guava's Ordering instead of BoundedPriorityQueue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
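The optimization SPARK-34189 describes is about top-k selection: findSynonyms keeps the k highest-similarity (word, score) pairs, and the ticket proposes replacing Spark's private BoundedPriorityQueue with Guava's Ordering (e.g. Ordering.greatestOf, a single pass with O(k) extra space). The sketch below models the bounded-queue side in plain Scala, using a size-capped min-heap as a stand-in; the object and method names are illustrative, not Spark's API.

```scala
import scala.collection.mutable

// Standalone sketch of the top-k selection at the heart of findSynonyms.
// A min-heap holding at most k elements: each candidate either fits
// (heap not full) or evicts the current minimum if it scores higher.
// This mimics BoundedPriorityQueue; the ticket proposes Guava's Ordering
// as a faster replacement for this step.
object TopKSketch {
  def topK(scores: Seq[(String, Double)], k: Int): Seq[(String, Double)] = {
    // Negate the score so the queue's head is the *smallest* score kept.
    val minHeap =
      mutable.PriorityQueue.empty[(String, Double)](Ordering.by(p => -p._2))
    scores.foreach { p =>
      if (minHeap.size < k) minHeap.enqueue(p)
      else if (p._2 > minHeap.head._2) { minHeap.dequeue(); minHeap.enqueue(p) }
    }
    // Return the survivors in descending-similarity order.
    minHeap.toSeq.sortBy(-_._2)
  }
}
```

Both approaches avoid fully sorting all candidates; the claimed win for the Ordering-based selection is lower per-element overhead in the common case where k is much smaller than the vocabulary.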