[jira] [Assigned] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API
[ https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37202: Assignee: (was: Apache Spark)

> Temp view didn't collect temp function that registered with catalog API
> Key: SPARK-37202
> URL: https://issues.apache.org/jira/browse/SPARK-37202
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Linhong Liu
> Priority: Major

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API
[ https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37202: Assignee: Apache Spark
[jira] [Commented] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API
[ https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437730#comment-17437730 ] Apache Spark commented on SPARK-37202: User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34473
[jira] [Created] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API
Linhong Liu created SPARK-37202:

Summary: Temp view didn't collect temp function that registered with catalog API
Key: SPARK-37202
URL: https://issues.apache.org/jira/browse/SPARK-37202
Project: Spark
Issue Type: Task
Components: SQL
Affects Versions: 3.2.0
Reporter: Linhong Liu
[jira] [Commented] (SPARK-24774) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437714#comment-17437714 ] Apache Spark commented on SPARK-24774: User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34472

> support reading AVRO logical types - Decimal
> Key: SPARK-24774
> URL: https://issues.apache.org/jira/browse/SPARK-24774
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 2.4.0
[jira] [Commented] (SPARK-24774) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437713#comment-17437713 ] Apache Spark commented on SPARK-24774: User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34472
[jira] [Resolved] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37191. Resolution: Fixed. Issue resolved by pull request 34462 [https://github.com/apache/spark/pull/34462]

> Allow merging DecimalTypes with different precision values
> Key: SPARK-37191
> URL: https://issues.apache.org/jira/browse/SPARK-37191
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0
> Reporter: Ivan
> Assignee: Ivan
> Priority: Major
> Fix For: 3.3.0
>
> When merging DecimalTypes with different precision but the same scale, one would get the following error:
> {code:java}
> Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 17 and 12
>   at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652)
>   at scala.Option.map(Option.scala:230)
>   at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644)
>   at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641)
>   at org.apache.spark.sql.types.StructType.merge(StructType.scala:550)
> {code}
> We could allow merging DecimalType values with different precision when the scale is the same for both types, since there should not be any data correctness issues: one of the types is simply widened, for example DECIMAL(12, 2) -> DECIMAL(17, 2). This is not the case when the scales differ - whether upcasting is safe would then depend on the actual values.
>
> Repro code:
> {code:java}
> import org.apache.spark.sql.types._
> val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil)
> val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil)
> schema1.merge(schema2)
> {code}
> This also affects Parquet schema merge, which is where this issue was originally discovered:
> {code:java}
> import java.math.BigDecimal
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1)
> val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil)
> val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1)
> val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil)
> spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet")
> spark.createDataFrame(data1, schema1).write.mode("append").parquet("/tmp/decimal-test.parquet")
> // Reading the DataFrame fails
> spark.read.option("mergeSchema", "true").parquet("/tmp/decimal-test.parquet").show()
> >>>
> Failed merging schema:
> root
>  |-- col: decimal(17,2) (nullable = true)
> Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 12 and 17
> {code}
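The merge rule the ticket proposes (widen precision when the scales match, reject otherwise) can be sketched in a few lines of pure Python. This is an illustration of the rule only; `merge_decimal` is a made-up name, not Spark's actual `StructType.merge` code:

```python
# Hedged sketch of the proposed merge rule for decimal types, modeled as
# (precision, scale) pairs. Not Spark's API; names here are illustrative.

def merge_decimal(a, b):
    """Merge two (precision, scale) decimal types per the proposed rule."""
    (p1, s1), (p2, s2) = a, b
    if s1 != s2:
        # Different scales would require value-dependent rescaling,
        # which can lose data, so the merge is rejected.
        raise ValueError(
            f"Failed to merge decimal types with incompatible scale {s1} and {s2}")
    # Same scale: widening to the larger precision is always lossless,
    # e.g. DECIMAL(12, 2) fits inside DECIMAL(17, 2).
    return (max(p1, p2), s1)

print(merge_decimal((17, 2), (12, 2)))  # (17, 2)
```

With this rule, the repro above would produce a merged field of DECIMAL(17, 2) instead of failing.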
[jira] [Assigned] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37191: Assignee: Ivan
[jira] [Resolved] (SPARK-32567) Code-gen for full outer shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32567. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 3 [https://github.com/apache/spark/pull/3]

> Code-gen for full outer shuffled hash join
> Key: SPARK-32567
> URL: https://issues.apache.org/jira/browse/SPARK-32567
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Cheng Su
> Assignee: Cheng Su
> Priority: Minor
> Fix For: 3.3.0
>
> As a followup to [https://github.com/apache/spark/pull/29342] (non-codegen full outer shuffled hash join), this task is to add code-gen for it.
[jira] [Assigned] (SPARK-32567) Code-gen for full outer shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32567: Assignee: Cheng Su
[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)
[ https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Kotlov updated SPARK-37201: Summary: Spark SQL reads unnecessary nested fields (filter after explode) (was: Spark SQL reads unnecсessary nested fields (filter after explode))
[jira] [Updated] (SPARK-37201) Spark SQL reads unnecсessary nested fields (filter after explode)
[ https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Kotlov updated SPARK-37201: Summary: Spark SQL reads unnecсessary nested fields (filter after explode) (was: Spark SQL reads unnecessary nested fields (filter after explode))
[jira] [Created] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)
Sergey Kotlov created SPARK-37201:

Summary: Spark SQL reads unnecessary nested fields (filter after explode)
Key: SPARK-37201
URL: https://issues.apache.org/jira/browse/SPARK-37201
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.0
Reporter: Sergey Kotlov

In this example, reading unnecessary nested fields still happens.

Data preparation:

{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}

The v2 and v3 columns aren't needed here, but still exist in the physical plan:

{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)

== Physical Plan ==
... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
{code}

If you just remove _filter_ or move _explode_ to a second _select_, everything is fine:

{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)

// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)

// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
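The pruning the reporter expects can be sketched generically: given a full schema and the leaf paths a query actually touches, keep only those leaves. The nested-dict schema encoding and the `prune` helper are assumptions for this illustration, not Spark's schema-pruning implementation:

```python
# Illustrative sketch of nested-field pruning. Schemas are modeled as nested
# dicts; `required_paths` lists the field paths the query reads.

def prune(schema, required_paths):
    """Keep only the fields of `schema` reachable from `required_paths`."""
    out = {}
    for path in required_paths:
        node_in, node_out = schema, out
        for part in path[:-1]:
            node_in = node_in[part]
            node_out = node_out.setdefault(part, {})
        node_out[path[-1]] = node_in[path[-1]]
    return out

full = {"struct": {"v1": "string", "v2": "string", "v3": "string"},
        "array": "array<string>"}
# The query reads struct.v1 and explodes array, so only those are required:
print(prune(full, [("struct", "v1"), ("array",)]))
# {'struct': {'v1': 'string'}, 'array': 'array<string>'}
```

The bug report is that, with a filter after the explode, Spark's read schema behaves like `prune(full, all_paths)` instead of the pruned result above.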
[jira] [Resolved] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level
[ https://issues.apache.org/jira/browse/SPARK-36935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36935. Fix Version/s: 3.3.0 Assignee: Chao Sun Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/34199

> Enhance ParquetSchemaConverter to capture Parquet repetition & definition level
> Key: SPARK-36935
> URL: https://issues.apache.org/jira/browse/SPARK-36935
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.3.0
>
> In order to support complex types for the Parquet vectorized reader, we'll need to capture the repetition & definition level information associated with the Catalyst type converted from the Parquet {{MessageType}}.
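The repetition and definition levels mentioned in the description follow Parquet's Dremel-style encoding: walking from the root to a column, every optional or repeated ancestor raises the column's maximum definition level, and every repeated ancestor raises its maximum repetition level. A small sketch, with an assumed list-of-repetitions input rather than Spark's `ParquetSchemaConverter` API:

```python
# Hedged sketch of how max definition/repetition levels are derived for a
# Parquet column path. Input encoding is an assumption for the example.

def max_levels(path_repetitions):
    """path_repetitions: repetition of each field on the path, root-first,
    each one of 'required', 'optional', 'repeated'.
    Returns (max_definition_level, max_repetition_level)."""
    max_def = max_rep = 0
    for rep in path_repetitions:
        if rep in ("optional", "repeated"):
            max_def += 1  # this level can be null/absent
        if rep == "repeated":
            max_rep += 1  # this level starts a new list dimension
    return max_def, max_rep

# e.g. an optional struct containing a repeated leaf field:
print(max_levels(["optional", "repeated"]))  # (2, 1)
```

Capturing these two numbers per converted Catalyst field is what lets a vectorized reader reassemble nested records from flat column chunks.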
[jira] [Assigned] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36879: Assignee: (was: Apache Spark)

> Support Parquet v2 data page encodings for the vectorized path
> Key: SPARK-36879
> URL: https://issues.apache.org/jira/browse/SPARK-36879
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Chao Sun
> Priority: Major
>
> Currently Spark only supports Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise:
> {code}
> java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
> {code}
> It would be good to support the v2 encodings too, including DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY and DELTA_BYTE_ARRAY, as well as BYTE_STREAM_SPLIT, as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
[jira] [Commented] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437628#comment-17437628 ] Apache Spark commented on SPARK-36879: User 'parthchandra' has created a pull request for this issue: https://github.com/apache/spark/pull/34471
[jira] [Assigned] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36879: Assignee: Apache Spark
[jira] [Assigned] (SPARK-37200) Drop index support
[ https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37200: Assignee: (was: Apache Spark)

> Drop index support
> Key: SPARK-37200
> URL: https://issues.apache.org/jira/browse/SPARK-37200
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Huaxin Gao
> Priority: Major
[jira] [Commented] (SPARK-37200) Drop index support
[ https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437597#comment-17437597 ] Apache Spark commented on SPARK-37200: User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34469
[jira] [Assigned] (SPARK-37200) Drop index support
[ https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37200: Assignee: Apache Spark
[jira] [Created] (SPARK-37200) Drop index support
Huaxin Gao created SPARK-37200:

Summary: Drop index support
Key: SPARK-37200
URL: https://issues.apache.org/jira/browse/SPARK-37200
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao
[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437585#comment-17437585 ] mathieu longtin commented on SPARK-37185: It seems to try to optimize for a simple query, but not more complex queries. It kind of makes sense for "select * from t", but any WHERE clause can make it quite restrictive. It looks like it scans the first part, doesn't find enough data, then scans four parts, then decides to scan everything. This is nice, but meanwhile I have 20 workers already reserved; it wouldn't cost anything more to just go ahead right away. Timing, with the table not cached, containing 69 csv.gz files of anywhere from 1MB to 2.2GB of data:

{code:java}
In [1]: %time spark.sql("select * from t where x = 99").take(10)
CPU times: user 83.9 ms, sys: 112 ms, total: 196 ms
Wall time: 6min 44s
...

In [2]: %time spark.sql("select * from t where x = 99").limit(10).rdd.collect()
CPU times: user 45.7 ms, sys: 73.9 ms, total: 120 ms
Wall time: 3min 59s
...
{code}

I ran the two tests a few times to make sure there was no OS-level caching effect; the timing didn't change much. If I cache the table first, then "take(10)" is faster than "limit(10).rdd.collect()".

> DataFrame.take() only uses one worker
> Key: SPARK-37185
> URL: https://issues.apache.org/jira/browse/SPARK-37185
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1, 3.2.0
> Environment: CentOS 7
> Reporter: mathieu longtin
> Priority: Major
>
> Say you have this query:
> {code:java}
> >>> df = spark.sql("select * from mytable where x = 99")
> {code}
> Now, out of billions of rows, there are only ten rows where x is 99.
> If I do:
> {code:java}
> >>> df.limit(10).collect()
> [Stage 1:> (0 + 1) / 1]
> {code}
> it only uses one worker. This takes a really long time since one CPU is reading the billions of rows.
> However, if I do this:
> {code:java}
> >>> df.limit(10).rdd.collect()
> [Stage 1:> (0 + 10) / 22]
> {code}
> all the workers are running.
> I think there's some optimization issue in DataFrame.take(...).
> This did not use to be an issue, but I'm not sure if it was working with 3.0 or 2.4.
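The escalation the commenter describes (scan one partition, then four, then everything) can be simulated in plain Python. Spark exposes the multiplier as `spark.sql.limit.scaleUpFactor` (default 4); the `simulate_take` helper below is a made-up sketch of the behavior, not Spark's actual executeTake code, and the partition contents are invented for the demo:

```python
# Sketch of limit-scan escalation: scan 1 partition, then scale up the number
# of partitions scanned each round until the limit is satisfied.

def simulate_take(partitions, limit, scale_up_factor=4):
    """Return (rows, batch_sizes): rows collected, partitions scanned per round."""
    rows, batches, scanned, num_to_try = [], [], 0, 1
    while len(rows) < limit and scanned < len(partitions):
        batch = partitions[scanned:scanned + num_to_try]
        batches.append(len(batch))
        for part in batch:
            rows.extend(part)
        scanned += len(batch)
        num_to_try *= scale_up_factor
    return rows[:limit], batches

# 22 partitions where matching rows are rare, as in a selective WHERE clause:
parts = [[] for _ in range(22)]
parts[20] = ["x=99"] * 10  # the only partition with matching rows
rows, batches = simulate_take(parts, limit=10)
print(batches)  # [1, 4, 16] - rounds of 1, then 4, then 16 partitions
```

This is why a selective filter makes `take(10)` slow: most rounds scan partitions that contribute nothing, while `.rdd.collect()` on the same limit launches tasks on all partitions at once.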
[jira] [Commented] (SPARK-37199) Add a deterministic field to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437527#comment-17437527 ] Apache Spark commented on SPARK-37199: User 'somani' has created a pull request for this issue: https://github.com/apache/spark/pull/34470

> Add a deterministic field to QueryPlan
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Abhishek Somani
> Priority: Major
>
> We have a _deterministic_ field in [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] to check whether an expression is deterministic, but we do not have a similar field in [QueryPlan|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]. We sometimes need such a check on the QueryPlan itself, for example in [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]. This proposal is to add a _deterministic_ field to QueryPlan. More details [in this document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].
[jira] [Assigned] (SPARK-37199) Add a deterministic field to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37199: Assignee: Apache Spark
[jira] [Assigned] (SPARK-37199) Add a deterministic field to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37199: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-37199) Add a deterministic field to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Somani updated SPARK-37199: Description: We have a _deterministic_ field in [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] to check if an expression is deterministic, but we do not have a similar field in [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] We have a need for such a check in the QueryPlan sometimes, like in [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] This proposal is to add a _deterministic_ field to QueryPlan. More details[ in this document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#]. was: We have a _deterministic_ field in [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] to check if an expression is deterministic, but we do not have a similar field in [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] We have a need for such a check in the QueryPlan sometimes, like in [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] This proposal is to add a _deterministic_ field to QueryPlan. More details in this document. 
> Add a deterministic field to QueryPlan > -- > > Key: SPARK-37199 > URL: https://issues.apache.org/jira/browse/SPARK-37199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > We have a _deterministic_ field in > [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] > to check if an expression is deterministic, but we do not have a similar > field in > [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] > We have a need for such a check in the QueryPlan sometimes, like in > [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] > This proposal is to add a _deterministic_ field to QueryPlan. > More details[ in this > document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37197) PySpark pandas recent issues from chconnell
[ https://issues.apache.org/jira/browse/SPARK-37197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37197: -- Description: SPARK-37180 PySpark.pandas should support __version__ SPARK-37181 pyspark.pandas.read_csv() should support latin-1 encoding SPARK-37183 pyspark.pandas.DataFrame.map() should support .fillna() SPARK-37184 pyspark.pandas should support DF["column"].str.split("some_suffix").str[0] SPARK-37186 pyspark.pandas should support tseries.offsets SPARK-37187 pyspark.pandas fails to create a histogram of one column from a large DataFrame SPARK-37188 pyspark.pandas histogram accepts the title option but does not add a title to the plot SPARK-37189 pyspark.pandas histogram accepts the range option but does not use it SPARK-37198 pyspark.pandas read_csv() and to_csv() should handle local files was: SPARK-37180 PySpark.pandas should support __version__ SPARK-37181 pyspark.pandas.read_csv() should support latin-1 encoding SPARK-37183 pyspark.pandas.DataFrame.map() should support .fillna() SPARK-37184 pyspark.pandas should support DF["column"].str.split("some_suffix").str[0] SPARK-37186 pyspark.pandas should support tseries.offsets SPARK-37187 pyspark.pandas fails to create a histogram of one column from a large DataFrame SPARK-37188 pyspark.pandas histogram accepts the title option but does not add a title to the plot SPARK-37189 pyspark.pandas histogram accepts the range option but does not use it > PySpark pandas recent issues from chconnell > --- > > Key: SPARK-37197 > URL: https://issues.apache.org/jira/browse/SPARK-37197 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > SPARK-37180 PySpark.pandas should support __version__ > SPARK-37181 pyspark.pandas.read_csv() should support latin-1 encoding > > SPARK-37183 pyspark.pandas.DataFrame.map() should support .fillna() > > SPARK-37184 pyspark.pandas should support > 
DF["column"].str.split("some_suffix").str[0] > SPARK-37186 pyspark.pandas should support tseries.offsets > SPARK-37187 pyspark.pandas fails to create a histogram of one column from a > large DataFrame > SPARK-37188 pyspark.pandas histogram accepts the title option but does not > add a title to the plot > SPARK-37189 pyspark.pandas histogram accepts the range option but does not > use it > SPARK-37198 pyspark.pandas read_csv() and to_csv() should handle local files > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37199) Add a deterministic field to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437514#comment-17437514 ] Abhishek Somani commented on SPARK-37199: - Will add a PR soon. > Add a deterministic field to QueryPlan > -- > > Key: SPARK-37199 > URL: https://issues.apache.org/jira/browse/SPARK-37199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > We have a _deterministic_ field in > [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] > to check if an expression is deterministic, but we do not have a similar > field in > [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] > We have a need for such a check in the QueryPlan sometimes, like in > [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] > This proposal is to add a _deterministic_ field to QueryPlan. > More details [in this > document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37198) pyspark.pandas read_csv() and to_csv() should handle local files
Chuck Connell created SPARK-37198: - Summary: pyspark.pandas read_csv() and to_csv() should handle local files Key: SPARK-37198 URL: https://issues.apache.org/jira/browse/SPARK-37198 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell Pandas programmers who move their code to Spark would like to import and export text files to and from their local disk. I know there are technical hurdles to this (since Spark usually runs in a cluster that does not know where your local computer is), but it would really help code migration. For read_csv() and to_csv(), a syntax such as {{file://c:/Temp/my_file.csv}} (or something like it) should read from and write to the local disk on Windows, and similarly for Mac and Linux. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
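The path mapping the reporter asks for can be sketched as a small normalization helper. This is a hedged illustration of the requested behavior, not pyspark API: the function name and the decision to pass non-`file` URIs through unchanged are assumptions for the sketch.

```python
from urllib.parse import urlparse

def to_local_path(path: str) -> str:
    """Map a file:// URI (including the Windows drive form from the report,
    e.g. file://c:/Temp/my_file.csv) to a plain local filesystem path.
    Non-file URIs and bare paths are returned unchanged."""
    parsed = urlparse(path)
    if parsed.scheme != "file":
        return path
    # In file://c:/Temp/x.csv the drive letter parses as the netloc,
    # so rejoin it with the path component.
    if parsed.netloc:
        return f"{parsed.netloc}{parsed.path}"
    return parsed.path

# Windows drive form from the report, Unix form, and a cluster URI:
to_local_path("file://c:/Temp/my_file.csv")   # -> "c:/Temp/my_file.csv"
to_local_path("file:///home/user/data.csv")   # -> "/home/user/data.csv"
to_local_path("s3://bucket/data.csv")         # unchanged
```

A read_csv() wrapper could apply such a helper before deciding whether to read via the driver's local filesystem or via the cluster's distributed filesystem, which is exactly the routing decision the technical hurdle above refers to.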
[jira] [Created] (SPARK-37199) Add a deterministic field to QueryPlan
Abhishek Somani created SPARK-37199: --- Summary: Add a deterministic field to QueryPlan Key: SPARK-37199 URL: https://issues.apache.org/jira/browse/SPARK-37199 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Abhishek Somani We have a _deterministic_ field in [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] to check if an expression is deterministic, but we do not have a similar field in [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] We have a need for such a check in the QueryPlan sometimes, like in [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] This proposal is to add a _deterministic_ field to QueryPlan. More details in this document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37199) Add a deterministic field to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Somani updated SPARK-37199: Description: We have a _deterministic_ field in [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] to check if an expression is deterministic, but we do not have a similar field in [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] We have a need for such a check in the QueryPlan sometimes, like in [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] This proposal is to add a _deterministic_ field to QueryPlan. More details [in this document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#]. was: We have a _deterministic_ field in [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] to check if an expression is deterministic, but we do not have a similar field in [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] We have a need for such a check in the QueryPlan sometimes, like in [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] This proposal is to add a _deterministic_ field to QueryPlan. More details[ in this document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#]. 
> Add a deterministic field to QueryPlan > -- > > Key: SPARK-37199 > URL: https://issues.apache.org/jira/browse/SPARK-37199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > We have a _deterministic_ field in > [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115] > to check if an expression is deterministic, but we do not have a similar > field in > [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44] > We have a need for such a check in the QueryPlan sometimes, like in > [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56] > This proposal is to add a _deterministic_ field to QueryPlan. > More details [in this > document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35496: -- Affects Version/s: (was: 3.2.0) > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > This issue aims to upgrade to Scala 2.13.7. > Scala 2.13.6 was released (https://github.com/scala/scala/releases/tag/v2.13.6). > However, we skip 2.13.6 because it introduced a breaking behavior change > that differs from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37066) Improve ORC RecordReader's error message
[ https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-37066: Assignee: angerszhu > Improve ORC RecordReader's error message > > > Key: SPARK-37066 > URL: https://issues.apache.org/jira/browse/SPARK-37066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > > In below error message, we don't know the actual file path > {code} > 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID > 257) > java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292) > at > org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820) > at > org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517) > at > org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802) > at > org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
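The improvement above (making the reader's error name the offending file) can be modeled by wrapping the record iterator and re-raising with context. This is a toy Python model of the idea, not Spark's Scala file-scan code; the wrapper name and the exact message wording are assumptions for illustration.

```python
def with_file_context(records, file_path):
    """Yield records from an underlying reader, re-raising any reader
    error with the file path attached so the stack trace identifies
    the offending file instead of a bare ArrayIndexOutOfBoundsException."""
    try:
        for record in records:
            yield record
    except Exception as err:
        raise RuntimeError(
            f"Encountered error while reading file {file_path}") from err

def bad_reader():
    # Stand-in for an ORC batch reader that fails mid-scan.
    yield 1
    raise IndexError(1024)

rows, message = [], ""
try:
    for row in with_file_context(bad_reader(), "hdfs://warehouse/t/part-00034.orc"):
        rows.append(row)
except RuntimeError as e:
    message = str(e)
```

The original low-level exception is preserved via exception chaining (`from err`), so the index-out-of-bounds detail is still visible below the file-path message when the error is logged.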
[jira] [Resolved] (SPARK-37066) Improve ORC RecordReader's error message
[ https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37066. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34337 [https://github.com/apache/spark/pull/34337] > Improve ORC RecordReader's error message > > > Key: SPARK-37066 > URL: https://issues.apache.org/jira/browse/SPARK-37066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.3.0 > > > In below error message, we don't know the actual file path > {code} > 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID > 257) > java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292) > at > org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820) > at > org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517) > at > org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802) > at > org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37066) Improve ORC RecordReader's error message
[ https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-37066: - Priority: Minor (was: Major) > Improve ORC RecordReader's error message > > > Key: SPARK-37066 > URL: https://issues.apache.org/jira/browse/SPARK-37066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Minor > > In below error message, we don't know the actual file path > {code} > 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID > 257) > java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292) > at > org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820) > at > org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517) > at > org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802) > at > org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394 ] Nicholas Chammas edited comment on SPARK-24853 at 11/2/21, 2:41 PM: [~hyukjin.kwon] - It's not just for consistency. As noted in the description, this is useful when you are trying to rename a column with an ambiguous name. For example, imagine two tables {{left}} and {{right}}, each with a column called {{count}}: {code:python} ( left_counts.alias('left') .join(right_counts.alias('right'), on='join_key') .withColumn( 'total_count', left_counts['count'] + right_counts['count'] ) .withColumnRenamed('left.count', 'left_count') # no-op; alias doesn't work .withColumnRenamed('count', 'left_count') # incorrect; it renames both count columns .withColumnRenamed(left_counts['count'], 'left_count') # what, ideally, users want to do here .show() ){code} If you don't mind, I'm going to reopen this issue. was (Author: nchammas): [~hyukjin.kwon] - It's not just for consistency. As noted in the description, this is useful when you are trying to rename a column with an ambiguous name. For example, imagine two tables {{left}} and {{right}}, each with a column called {{count}}: {code:java} ( left_counts.alias('left') .join(right_counts.alias('right'), on='join_key') .withColumn( 'total_count', left_counts['count'] + right_counts['count'] ) .withColumnRenamed('left.count', 'left_count') # no-op; alias doesn't work .withColumnRenamed('count', 'left_count') # incorrect; it renames both count columns .withColumnRenamed(left_counts['count'], 'left_count') # what, ideally, users want to do here .show() ){code} If you don't mind, I'm going to reopen this issue. 
> Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437396#comment-17437396 ] Nicholas Chammas commented on SPARK-24853: -- The [contributing guide|https://spark.apache.org/contributing.html] isn't clear on how to populate "Affects Version" for improvements, so I've just tagged the latest release. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-24853: - Priority: Minor (was: Major) > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-24853: -- > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394 ] Nicholas Chammas commented on SPARK-24853: -- [~hyukjin.kwon] - It's not just for consistency. As noted in the description, this is useful when you are trying to rename a column with an ambiguous name. For example, imagine two tables {{left}} and {{right}}, each with a column called {{count}}: {code:java} ( left_counts.alias('left') .join(right_counts.alias('right'), on='join_key') .withColumn( 'total_count', left_counts['count'] + right_counts['count'] ) .withColumnRenamed('left.count', 'left_count') # no-op; alias doesn't work .withColumnRenamed('count', 'left_count') # incorrect; it renames both count columns .withColumnRenamed(left_counts['count'], 'left_count') # what, ideally, users want to do here .show() ){code} If you don't mind, I'm going to reopen this issue. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2 >Reporter: nirav patel >Priority: Major > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. 
> > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
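The ambiguity described in the comment above can be illustrated without Spark. Below is a minimal, purely hypothetical sketch in plain Python (not the actual Spark API) that models columns of a joined result as `(qualifier, name)` pairs: a string-based rename cannot distinguish two join outputs that share a name, while a qualified, `Column`-like reference can.

```python
# Toy model of a joined schema: each column is (qualifier, name).
# Purely illustrative; this is not how Spark represents columns internally.
schema = [("left", "count"), ("right", "count"), (None, "join_key")]

def rename_by_string(schema, existing, new):
    # String-based rename: matches EVERY column with that name,
    # which is the ambiguous behavior described in the comment.
    return [(q, new if n == existing else n) for q, n in schema]

def rename_by_reference(schema, existing, new):
    # Column-like rename: matches on (qualifier, name), so it is unambiguous.
    return [(q, new if (q, n) == existing else n) for q, n in schema]

renamed_all = rename_by_string(schema, "count", "left_count")
renamed_one = rename_by_reference(schema, ("left", "count"), "left_count")
```

Here `renamed_all` renames both `count` columns, mirroring the "incorrect" branch in the PySpark example, while `renamed_one` touches only `left.count`, which is what a `Column`-typed `withColumnRenamed` overload would make possible.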
[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-24853: - Affects Version/s: 3.2.0 > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437329#comment-17437329 ] Chuck Connell edited comment on SPARK-37189 at 11/2/21, 1:18 PM: - Ok, will do. was (Author: chconnell): Ok, will do. FYI, getting covid shot today, so I may be tired for a few days. > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37197) PySpark pandas recent issues from chconnell
Chuck Connell created SPARK-37197: - Summary: PySpark pandas recent issues from chconnell Key: SPARK-37197 URL: https://issues.apache.org/jira/browse/SPARK-37197 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell SPARK-37180 PySpark.pandas should support __version__ SPARK-37181 pyspark.pandas.read_csv() should support latin-1 encoding SPARK-37183 pyspark.pandas.DataFrame.map() should support .fillna() SPARK-37184 pyspark.pandas should support DF["column"].str.split("some_suffix").str[0] SPARK-37186 pyspark.pandas should support tseries.offsets SPARK-37187 pyspark.pandas fails to create a histogram of one column from a large DataFrame SPARK-37188 pyspark.pandas histogram accepts the title option but does not add a title to the plot SPARK-37189 pyspark.pandas histogram accepts the range option but does not use it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437329#comment-17437329 ] Chuck Connell commented on SPARK-37189: --- Ok, will do. FYI, getting covid shot today, so I may be tired for a few days. > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
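Until the bug is fixed, pre-selecting rows is the workaround the reporter describes. A minimal sketch in plain Python (function name hypothetical) of what honoring `range=[lo, hi]` amounts to before the values are binned:

```python
def apply_hist_range(values, hist_range=None):
    # If a range is given, keep only values inside it before binning.
    # This mirrors the effect plotting backends give the `range` option;
    # exact edge handling may differ per backend.
    if hist_range is None:
        return list(values)
    lo, hi = hist_range
    return [v for v in values if lo <= v <= hi]

deaths_per_100k = [1.0, 5.5, 19.9, 20.0, 35.2, 80.0]
clipped = apply_hist_range(deaths_per_100k, hist_range=(0, 20))
```

With the range applied, `clipped` drops the values above 20, which is the behavior the reporter expected from `DF.plot.hist(bins=30, range=[0, 20], ...)`.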
[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437259#comment-17437259 ] Steve Loughran commented on SPARK-23977: [~gumartinm] can I draw your attention to Apache Iceberg? Meanwhile, MAPREDUCE-7341 adds a high-performance committer targeting abfs and gcs; all tree scanning is in task commit, which is atomic; job commit is aggressively parallelised and optimized for stores whose listStatusIterator calls are incremental with prefetching: we can start processing the first page of task manifests found in a listing while the second page is still being retrieved. Also in there: rate limiting, IO Statistics collection. HADOOP-17833: I will pick up some of that work, including incremental loading and rate limiting. And if we can keep reads and writes below the S3 IOPS limits, we should be able to avoid situations where we have to start sleeping and re-trying. HADOOP-17981 is my homework this week - emergency work to deal with a rare but current failure in abfs under heavy load. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit; a failure of a task at any stage, including partway through task > commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. > * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
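The key property the description relies on — no visible output until job commit, with failed task attempts leaving no trace — can be sketched as a toy staging protocol. This is plain Python and purely illustrative: the real S3A committers achieve this with S3 multipart uploads that are only completed at job commit, not with in-memory dicts.

```python
class ToyCommitter:
    """Toy model of 'no visible output until job commit'."""

    def __init__(self):
        self.staged = {}    # task attempt -> pending (invisible) output
        self.visible = {}   # what a reader of the destination sees

    def task_write(self, attempt, path, data):
        # Data is uploaded but NOT visible, like an open multipart upload.
        self.staged.setdefault(attempt, {})[path] = data

    def task_abort(self, attempt):
        # A failed attempt leaves no trace; another attempt can retry.
        self.staged.pop(attempt, None)

    def job_commit(self, winning_attempts):
        # Only job commit completes the uploads, making output visible.
        for attempt in winning_attempts:
            self.visible.update(self.staged.pop(attempt, {}))

c = ToyCommitter()
c.task_write("t0_attempt0", "part-0", b"bad")
c.task_abort("t0_attempt0")            # failure partway through is safe
c.task_write("t0_attempt1", "part-0", b"good")
assert c.visible == {}                 # nothing visible before job commit
c.job_commit(["t0_attempt1"])
```

After `job_commit`, only the committed attempt's output is visible; this is the semantic that rename-based `FileOutputFormat` commit algorithms cannot guarantee on S3.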
[jira] [Updated] (SPARK-37196) NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)
[ https://issues.apache.org/jira/browse/SPARK-37196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Yavorski updated SPARK-37196: Description: Still reproducible in Spark 3.0.1. +*How to reproduce:*+ *SHELL* > export SPARK_MAJOR_VERSION=3 *SPARK PART 1* > spark-shell --master local SPARK_MAJOR_VERSION is set to 3, using Spark3 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/10/25 11:43:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 21/10/25 11:43:58 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042. Spark context Web UI available at :4042 Spark context available as 'sc' (master = local, app id = local-1635151438558). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_302) Type in expressions to have them evaluated. Type :help for more information. 
scala> val df = spark.sql("select 'dummy' as name, 100010.7010 as value") df: org.apache.spark.sql.DataFrame = [name: string, value: decimal(38,16)] scala> df.write.mode("Overwrite").parquet("test.parquet") *HIVE* hive> create external table test_precision(name string, value Decimal(18,6)) STORED AS PARQUET LOCATION 'test'; OK Time taken: 0.067 seconds *SPARK PART 2* scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet","false"); scala> val df_hive = spark.sql("select * from test_precision"); df_hive: org.apache.spark.sql.DataFrame = [name: string, value: decimal(18,6)] scala> df_hive.show; 21/10/25 12:04:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1] *{color:red}java.lang.NullPointerException at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106){color}* at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13(TableReader.scala:465) at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13$adapted(TableReader.scala:464) at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:493) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 21/10/25 12:04:51 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, tklis-kappd0005.dev.df.sbrf.ru, executor driver): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106) at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13(TableReader.scala:465) at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13$adapted(TableReader.scala:464) at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:493) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at
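The stack trace points at `HiveShim.toCatalystDecimal`. A plausible trigger — an assumption from the repro above, not verified against the Spark source — is that the parquet file was written as `decimal(38,16)` but read through a Hive schema of `decimal(18,6)`, so the Hive reader hands back a null decimal that the shim then dereferences. A null-safe sketch of the pattern in plain Python (the real code is Scala; the function and checks here are hypothetical):

```python
from decimal import Decimal

def to_catalyst_decimal(hive_value, precision, scale):
    # Hypothetical null-safe conversion: a value that does not fit the
    # declared (precision, scale) can come back as None from the reader
    # and must be propagated as None rather than dereferenced.
    if hive_value is None:
        return None
    # Round to the declared scale.
    quantized = hive_value.quantize(Decimal(1).scaleb(-scale))
    # Reject values with more total digits than the declared precision.
    if len(quantized.as_tuple().digits) > precision:
        return None
    return quantized

ok = to_catalyst_decimal(Decimal("100010.7010"), 18, 6)
bad = to_catalyst_decimal(None, 18, 6)   # no NPE: null is propagated
```

The point of the sketch is only the `None` guard: whatever the exact overflow rule, the conversion should surface a null (or a clear error) instead of a `NullPointerException`.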
[jira] [Created] (SPARK-37196) NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)
Sergey Yavorski created SPARK-37196: --- Summary: NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106) Key: SPARK-37196 URL: https://issues.apache.org/jira/browse/SPARK-37196 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.0.1 Reporter: Sergey Yavorski This is still reproducible in Spark 3.0.1: -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37176. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34455 [https://github.com/apache/spark/pull/34455] > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 3.3.0 > > > JacksonParser's exception handle logic is different with > org.apache.spark.sql.catalyst.json.JsonInferSchema#infer logic, the different > can be saw as below: > {code:java} > // code JacksonParser's parse > try { > Utils.tryWithResource(createParser(factory, record)) { parser => > // a null first token is equivalent to testing for input.trim.isEmpty > // but it works on any token stream and not just strings > parser.nextToken() match { > case null => None > case _ => rootConverter.apply(parser) match { > case null => throw > QueryExecutionErrors.rootConverterReturnNullError() > case rows => rows.toSeq > } > } > } > } catch { > case e: SparkUpgradeException => throw e > case e @ (_: RuntimeException | _: JsonProcessingException | _: > MalformedInputException) => > // JSON parser currently doesn't support partial results for > corrupted records. > // For such records, all fields other than the field configured by > // `columnNameOfCorruptRecord` are set to `null`. > throw BadRecordException(() => recordLiteral(record), () => None, e) > case e: CharConversionException if options.encoding.isEmpty => > val msg = > """JSON parser cannot handle a character in its input. > |Specifying encoding as an input option explicitly might help to > resolve the issue. 
> |""".stripMargin + e.getMessage > val wrappedCharException = new CharConversionException(msg) > wrappedCharException.initCause(e) > throw BadRecordException(() => recordLiteral(record), () => None, > wrappedCharException) > case PartialResultException(row, cause) => > throw BadRecordException( > record = () => recordLiteral(record), > partialResult = () => Some(row), > cause) > } > {code} > v.s. > {code:java} > // JsonInferSchema's infer logic > val mergedTypesFromPartitions = json.mapPartitions { iter => > val factory = options.buildJsonFactory() > iter.flatMap { row => > try { > Utils.tryWithResource(createParser(factory, row)) { parser => > parser.nextToken() > Some(inferField(parser)) > } > } catch { > case e @ (_: RuntimeException | _: JsonProcessingException) => > parseMode match { > case PermissiveMode => > Some(StructType(Seq(StructField(columnNameOfCorruptRecord, > StringType > case DropMalformedMode => > None > case FailFastMode => > throw > QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e) > } > } > }.reduceOption(typeMerger).toIterator > } > {code} > They should have the same exception handle logic, otherwise it may confuse > user because of the inconsistency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
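One way to get the consistency the ticket asks for is to factor the `catch` cases into a single handler shared by both the parse path and the inference path. A minimal sketch in plain Python (the real code is Scala, and Jackson's exception types are stood in for by built-in ones; all names here are hypothetical):

```python
import json

class BadRecordError(Exception):
    """Stand-in for Spark's BadRecordException."""

def handle_json_errors(fn, record):
    # Shared exception-translation logic: both the parser and the
    # schema-inference path funnel failures through the same cases,
    # so a malformed record is classified identically in both.
    try:
        return fn(record)
    except (ValueError, RuntimeError) as e:
        raise BadRecordError(record) from e

def parse(record):
    # Parse path: returns the decoded value.
    return handle_json_errors(json.loads, record)

def infer(record):
    # Inference path: returns only the inferred top-level type name.
    return handle_json_errors(lambda r: type(json.loads(r)).__name__, record)
```

With one handler, a record that raises in `parse` raises the same `BadRecordError` in `infer`, instead of the two paths diverging as described above.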
[jira] [Assigned] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37176: Assignee: Xianjin YE > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > > JacksonParser's exception handle logic is different with > org.apache.spark.sql.catalyst.json.JsonInferSchema#infer logic, the different > can be saw as below: > {code:java} > // code JacksonParser's parse > try { > Utils.tryWithResource(createParser(factory, record)) { parser => > // a null first token is equivalent to testing for input.trim.isEmpty > // but it works on any token stream and not just strings > parser.nextToken() match { > case null => None > case _ => rootConverter.apply(parser) match { > case null => throw > QueryExecutionErrors.rootConverterReturnNullError() > case rows => rows.toSeq > } > } > } > } catch { > case e: SparkUpgradeException => throw e > case e @ (_: RuntimeException | _: JsonProcessingException | _: > MalformedInputException) => > // JSON parser currently doesn't support partial results for > corrupted records. > // For such records, all fields other than the field configured by > // `columnNameOfCorruptRecord` are set to `null`. > throw BadRecordException(() => recordLiteral(record), () => None, e) > case e: CharConversionException if options.encoding.isEmpty => > val msg = > """JSON parser cannot handle a character in its input. > |Specifying encoding as an input option explicitly might help to > resolve the issue. 
> |""".stripMargin + e.getMessage > val wrappedCharException = new CharConversionException(msg) > wrappedCharException.initCause(e) > throw BadRecordException(() => recordLiteral(record), () => None, > wrappedCharException) > case PartialResultException(row, cause) => > throw BadRecordException( > record = () => recordLiteral(record), > partialResult = () => Some(row), > cause) > } > {code} > v.s. > {code:java} > // JsonInferSchema's infer logic > val mergedTypesFromPartitions = json.mapPartitions { iter => > val factory = options.buildJsonFactory() > iter.flatMap { row => > try { > Utils.tryWithResource(createParser(factory, row)) { parser => > parser.nextToken() > Some(inferField(parser)) > } > } catch { > case e @ (_: RuntimeException | _: JsonProcessingException) => > parseMode match { > case PermissiveMode => > Some(StructType(Seq(StructField(columnNameOfCorruptRecord, > StringType > case DropMalformedMode => > None > case FailFastMode => > throw > QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e) > } > } > }.reduceOption(typeMerger).toIterator > } > {code} > They should have the same exception handle logic, otherwise it may confuse > user because of the inconsistency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37194: Assignee: (was: Apache Spark) > Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition > > > Key: SPARK-37194 > URL: https://issues.apache.org/jira/browse/SPARK-37194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > `FileFormatWriter.write` will sort the partition and bucket column before > writing. I think this code path assumed the input `partitionColumns` are > dynamic but actually it's not. It now is used by three code path: > - `FileStreamSink`; it should be always dynamic partition > - `SaveAsHiveFile`; it followed the assuming that `InsertIntoHiveTable` has > removed the static partition and `InsertIntoHiveDirCommand` has no partition > - `InsertIntoHadoopFsRelationCommand`; it passed `partitionColumns` into > `FileFormatWriter.write` without removing static partition because we need it > to generate the partition path in `DynamicPartitionDataWriter` > It shows that the unnecessary sort only affected the > `InsertIntoHadoopFsRelationCommand` if we write data with static partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37194: Assignee: Apache Spark > Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition > > > Key: SPARK-37194 > URL: https://issues.apache.org/jira/browse/SPARK-37194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > `FileFormatWriter.write` will sort the partition and bucket column before > writing. I think this code path assumed the input `partitionColumns` are > dynamic but actually it's not. It now is used by three code path: > - `FileStreamSink`; it should be always dynamic partition > - `SaveAsHiveFile`; it followed the assuming that `InsertIntoHiveTable` has > removed the static partition and `InsertIntoHiveDirCommand` has no partition > - `InsertIntoHadoopFsRelationCommand`; it passed `partitionColumns` into > `FileFormatWriter.write` without removing static partition because we need it > to generate the partition path in `DynamicPartitionDataWriter` > It shows that the unnecessary sort only affected the > `InsertIntoHadoopFsRelationCommand` if we write data with static partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437226#comment-17437226 ] Apache Spark commented on SPARK-37194: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34468 > Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition > > > Key: SPARK-37194 > URL: https://issues.apache.org/jira/browse/SPARK-37194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > `FileFormatWriter.write` will sort the partition and bucket column before > writing. I think this code path assumed the input `partitionColumns` are > dynamic but actually it's not. It now is used by three code path: > - `FileStreamSink`; it should be always dynamic partition > - `SaveAsHiveFile`; it followed the assuming that `InsertIntoHiveTable` has > removed the static partition and `InsertIntoHiveDirCommand` has no partition > - `InsertIntoHadoopFsRelationCommand`; it passed `partitionColumns` into > `FileFormatWriter.write` without removing static partition because we need it > to generate the partition path in `DynamicPartitionDataWriter` > It shows that the unnecessary sort only affected the > `InsertIntoHadoopFsRelationCommand` if we write data with static partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
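The fix the ticket proposes amounts to requiring a sort only on the dynamic partition columns (plus any bucket columns), so a fully static insert skips the sort entirely. A hedged sketch in plain Python (column names and the helper are hypothetical; the real logic lives in `FileFormatWriter.write`):

```python
def required_ordering(partition_columns, static_partitions, bucket_columns=()):
    # Only partitions whose value is not fixed at write time (dynamic
    # partitions) force an ordering on the input rows; static partitions
    # are constant for the whole insert and need no sort.
    dynamic = [c for c in partition_columns if c not in static_partitions]
    return dynamic + list(bucket_columns)

# INSERT ... PARTITION (dt='2021-11-01')  -> fully static, no sort needed
assert required_ordering(["dt"], {"dt"}) == []
# INSERT ... PARTITION (dt)               -> dynamic, must sort by dt
assert required_ordering(["dt"], set()) == ["dt"]
```

This matches the ticket's observation: of the three callers, only `InsertIntoHadoopFsRelationCommand` passes static partition columns through, so only it pays for the unnecessary sort today.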
[jira] [Commented] (SPARK-36895) Add Create Index syntax support
[ https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437216#comment-17437216 ] Apache Spark commented on SPARK-36895: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/34467 > Add Create Index syntax support > --- > > Key: SPARK-36895 > URL: https://issues.apache.org/jira/browse/SPARK-36895 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437213#comment-17437213 ] xiaoli commented on SPARK-37034: [~cloud_fan] Thanks very much. The hint in your answer helps me find ApplyColumnarRulesAndInsertTransitions class in spark, I could do more investigation now. > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has support vectorized read for ORC and parquet. What's the progress of > other vectorized execution, e.g. vectorized write, join, aggr, simple > operator (string function, math function)? > Hive support vectorized execution in early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is replacement of Hive. I guess the reason why Spark does > not support vectorized execution maybe the difficulty of design or > implementation in Spark is larger. What's the main issue for Spark to support > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests
PengLei created SPARK-37195: --- Summary: Unify v1 and v2 SHOW TBLPROPERTIES tests Key: SPARK-37195 URL: https://issues.apache.org/jira/browse/SPARK-37195 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: PengLei Fix For: 3.3.0 Unify v1 and v2 SHOW TBLPROPERTIES tests -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36924) CAST between ANSI intervals and numerics
[ https://issues.apache.org/jira/browse/SPARK-36924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437198#comment-17437198 ] PengLei commented on SPARK-36924: - working on this later > CAST between ANSI intervals and numerics > > > Key: SPARK-36924 > URL: https://issues.apache.org/jira/browse/SPARK-36924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Support casting between ANSI intervals and numerics. The implementation > should follow the ANSI SQL standard. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-37194: -- Description: `FileFormatWriter.write` will sort the partition and bucket columns before writing. I think this code path assumed the input `partitionColumns` are dynamic, but actually they are not. It is now used by three code paths: - `FileStreamSink`; it should always be a dynamic partition - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has removed the static partition and `InsertIntoHiveDirCommand` has no partition - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into `FileFormatWriter.write` without removing the static partition because we need it to generate the partition path in `DynamicPartitionDataWriter` It shows that the unnecessary sort only affects `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. was: `FileFormatWriter.write` will sort the partition and bucket columns before writing. I think this code path assumed the input `partitionColumns` are dynamic, but actually they are not. It is now used by three code paths: - `FileStreamSink`; it should always be a dynamic partition - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has removed the static partition and `InsertIntoHiveDirCommand` has no partition - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into `FileFormatWriter.write` without removing the static partition because we need it to generate the partition path in `DynamicPartitionDataWriter` It shows that the unnecessary sort only affects `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. 
Do a simple benchmark: {code:java} CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string); -- before this PR, it took 1.82 seconds -- after this PR, it took 1.072 seconds INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000); {code} > Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition > > > Key: SPARK-37194 > URL: https://issues.apache.org/jira/browse/SPARK-37194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > `FileFormatWriter.write` will sort the partition and bucket columns before > writing. I think this code path assumed the input `partitionColumns` are > dynamic, but actually they are not. It is now used by three code paths: > - `FileStreamSink`; it should always be a dynamic partition > - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has > removed the static partition and `InsertIntoHiveDirCommand` has no partition > - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into > `FileFormatWriter.write` without removing the static partition because we need it > to generate the partition path in `DynamicPartitionDataWriter` > It shows that the unnecessary sort only affects > `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437197#comment-17437197 ] Oscar Bonilla commented on SPARK-26365: --- Hi [~Gangishetty], unfortunately I'm not a contributor to the Apache Spark project. I only reported the issue because I found it. You'll have to follow the usual channels if you want this issue prioritized. Although I assume it's probably already been fixed in the 3.x branch > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: Oscar Bonilla >Priority: Minor > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > application fails (returns exit code = 1, for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-37194: -- Description: `FileFormatWriter.write` will sort the partition and bucket columns before writing. I think this code path assumed the input `partitionColumns` are dynamic, but actually they are not. It is now used by three code paths: - `FileStreamSink`; it should always be a dynamic partition - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has removed the static partition and `InsertIntoHiveDirCommand` has no partition - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into `FileFormatWriter.write` without removing the static partition because we need it to generate the partition path in `DynamicPartitionDataWriter` It shows that the unnecessary sort only affects `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. Do a simple benchmark: {code:java} CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string); -- before this PR, it took 1.82 seconds -- after this PR, it took 1.072 seconds INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000); {code} was: `FileFormatWriter.write` will sort the partition and bucket columns before writing. I think this code path assumed the input `partitionColumns` are dynamic, but actually they are not. It is now used by three code paths: - `FileStreamSink`; it should always be a dynamic partition - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has removed the static partition and `InsertIntoHiveDirCommand` has no partition - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into `FileFormatWriter.write` without removing the static partition because we need it to generate the partition path in `DynamicPartitionDataWriter` It shows that the unnecessary sort only affects `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. 
Do a simple benchmark: {code:java} CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string); -- before this PR, it took 1.82 seconds -- after this PR, it took 1.072 seconds INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000); {code} > Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition > > > Key: SPARK-37194 > URL: https://issues.apache.org/jira/browse/SPARK-37194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > `FileFormatWriter.write` will sort the partition and bucket columns before > writing. I think this code path assumed the input `partitionColumns` are > dynamic, but actually they are not. It is now used by three code paths: > - `FileStreamSink`; it should always be a dynamic partition > - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has > removed the static partition and `InsertIntoHiveDirCommand` has no partition > - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into > `FileFormatWriter.write` without removing the static partition because we need it > to generate the partition path in `DynamicPartitionDataWriter` > It shows that the unnecessary sort only affects > `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. > > Do a simple benchmark: > {code:java} > CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string); > -- before this PR, it took 1.82 seconds > -- after this PR, it took 1.072 seconds > INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
XiDuo You created SPARK-37194: - Summary: Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition Key: SPARK-37194 URL: https://issues.apache.org/jira/browse/SPARK-37194 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You `FileFormatWriter.write` will sort the partition and bucket columns before writing. I think this code path assumed the input `partitionColumns` are dynamic, but actually they are not. It is now used by three code paths: - `FileStreamSink`; it should always be a dynamic partition - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has removed the static partition and `InsertIntoHiveDirCommand` has no partition - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into `FileFormatWriter.write` without removing the static partition because we need it to generate the partition path in `DynamicPartitionDataWriter` It shows that the unnecessary sort only affects `InsertIntoHadoopFsRelationCommand` if we write data with a static partition. Do a simple benchmark: {code:java} CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string); -- before this PR, it took 1.82 seconds -- after this PR, it took 1.072 seconds INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
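The core argument above is that a pre-write sort is only needed for the partition columns that remain dynamic; a fully static partition spec leaves nothing to sort by. A minimal Python sketch of that decision, with hypothetical names (this is not Spark's actual `FileFormatWriter` API):

```python
# Minimal sketch of the decision described above: sort before writing only
# when dynamic partition columns (or bucket columns) remain. Names are
# hypothetical, not Spark's actual FileFormatWriter API.

def needs_sort(partition_columns, static_partitions, bucket_columns=()):
    """Return True when the writer must sort rows before writing.

    Dynamic partition columns are those not bound to a static value in
    the INSERT statement; only they force a sort (so each output file
    receives contiguous rows for a single partition value).
    """
    dynamic = [c for c in partition_columns if c not in static_partitions]
    return bool(dynamic) or bool(bucket_columns)

# INSERT OVERWRITE ... PARTITION(d='a'): d is static, no sort needed.
print(needs_sort(["d"], {"d": "a"}))  # False
# INSERT OVERWRITE ... PARTITION(d): d is dynamic, sort required.
print(needs_sort(["d"], {}))          # True
```

In this framing, the benchmark in the ticket is the first case: `PARTITION(d='a')` binds the only partition column statically, so skipping the sort is safe.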
[jira] [Comment Edited] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.
[ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437184#comment-17437184 ] Hyukjin Kwon edited comment on SPARK-37174 at 11/2/21, 7:57 AM: This is related to default index, see also https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type. Spark 3.3 targets to remove such warnings for natively supporting global windows. That's slightly orthogonal from this issue though. was (Author: hyukjin.kwon): This is related to default index, see also https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type. Spark 3.3 targets to remove such warnings. > WARN WindowExec: No Partition Defined is being printed 4 times. > > > Key: SPARK-37174 > URL: https://issues.apache.org/jira/browse/SPARK-37174 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Hi I use this code > {code:java} > f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json") > pf01 = f01.to_pandas_on_spark() > pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x)) > pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = > ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"]) > pf01.info(){code} > > sometimes it prints > > {code:java} > 21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:04 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. 
> /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: > DataFrame is highly fragmented. This is usually the result of calling > `frame.insert` many times, which has poor performance. Consider joining all > columns at once using pd.concat(axis=1) instead. To get a de-fragmented > frame, use `newframe = frame.copy()` > df[column_name] = series > /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` > loads all data into the driver's memory. It should only be used if the > resulting pandas Series is expected to be small. > warnings.warn(message, UserWarning) > 21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > > and some other times it "just" prints > > {code:java} > 21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > Why does it print df[column_name] = series ? > > can we remove /opt/spark/python/pyspark/pandas/utils.py:967: ? > and warnings.warn(message, UserWarning) ? > and 3 of WARN WindowExec: No Partition Defined for Window operation! 
Moving > all data to a single partition, this can cause serious performance > degradation.? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.
[ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437184#comment-17437184 ] Hyukjin Kwon commented on SPARK-37174: -- This is related to default index, see also https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type. Spark 3.3 targets to remove such warnings. > WARN WindowExec: No Partition Defined is being printed 4 times. > > > Key: SPARK-37174 > URL: https://issues.apache.org/jira/browse/SPARK-37174 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Hi I use this code > {code:java} > f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json") > pf01 = f01.to_pandas_on_spark() > pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x)) > pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = > ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"]) > pf01.info(){code} > > sometimes it prints > > {code:java} > 21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:04 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: > DataFrame is highly fragmented. This is usually the result of calling > `frame.insert` many times, which has poor performance. Consider joining all > columns at once using pd.concat(axis=1) instead. 
To get a de-fragmented > frame, use `newframe = frame.copy()` > df[column_name] = series > /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` > loads all data into the driver's memory. It should only be used if the > resulting pandas Series is expected to be small. > warnings.warn(message, UserWarning) > 21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > > and some other times it "just" prints > > {code:java} > 21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > Why does it print df[column_name] = series ? > > can we remove /opt/spark/python/pyspark/pandas/utils.py:967: ? > and warnings.warn(message, UserWarning) ? > and 3 of WARN WindowExec: No Partition Defined for Window operation! Moving > all data to a single partition, this can cause serious performance > degradation.? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37177) Support LONG argument to the Spark SQL LIMIT clause
[ https://issues.apache.org/jira/browse/SPARK-37177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37177: - Component/s: (was: Spark Core) SQL > Support LONG argument to the Spark SQL LIMIT clause > --- > > Key: SPARK-37177 > URL: https://issues.apache.org/jira/browse/SPARK-37177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Douglas Moore >Priority: Major > > Big data sets may exceed INT max, thus the Spark SQL LIMIT clause should be > expanded to support a LONG argument. > Thus > `SELECT * FROM LIMIT 200` would run and not return an error. > > See docs: > [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-limit.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
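The motivation can be made concrete: a 32-bit INT limit caps out at 2,147,483,647 rows, while a LONG argument covers row counts far beyond that. A quick illustration (the 20-billion value is a hypothetical example, not a figure from the ticket):

```python
# Why an INT-typed LIMIT is insufficient for big data sets: row counts can
# exceed the 32-bit signed maximum. Illustrative values only.
INT_MAX = 2**31 - 1    # 2147483647, the largest value an INT LIMIT accepts
LONG_MAX = 2**63 - 1

requested = 20_000_000_000  # hypothetical limit larger than INT_MAX

print(requested > INT_MAX)    # the INT range cannot express this limit
print(requested <= LONG_MAX)  # a LONG argument could
```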
[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437182#comment-17437182 ] Hyukjin Kwon commented on SPARK-37185: -- can you show the perf diff between the two code paths? > DataFrame.take() only uses one worker > - > > Key: SPARK-37185 > URL: https://issues.apache.org/jira/browse/SPARK-37185 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 > Environment: CentOS 7 >Reporter: mathieu longtin >Priority: Major > > Say you have a query: > {code:java} > >>> df = spark.sql("select * from mytable where x = 99"){code} > Now, out of billions of rows, there are only ten rows where x is 99. > If I do: > {code:java} > >>> df.limit(10).collect() > [Stage 1:> (0 + 1) / 1]{code} > It only uses one worker. This takes a really long time since one CPU is > reading billions of rows. > However, if I do this: > {code:java} > >>> df.limit(10).rdd.collect() > [Stage 1:> (0 + 10) / 22]{code} > All the workers are running. > I think there's some optimization issue in DataFrame.take(...). > This did not use to be an issue, but I'm not sure if it was working with 3.0 > or 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437180#comment-17437180 ] Hyukjin Kwon commented on SPARK-37185: -- isn't it more efficient to use only one partition on one worker if less data is required? > DataFrame.take() only uses one worker > - > > Key: SPARK-37185 > URL: https://issues.apache.org/jira/browse/SPARK-37185 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 > Environment: CentOS 7 >Reporter: mathieu longtin >Priority: Major > > Say you have a query: > {code:java} > >>> df = spark.sql("select * from mytable where x = 99"){code} > Now, out of billions of rows, there are only ten rows where x is 99. > If I do: > {code:java} > >>> df.limit(10).collect() > [Stage 1:> (0 + 1) / 1]{code} > It only uses one worker. This takes a really long time since one CPU is > reading billions of rows. > However, if I do this: > {code:java} > >>> df.limit(10).rdd.collect() > [Stage 1:> (0 + 10) / 22]{code} > All the workers are running. > I think there's some optimization issue in DataFrame.take(...). > This did not use to be an issue, but I'm not sure if it was working with 3.0 > or 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
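The behavior reported above is consistent with an incremental limit scan: the take/limit path starts by scanning a small number of partitions and grows the scan if too few rows are found (Spark exposes `spark.sql.limit.scaleUpFactor` for this growth rate). When the needle rows are rare, many rounds run mostly on one task at a time. A hedged pure-Python sketch of that strategy (illustrative, not Spark's actual implementation):

```python
# Illustrative sketch of an incremental limit/take scan: start with one
# partition and multiply the number scanned by a scale-up factor until
# enough rows are collected. This mirrors the behavior discussed above
# (few busy workers when early partitions are scanned first); it is not
# Spark's actual implementation.

def incremental_take(partitions, n, scale_up_factor=4):
    """Collect up to n rows, scanning as few partitions as possible."""
    collected, scanned = [], 0
    batch = 1
    while scanned < len(partitions) and len(collected) < n:
        for part in partitions[scanned:scanned + batch]:
            collected.extend(part)
        scanned += batch
        batch *= scale_up_factor  # scan more partitions next round
    return collected[:n]

# 22 partitions; the ten matching rows live in the last one, so several
# rounds run before that partition is ever scanned.
parts = [[] for _ in range(21)] + [[99] * 10]
print(incremental_take(parts, 10))
```

This explains the trade-off raised in the comment: scanning one partition first is cheaper when matches are common, but pathological when matches are rare and late in the scan order.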
[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437179#comment-17437179 ] Hyukjin Kwon commented on SPARK-37189: -- Hey, thanks for reporting issues. [~chconnell] would you mind creating an umbrella JIRA and grouping all the JIRAs you opened? cc [~ueshin] [~itholic] [~XinrongM] FYI > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas, if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but the line above should work as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
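The semantics the report expects from `range` can be shown without Spark: values outside the range are excluded before the bins are computed, so only in-range rows appear in the plot. A small stdlib-only sketch of that behavior (this is not the pyspark.pandas plotting code):

```python
# Sketch of the histogram `range` semantics the report expects: values
# outside [lo, hi] are dropped before binning. Stdlib only; illustrative,
# not the pyspark.pandas implementation.

def histogram(values, bins, value_range):
    """Count values into equal-width bins, honoring the range option."""
    lo, hi = value_range
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:  # respect the range option: skip out-of-range rows
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts

data = [1, 5, 19, 25, 100]  # 25 and 100 fall outside range=[0, 20]
print(histogram(data, bins=4, value_range=(0, 20)))
```

The bug report says pyspark.pandas currently behaves as if the `if lo <= v <= hi` filter were absent, counting every value.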
[jira] [Commented] (SPARK-37030) Maven build failed in windows!
[ https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437178#comment-17437178 ] Hyukjin Kwon commented on SPARK-37030: -- You need {{bash}} on the path on Windows, either from Cygwin or native bash support from Windows. > Maven build failed in windows! > -- > > Key: SPARK-37030 > URL: https://issues.apache.org/jira/browse/SPARK-37030 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 > Environment: OS: Windows 10 Professional > OS Version: 21H1 > Maven Version: 3.6.3 > >Reporter: Shockang >Priority: Minor > Fix For: 3.2.0 > > Attachments: image-2021-10-17-22-18-16-616.png > > > I pulled the latest Spark master code on my local windows 10 computer and > executed the following command: > {code:java} > mvn -DskipTests clean install{code} > Build failed! > !image-2021-10-17-22-18-16-616.png! > {code:java} > Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run > (default) on project spark-core_2.12: An Ant BuildException has occured: > Execute failed: java.io.IOException: Cannot run program "bash" (in directory > "C:\bigdata\spark\core"): CreateProcess error=2{code} > It seems that the plugin: maven-antrun-plugin cannot run because of windows > no bash. > The following code comes from pom.xml in spark-core module. > {code:java} > > org.apache.maven.plugins > maven-antrun-plugin > > > generate-resources > > > > > > > > > > > > run > > > > > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
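The suggested fix can be checked programmatically before running Maven: the build only works if `bash` resolves on PATH. A small stdlib helper (illustrative, not part of the Spark build):

```python
# Check whether `bash` is resolvable on PATH, which the Spark Maven build
# requires via maven-antrun-plugin. Illustrative helper, not Spark code.
import shutil


def bash_available():
    """Return the resolved path to bash, or None when it is missing."""
    return shutil.which("bash")


path = bash_available()
if path is None:
    print("bash not found: install Cygwin/WSL or add bash to PATH")
else:
    print(f"bash found at {path}")
```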
[jira] [Resolved] (SPARK-37156) Inline type hints for python/pyspark/storagelevel.py
[ https://issues.apache.org/jira/browse/SPARK-37156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37156. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34437 [https://github.com/apache/spark/pull/34437] > Inline type hints for python/pyspark/storagelevel.py > > > Key: SPARK-37156 > URL: https://issues.apache.org/jira/browse/SPARK-37156 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37156) Inline type hints for python/pyspark/storagelevel.py
[ https://issues.apache.org/jira/browse/SPARK-37156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37156: Assignee: Apache Spark > Inline type hints for python/pyspark/storagelevel.py > > > Key: SPARK-37156 > URL: https://issues.apache.org/jira/browse/SPARK-37156 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37030) Maven build failed in windows!
[ https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437168#comment-17437168 ] Shockang commented on SPARK-37030: -- [~hyukjin.kwon] Even if the community does not support Spark on Windows, why does no one care about programmers who use Windows ... > Maven build failed in windows! > -- > > Key: SPARK-37030 > URL: https://issues.apache.org/jira/browse/SPARK-37030 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 > Environment: OS: Windows 10 Professional > OS Version: 21H1 > Maven Version: 3.6.3 > >Reporter: Shockang >Priority: Minor > Fix For: 3.2.0 > > Attachments: image-2021-10-17-22-18-16-616.png > > > I pulled the latest Spark master code on my local windows 10 computer and > executed the following command: > {code:java} > mvn -DskipTests clean install{code} > Build failed! > !image-2021-10-17-22-18-16-616.png! > {code:java} > Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run > (default) on project spark-core_2.12: An Ant BuildException has occured: > Execute failed: java.io.IOException: Cannot run program "bash" (in directory > "C:\bigdata\spark\core"): CreateProcess error=2{code} > It seems that the plugin: maven-antrun-plugin cannot run because of windows > no bash. > The following code comes from pom.xml in spark-core module. > {code:java} > > org.apache.maven.plugins > maven-antrun-plugin > > > generate-resources > > > > > > > > > > > > run > > > > > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37152) Inline type hints for python/pyspark/context.py
[ https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437161#comment-17437161 ] Apache Spark commented on SPARK-37152: -- User 'ByronHsu' has created a pull request for this issue: https://github.com/apache/spark/pull/34466 > Inline type hints for python/pyspark/context.py > --- > > Key: SPARK-37152 > URL: https://issues.apache.org/jira/browse/SPARK-37152 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37152) Inline type hints for python/pyspark/context.py
[ https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437159#comment-17437159 ] Apache Spark commented on SPARK-37152: -- User 'ByronHsu' has created a pull request for this issue: https://github.com/apache/spark/pull/34466 > Inline type hints for python/pyspark/context.py > --- > > Key: SPARK-37152 > URL: https://issues.apache.org/jira/browse/SPARK-37152 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37152) Inline type hints for python/pyspark/context.py
[ https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37152: Assignee: Apache Spark > Inline type hints for python/pyspark/context.py > --- > > Key: SPARK-37152 > URL: https://issues.apache.org/jira/browse/SPARK-37152 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37152) Inline type hints for python/pyspark/context.py

[ https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37152:
------------------------------------

    Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/context.py
> -----------------------------------------------
>
>                 Key: SPARK-37152
>                 URL: https://issues.apache.org/jira/browse/SPARK-37152
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Byron Hsu
>            Priority: Major
>
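For readers unfamiliar with the SPARK-37152 sub-tasks: "inlining type hints" means moving annotations from a separate `.pyi` stub file into the `.py` implementation itself. The sketch below is a toy illustration of that distinction, not the actual PySpark `context.py` code; `parallelize` here is a simplified stand-in whose body only mimics `SparkContext.parallelize`'s slicing behaviour.

```python
# Toy illustration of inline type hints (NOT the real PySpark source).
from typing import List, Optional

# Stub-file style: before the migration, the signature lived in context.pyi,
# separate from the implementation:
#     def parallelize(self, c: Iterable[T], numSlices: Optional[int] = ...) -> RDD[T]: ...

# Inline style: after the migration, the annotations sit directly on the
# definition, so type checkers and readers see one source of truth.
def parallelize(c: List[int], numSlices: Optional[int] = None) -> List[List[int]]:
    """Simplified stand-in: split c into numSlices roughly equal chunks."""
    n = numSlices or 2
    k, m = divmod(len(c), n)
    return [c[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]
```

Inline hints also let tools like mypy check the function bodies themselves, which stub files cannot do.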
[jira] [Assigned] (SPARK-37168) Improve error messages for SQL functions and operators under ANSI mode

[ https://issues.apache.org/jira/browse/SPARK-37168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-37168:
-----------------------------------

    Assignee: Allison Wang

> Improve error messages for SQL functions and operators under ANSI mode
> ----------------------------------------------------------------------
>
>                 Key: SPARK-37168
>                 URL: https://issues.apache.org/jira/browse/SPARK-37168
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Allison Wang
>            Assignee: Allison Wang
>            Priority: Major
>
> Make error messages more actionable when ANSI mode is enabled.
[jira] [Resolved] (SPARK-37168) Improve error messages for SQL functions and operators under ANSI mode

[ https://issues.apache.org/jira/browse/SPARK-37168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-37168.
---------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34443
[https://github.com/apache/spark/pull/34443]

> Improve error messages for SQL functions and operators under ANSI mode
> ----------------------------------------------------------------------
>
>                 Key: SPARK-37168
>                 URL: https://issues.apache.org/jira/browse/SPARK-37168
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Allison Wang
>            Assignee: Allison Wang
>            Priority: Major
>             Fix For: 3.3.0
>
> Make error messages more actionable when ANSI mode is enabled.
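Context for SPARK-37168: with `spark.sql.ansi.enabled=true`, Spark raises runtime errors for operations (such as invalid casts) that would otherwise silently produce NULL, and the ticket targets the quality of those error messages. The snippet below is a hypothetical pure-Python illustration of the two behaviours, not Spark code; `cast_to_int` and its message text are invented for this sketch.

```python
# Hypothetical illustration of ANSI-mode cast semantics (not Spark code).
from typing import Optional

def cast_to_int(value: str, ansi: bool = False) -> Optional[int]:
    """Mimic CAST(value AS INT): return NULL (None) in legacy mode,
    raise an actionable error in ANSI mode."""
    try:
        return int(value)
    except ValueError:
        if ansi:
            # SPARK-37168 is about making messages like this one actionable:
            # say what failed, on which input, and how to opt out.
            raise ValueError(
                f"invalid input syntax for type int: '{value}'. "
                "To return NULL instead, disable ANSI mode.")
        return None  # legacy behaviour: silently produce NULL
```

In real Spark the same contrast appears as `SELECT CAST('abc' AS INT)` returning NULL by default but failing at runtime once `spark.sql.ansi.enabled` is set to `true`.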
[jira] [Assigned] (SPARK-37164) Add ExpressionBuilder for functions with complex overloads

[ https://issues.apache.org/jira/browse/SPARK-37164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-37164:
-----------------------------------

    Assignee: Wenchen Fan

> Add ExpressionBuilder for functions with complex overloads
> ----------------------------------------------------------
>
>                 Key: SPARK-37164
>                 URL: https://issues.apache.org/jira/browse/SPARK-37164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
[jira] [Resolved] (SPARK-37164) Add ExpressionBuilder for functions with complex overloads

[ https://issues.apache.org/jira/browse/SPARK-37164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-37164.
---------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34441
[https://github.com/apache/spark/pull/34441]

> Add ExpressionBuilder for functions with complex overloads
> ----------------------------------------------------------
>
>                 Key: SPARK-37164
>                 URL: https://issues.apache.org/jira/browse/SPARK-37164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.3.0
>
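The pattern named in SPARK-37164 registers one builder per SQL function that inspects the argument list and constructs the right expression variant, rather than registering a separate entry per overload. The sketch below shows that general builder-dispatch idea in Python; all names here (`Lpad2`, `Lpad3`, `LpadBuilder`) are invented for illustration and are not Spark's internal Scala API.

```python
# Hypothetical sketch of the ExpressionBuilder pattern: one build() entry
# point dispatches on argument count to the correct expression variant.
from dataclasses import dataclass
from typing import List

@dataclass
class Lpad2:  # lpad(str, len): pad on the left with spaces
    s: str
    length: int
    def eval(self) -> str:
        return self.s.rjust(self.length)

@dataclass
class Lpad3:  # lpad(str, len, pad): pad on the left with a custom string
    s: str
    length: int
    pad: str
    def eval(self) -> str:
        missing = max(0, self.length - len(self.s))
        return (self.pad * missing)[:missing] + self.s

class LpadBuilder:
    """Single registry entry that builds the right Lpad expression."""
    @staticmethod
    def build(args: List):
        if len(args) == 2:
            return Lpad2(*args)
        if len(args) == 3:
            return Lpad3(*args)
        raise TypeError(f"lpad expects 2 or 3 arguments, got {len(args)}")
```

Centralizing the dispatch also gives one place to raise a clear arity error, instead of scattering that check across overloaded constructors.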