[jira] [Updated] (SPARK-29611) Sort Kafka metadata by the number of messages
[ https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-29611:
----------------------------------
    Affects Version/s: 2.4.4

> Sort Kafka metadata by the number of messages
> ----------------------------------------------
>
>                 Key: SPARK-29611
>                 URL: https://issues.apache.org/jira/browse/SPARK-29611
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 2.3.0, 2.4.4, 3.0.0
>            Reporter: dengziming
>            Assignee: dengziming
>            Priority: Minor
>             Fix For: 3.0.0
>
>         Attachments: kafka-skew.jpeg
>
>
> We are suffering from data skew in Kafka and need to analyze it in the UI. As the attached image shows, the streaming UI is confusing; it would be better to add a count metric for each Kafka partition and sort the partitions by that count.
[jira] [Resolved] (SPARK-29611) Optimize the display of Kafka metadata and sort by the number of messages
[ https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-29611.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26266
[https://github.com/apache/spark/pull/26266]
[jira] [Updated] (SPARK-29611) Sort Kafka metadata by the number of messages
[ https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-29611:
----------------------------------
    Summary: Sort Kafka metadata by the number of messages  (was: Optimize the display of Kafka metadata and sort by the number of messages)
[jira] [Assigned] (SPARK-29611) Optimize the display of Kafka metadata and sort by the number of messages
[ https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-29611:
-------------------------------------
    Assignee: dengziming
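The skew described in this ticket can also be checked without the UI change: a batch read of the topic can count records per Kafka partition and sort by that count, the same ordering the patch adds to the page. A minimal hedged sketch (broker address and topic name are placeholders, and it assumes the spark-sql-kafka-0-10 package is on the classpath):

{code:java}
import org.apache.spark.sql.functions.desc

// Count records per Kafka partition and sort descending -- the same ordering
// the UI change proposes. Broker and topic names are placeholders.
val counts = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "some_topic")                   // placeholder
  .load()
  .groupBy("partition")
  .count()
  .orderBy(desc("count"))

counts.show()
{code}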
[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Kang updated SPARK-29721:
-----------------------------
    Description:

This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.

We are working on a project to create a Parquet store for a big pre-joined table between two tables that have a one-to-many relationship, and this is a blocking issue for us.

The following code illustrates the issue.

Part 1: loading some nested data

{noformat}
val jsonStr = """{
  "items": [
    {"itemId": 1, "itemData": "a"},
    {"itemId": 2, "itemData": "b"}
  ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}

Part 2: reading it back and explaining the queries

{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// pruned, only loading itemId
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
read.select($"items.itemId").explain(true)

// not pruned, loading both itemId and itemData
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
read.select(explode($"items.itemId")).explain(true)
{noformat}

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-29721
>                 URL: https://issues.apache.org/jira/browse/SPARK-29721
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Kai Kang
>            Priority: Critical
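To confirm which nested fields the Parquet scan actually reads, one can pull the ReadSchema entry out of the physical plan. A small hedged sketch (the readSchemaOf helper is illustrative, not from the report; it assumes a spark-shell session and the persisted table from Part 1):

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.explode
import spark.implicits._

// Illustrative helper: find the ReadSchema line in the executed plan.
def readSchemaOf(df: DataFrame): String =
  df.queryExecution.executedPlan.toString
    .split("\n")
    .find(_.contains("ReadSchema"))
    .getOrElse("<no file scan found>")

val read = spark.table("persisted")
println(readSchemaOf(read.select($"items.itemId")))          // itemId only
println(readSchemaOf(read.select(explode($"items.itemId")))) // itemId and itemData
{code}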
[jira] [Created] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
Kai Kang created SPARK-29721:
---------------------------------

             Summary: Spark SQL reads unnecessary nested fields from Parquet after using explode
                 Key: SPARK-29721
                 URL: https://issues.apache.org/jira/browse/SPARK-29721
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Kai Kang
[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965216#comment-16965216 ]

sandeshyapuram commented on SPARK-29682:
----------------------------------------

[~imback82] & [~cloud_fan] Currently I've worked around this by renaming every column in the dataframes before performing the joins, and that works. Let me know if you have a better workaround.

> Failure when resolving conflicting references in Join:
> ------------------------------------------------------
>
>                 Key: SPARK-29682
>                 URL: https://issues.apache.org/jira/browse/SPARK-29682
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.4.3
>            Reporter: sandeshyapuram
>            Priority: Major
>
> When I try to self-join a parentDf with multiple childDfs (say childDf1 ... ...), where the childDfs are derived from a cube or rollup and are filtered on the grouping columns, I get an error:
> {{Failure when resolving conflicting references in Join:}}
> followed by a long error message that is quite unreadable. On the other hand, if I replace the cube or rollup with a plain groupBy, it works without issues.
>
> *Sample code:*
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
>   .cube("nums")
>   .agg(
>     max(lit(0)).as("agcol"),
>     grouping_id().as("gid")
>   )
>
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
>
> // Recreating cubeDF
> cubeDF.select("nums").distinct
>   .join(group0, Seq("nums"), "inner")
>   .join(group1, Seq("nums"), "inner")
>   .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> :     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
> :        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
> :           +- Project [nums#212, nums#212 AS nums#219]
> :              +- Project [value#210 AS nums#212]
> :                 +- SerializeFromObject [input[0, int, false] AS value#210]
> :                    +- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>    +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
>       +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
>          +- Project [nums#212, nums#212 AS nums#219]
>             +- Project [value#210 AS nums#212]
>                +- SerializeFromObject [input[0, int, false] AS value#210]
>                   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> :     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
> :        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
> :           +- Project [nums#212, nums#212 AS nums#219]
> :              +- Project [value#210 AS nums#212]
> :                 +- SerializeFromObject [input[0, int, false] AS value#210]
> :                    +- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>    +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
>       +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
>          +- Project [nums#212, nums#212 AS nums#219]
>             +- Project [value#210 AS nums#212]
>                +- SerializeFromObject [input[0, int, false] AS value#210]
>                   +- ExternalRDD [obj#209]
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
> at
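The rename-before-join workaround from the comment above can be written as a small helper. A hedged sketch against the issue's sample code (the withSuffix helper and suffix scheme are assumptions, not the commenter's exact code):

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Give every column of a child DataFrame a unique suffix so the self-join
// no longer references conflicting attributes.
def withSuffix(df: DataFrame, suffix: String): DataFrame =
  df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, c + suffix))

val g0 = withSuffix(group0, "_g0")
val g1 = withSuffix(group1, "_g1")

cubeDF.select("nums").distinct
  .join(g0, col("nums") === col("nums_g0"), "inner")
  .join(g1, col("nums") === col("nums_g1"), "inner")
  .show
{code}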
[jira] [Created] (SPARK-29720) Add linux condition to make ProcfsMetricsGetter more complete
ulysses you created SPARK-29720: --- Summary: Add linux condition to make ProcfsMetricsGetter more complete Key: SPARK-29720 URL: https://issues.apache.org/jira/browse/SPARK-29720 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: ulysses you Currently, the decision whether proc stats can be gathered into executor metrics is: {code:java} procDirExists.get && shouldLogStageExecutorProcessTreeMetrics && shouldLogStageExecutorMetrics {code} But /proc is only supported on Linux, so an isLinux condition should be added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
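A sketch of the proposed guard; {{isLinux}} is assumed here to come from Apache Commons Lang's SystemUtils (already on Spark's classpath), and the other flags are the ones quoted above:
{code:java}
import org.apache.commons.lang3.SystemUtils

// Only attempt to read /proc when the OS actually provides it.
val isLinux: Boolean = SystemUtils.IS_OS_LINUX
val canGatherProcStats = isLinux && procDirExists.get &&
  shouldLogStageExecutorProcessTreeMetrics && shouldLogStageExecutorMetrics
{code}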
[jira] [Commented] (SPARK-29719) Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-29719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965183#comment-16965183 ] Yuming Wang commented on SPARK-29719: - You should refresh {{my_table}}. A similar issue: https://github.com/apache/spark/pull/22721 > Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex > -- > > Key: SPARK-29719 > URL: https://issues.apache.org/jira/browse/SPARK-29719 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Alexander Bessonov >Priority: Major > > Spark attempts to convert Hive tables backed by Parquet and ORC into > internal logical relations which cache file locations for the underlying > data. That cache isn't invalidated when the partitioned table is re-read > later on. The table might have new files by the time it is > re-read, and those files might be ignored. > > > {code:java} > val spark = SparkSession.builder() > .master("yarn") > .enableHiveSupport > .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER") > .getOrCreate() > val df1 = spark.table("my_table").filter("date=20191101") > // Do something with `df1` > // External process writes to the partition > val df2 = spark.table("my_table").filter("date=20191101") > // Do something with `df2`. Data in `df1` and `df2` should be different, but > is equal.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
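A sketch of the suggested workaround applied to the reporter's scenario; {{spark.catalog.refreshTable}} drops the cached relation (including its InMemoryFileIndex), so the second read lists the files again:
{code:java}
val df1 = spark.table("my_table").filter("date=20191101")
// ... external process writes new files into the partition ...

// Invalidate the cached logical relation and its file listing before
// re-reading, so the newly written files become visible.
spark.catalog.refreshTable("my_table")
val df2 = spark.table("my_table").filter("date=20191101")
{code}
The SQL equivalent is {{REFRESH TABLE my_table}}.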
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128 ] Huaxin Gao edited comment on SPARK-29691 at 11/1/19 10:33 PM: -- I checked the doc and implementation. The Estimator fits the model using the passed-in optional params instead of the embedded params, but it doesn't overwrite the estimator's embedded param values. In your case, the estimator uses 0.75 to fit the model, but it still keeps 0.8 for its own elasticNetParam. If you get the model's parameters, it should have 0.75 for elasticNetParam. This seems to work as designed. {code:java} # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={lr.elasticNetParam: 0.75}) print("After:", lrModel.getOrDefault("elasticNetParam")) # prints 0.75 {code} was (Author: huaxingao): I checked the doc and implementation. The Estimator fits the model using the passed in optional params instead of the embedded params, but it doesn't overwrite the estimator's embedded params values. In your case, the estimator uses 0.75 to fit the model, but it still keeps 0.8 for it's own elasticNetParam. If you get the model's parameters, it should have 0.75 for elasticNetParam. This seems to work as designed. {code:java} # Fit the model, but with an updated parameter setting:lrModel = lr.fit(training, params= {lor.elasticNetParam : 0.75} )print("After:", lrModel.getOrDefault("elasticNetParam")) # print 0.75 {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128 ] Huaxin Gao edited comment on SPARK-29691 at 11/1/19 10:32 PM: -- I checked the doc and implementation. The Estimator fits the model using the passed-in optional params instead of the embedded params, but it doesn't overwrite the estimator's embedded param values. In your case, the estimator uses 0.75 to fit the model, but it still keeps 0.8 for its own elasticNetParam. If you get the model's parameters, it should have 0.75 for elasticNetParam. This seems to work as designed. {code:java} # Fit the model, but with an updated parameter setting:lrModel = lr.fit(training, params= {lr.elasticNetParam : 0.75} )print("After:", lrModel.getOrDefault("elasticNetParam")) # print 0.75 {code} was (Author: huaxingao): I checked the doc and implementation. The Estimator fits the model using the passed in optional params instead of the embedded params, but it doesn't overwrite the estimator's embedded params values. In your case, the estimator uses 0.75 to fit the model, but it still keeps 0.8 for it's own elasticNetParam. If you get the model's parameters, it should have 0.75 for elasticNetParam. This seems to work as designed. # Fit the model, but with an updated parameter setting:lrModel = lr.fit(training, params={lor.elasticNetParam : 0.75})print("After:", lrModel.getOrDefault("elasticNetParam")) # print 0.75 > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128 ] Huaxin Gao commented on SPARK-29691: I checked the doc and implementation. The Estimator fits the model using the passed-in optional params instead of the embedded params, but it doesn't overwrite the estimator's embedded param values. In your case, the estimator uses 0.75 to fit the model, but it still keeps 0.8 for its own elasticNetParam. If you get the model's parameters, it should have 0.75 for elasticNetParam. This seems to work as designed. # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={lr.elasticNetParam : 0.75}) print("After:", lrModel.getOrDefault("elasticNetParam")) # prints 0.75 > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29719) Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex
Alexander Bessonov created SPARK-29719: -- Summary: Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex Key: SPARK-29719 URL: https://issues.apache.org/jira/browse/SPARK-29719 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Alexander Bessonov Spark attempts to convert Hive tables backed by Parquet and ORC into internal logical relations which cache file locations for the underlying data. That cache isn't invalidated when the partitioned table is re-read later on. The table might have new files by the time it is re-read, and those files might be ignored. {code:java} val spark = SparkSession.builder() .master("yarn") .enableHiveSupport .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER") .getOrCreate() val df1 = spark.table("my_table").filter("date=20191101") // Do something with `df1` // External process writes to the partition val df2 = spark.table("my_table").filter("date=20191101") // Do something with `df2`. Data in `df1` and `df2` should be different, but is equal.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18910) Can't use UDF whose jar file is in HDFS
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-18910. - > Can't use UDF whose jar file is in HDFS > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen >Priority: Major > > When I create a UDF whose jar file is in HDFS, I can't use the UDF. > {code} > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > {code} > The reason is that in > org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, > uri.toURL throws an exception: "failed unknown protocol: hdfs" > {code} > def addJar(path: String): Unit = { > sparkSession.sparkContext.addJar(path) > val uri = new Path(path).toUri > val jarURL = if (uri.getScheme == null) { > // `path` is a local file path without a URL scheme > new File(path).toURI.toURL > } else { > // `path` is a URL with a scheme > {color:red}uri.toURL{color} > } > jarClassLoader.addURL(jarURL) > Thread.currentThread().setContextClassLoader(jarClassLoader) > } > {code} > I think we should call the setURLStreamHandlerFactory method on URL with an instance > of FsUrlStreamHandlerFactory, just like: > {code} > static { > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-21697. - > NPE & ExceptionInInitializerError trying to load UDF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > Labels: bulk-closed > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF from HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} logger is null. > Hypothesis: the commons logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a D/L of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-21697. --- Resolution: Duplicate > NPE & ExceptionInInitializerError trying to load UDF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > Labels: bulk-closed > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF from HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} logger is null. > Hypothesis: the commons logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a D/L of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-21697: --- > NPE & ExceptionInInitializerError trying to load UTF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > Labels: bulk-closed > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF from HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} logger is null. > Hypothesis: the commons logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a D/L of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21697: -- Summary: NPE & ExceptionInInitializerError trying to load UDF from HDFS (was: NPE & ExceptionInInitializerError trying to load UTF from HDFS) > NPE & ExceptionInInitializerError trying to load UDF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > Labels: bulk-closed > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF from HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} logger is null. > Hypothesis: the commons logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a D/L of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21697: -- Labels: (was: bulk-closed) > NPE & ExceptionInInitializerError trying to load UDF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF from HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} logger is null. > Hypothesis: the commons logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a D/L of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25694: -- Affects Version/s: 3.0.0 2.4.4 > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0 >Reporter: Bo Yang >Priority: Minor > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() to > return an FsUrlConnection object, which is not compatible with > HttpURLConnection. This causes exceptions when using some third-party HTTP > libraries (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is the example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. The following is a code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965036#comment-16965036 ] Dongjoon Hyun commented on SPARK-25694: --- Although SPARK-12868 adds `setURLStreamHandlerFactory` for `ADD JARS` commands, this method can be called at most once in a given Java Virtual Machine. So, there is another issue with this. - https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html#setURLStreamHandlerFactory(java.net.URLStreamHandlerFactory) > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Minor > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() to > return an FsUrlConnection object, which is not compatible with > HttpURLConnection. This causes exceptions when using some third-party HTTP > libraries (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is the example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. The following is a code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29452) Improve tooltip information for storage tab
[ https://issues.apache.org/jira/browse/SPARK-29452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29452: Assignee: Rakesh Raushan > Improve tooltip information for storage tab > -- > > Key: SPARK-29452 > URL: https://issues.apache.org/jira/browse/SPARK-29452 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29651) Incorrect parsing of interval seconds fraction
[ https://issues.apache.org/jira/browse/SPARK-29651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29651: Fix Version/s: 2.4.5 > Incorrect parsing of interval seconds fraction > -- > > Key: SPARK-29651 > URL: https://issues.apache.org/jira/browse/SPARK-29651 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > * The fractional part of interval seconds unit is incorrectly parsed if the > number of digits is less than 9, for example: > {code} > spark-sql> select interval '10.123456 seconds'; > interval 10 seconds 123 microseconds > {code} > The result must be *interval 10 seconds 123 milliseconds 456 microseconds* > * If the seconds unit of an interval is negative, it is incorrectly converted > to `CalendarInterval`, for example: > {code} > spark-sql> select interval '-10.123456789 seconds'; > interval -9 seconds -876 milliseconds -544 microseconds > {code} > Taking into account truncation to microseconds, the result must be *interval > -10 seconds -123 milliseconds -456 microseconds* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
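A hedged sketch of the intended fraction handling (an illustrative helper, not the actual patch): right-pad the fractional digits to nanosecond precision before truncating to microseconds, so a 6-digit fraction is not misread as nanoseconds:
{code:java}
// "10.123456 seconds": the fraction "123456" means 123 ms 456 us.
// Padding to 9 digits gives 123456000 ns; integer division by 1000
// truncates to 123456 microseconds.
def fractionToMicros(fraction: String): Long = {
  require(fraction.nonEmpty && fraction.length <= 9, "at most nanosecond precision")
  val nanos = (fraction + "0" * (9 - fraction.length)).toLong
  nanos / 1000
}

assert(fractionToMicros("123456") == 123456L) // not 123
{code}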
[jira] [Created] (SPARK-29718) Support PARTITION BY [RANGE|LIST|HASH] and PARTITION OF in CREATE TABLE
Takeshi Yamamuro created SPARK-29718: Summary: Support PARTITION BY [RANGE|LIST|HASH] and PARTITION OF in CREATE TABLE Key: SPARK-29718 URL: https://issues.apache.org/jira/browse/SPARK-29718 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro 5.10. Table Partitioning: https://www.postgresql.org/docs/current/ddl-partitioning.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29717) Support [CREATE|DROP] RULE - define a new plan rewrite rule
Takeshi Yamamuro created SPARK-29717: Summary: Support [CREATE|DROP] RULE - define a new plan rewrite rule Key: SPARK-29717 URL: https://issues.apache.org/jira/browse/SPARK-29717 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro https://www.postgresql.org/docs/current/sql-createrule.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29716) Support [CREATE|DROP] TYPE
Takeshi Yamamuro created SPARK-29716: Summary: Support [CREATE|DROP] TYPE Key: SPARK-29716 URL: https://issues.apache.org/jira/browse/SPARK-29716 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro https://www.postgresql.org/docs/current/sql-createtype.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29715) Support SELECT statements in VALUES of INSERT INTO
Takeshi Yamamuro created SPARK-29715: Summary: Support SELECT statements in VALUES of INSERT INTO Key: SPARK-29715 URL: https://issues.apache.org/jira/browse/SPARK-29715 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro In PgSQL, we can use SELECT statements in VALUES of INSERT INTO: {code} postgres=# create table t (c0 int, c1 int); CREATE TABLE postgres=# insert into t values (3, (select 1)); INSERT 0 1 postgres=# select * from t; c0 | c1 ----+---- 3 | 1 (1 row) {code} {code} scala> sql("""create table t (c0 int, c1 int) using parquet""") scala> sql("""insert into t values (3, (select 1))""") org.apache.spark.sql.AnalysisException: unresolved operator 'Project [unresolvedalias(1, None)];; 'InsertIntoStatement 'UnresolvedRelation [t], false, false +- 'UnresolvedInlineTable [col1, col2], [List(3, scalar-subquery#0 [])] +- 'Project [unresolvedalias(1, None)] +- OneRowRelation at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$36(CheckAnalysis.scala:540) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$36$adapted(CheckAnalysis.scala:538) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:154) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29452) Improve tooltip information for storage tab
[ https://issues.apache.org/jira/browse/SPARK-29452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29452. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26226 [https://github.com/apache/spark/pull/26226] > Improve tooltip information for storage tab > -- > > Key: SPARK-29452 > URL: https://issues.apache.org/jira/browse/SPARK-29452 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29714) Add insert.sql
Takeshi Yamamuro created SPARK-29714: Summary: Add insert.sql Key: SPARK-29714 URL: https://issues.apache.org/jira/browse/SPARK-29714 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29109) Add window.sql - Part 3
[ https://issues.apache.org/jira/browse/SPARK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-29109. -- Fix Version/s: 3.0.0 Assignee: Dylan Guedes Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26274 > Add window.sql - Part 3 > --- > > Key: SPARK-29109 > URL: https://issues.apache.org/jira/browse/SPARK-29109 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Assignee: Dylan Guedes >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29713) Support Interval Unit Abbreviations in Interval Literals
Kent Yao created SPARK-29713: Summary: Support Interval Unit Abbreviations in Interval Literals Key: SPARK-29713 URL: https://issues.apache.org/jira/browse/SPARK-29713 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao
"year" | "years" | "y" | "yr" | "yrs" => YEAR
"month" | "months" | "mon" | "mons" => MONTH
"week" | "weeks" | "w" => WEEK
"day" | "days" | "d" => DAY
"hour" | "hours" | "h" | "hr" | "hrs" => HOUR
"minute" | "minutes" | "m" | "min" | "mins" => MINUTE
"second" | "seconds" | "s" | "sec" | "secs" => SECOND
"millisecond" | "milliseconds" | "ms" | "msec" | "msecs" | "mseconds" => MILLISECOND
"microsecond" | "microseconds" | "us" | "usec" | "usecs" | "useconds" => MICROSECOND
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29712) fromDayTimeString() does not take into account the left bound
Maxim Gekk created SPARK-29712: -- Summary: fromDayTimeString() does not take into account the left bound Key: SPARK-29712 URL: https://issues.apache.org/jira/browse/SPARK-29712 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Maxim Gekk Currently, fromDayTimeString() takes into account the right bound but not the left one. For example: {code} spark-sql> SELECT interval '1 2:03:04' hour to minute; interval 1 days 2 hours 3 minutes {code} The result should be *interval 2 hours 3 minutes* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
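A hedged sketch of what honoring both bounds could look like (illustrative types, not Spark's internals): fields outside the {{from ... to}} range are dropped, so '1 2:03:04' HOUR TO MINUTE keeps only the hours and minutes:
{code:java}
case class DayTime(days: Int, hours: Int, minutes: Int, seconds: Int)

// Zero out every field that lies outside the [from, to] unit range.
def applyBounds(v: DayTime, from: String, to: String): DayTime = {
  val order = Seq("DAY", "HOUR", "MINUTE", "SECOND")
  val (lo, hi) = (order.indexOf(from), order.indexOf(to))
  def keep(unit: String, x: Int): Int =
    if (order.indexOf(unit) >= lo && order.indexOf(unit) <= hi) x else 0
  DayTime(keep("DAY", v.days), keep("HOUR", v.hours),
    keep("MINUTE", v.minutes), keep("SECOND", v.seconds))
}

// '1 2:03:04' HOUR TO MINUTE -> 2 hours 3 minutes
assert(applyBounds(DayTime(1, 2, 3, 4), "HOUR", "MINUTE") == DayTime(0, 2, 3, 0))
{code}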
[jira] [Assigned] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29643: --- Assignee: Huaxin Gao > ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands > -- > > Key: SPARK-29643 > URL: https://issues.apache.org/jira/browse/SPARK-29643 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29643. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26303 [https://github.com/apache/spark/pull/26303] > ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands > -- > > Key: SPARK-29643 > URL: https://issues.apache.org/jira/browse/SPARK-29643 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds
[ https://issues.apache.org/jira/browse/SPARK-29486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29486. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26134 [https://github.com/apache/spark/pull/26134] > CalendarInterval should have 3 fields: months, days and microseconds > > > Key: SPARK-29486 > URL: https://issues.apache.org/jira/browse/SPARK-29486 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > Current CalendarInterval has 2 fields: months and microseconds. This PR tries > to change it > to 3 fields: months, days and microseconds. This is because one logical day > interval may > have a different number of microseconds (daylight saving time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
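A minimal sketch of the resulting layout (illustrative Scala, not the exact Java class): days are kept separate from microseconds because neither months nor days map to a fixed number of microseconds:
{code:java}
// months: variable length (28-31 days)
// days:   variable length (23-25 hours around DST transitions)
// microseconds: fixed-length remainder
case class CalendarInterval(months: Int, days: Int, microseconds: Long)
{code}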
[jira] [Assigned] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds
[ https://issues.apache.org/jira/browse/SPARK-29486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29486: --- Assignee: Liu, Linhong > CalendarInterval should have 3 fields: months, days and microseconds > > > Key: SPARK-29486 > URL: https://issues.apache.org/jira/browse/SPARK-29486 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > > Current CalendarInterval has 2 fields: months and microseconds. This PR tries > to change it > to 3 fields: months, days and microseconds. This is because one logical day > interval may > have a different number of microseconds (daylight saving time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29711) Dynamically adjust spark sql class log level in beeline
deshanxiao created SPARK-29711: -- Summary: Dynamically adjust spark sql class log level in beeline Key: SPARK-29711 URL: https://issues.apache.org/jira/browse/SPARK-29711 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: deshanxiao We could change the log level in beeline like this: set spark.log.level=debug. It would not be a big change, but it would be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
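A hedged sketch of how such a setting could be applied (an illustrative helper; Spark already exposes a similar per-application knob via {{SparkContext.setLogLevel}}):
{code:java}
import org.apache.log4j.{Level, LogManager}

// Adjust the effective level of Spark's loggers at runtime, e.g. when a
// beeline session issues "set spark.log.level=debug".
def setSparkLogLevel(level: String): Unit =
  LogManager.getLogger("org.apache.spark").setLevel(Level.toLevel(level))
{code}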
[jira] [Commented] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly
[ https://issues.apache.org/jira/browse/SPARK-29710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964696#comment-16964696 ] Shyam commented on SPARK-29710: --- @Gabor Somogyi Can you please help me figure out what is wrong here? > Seeing offsets not resetting even when reset policy is configured explicitly > > > Key: SPARK-29710 > URL: https://issues.apache.org/jira/browse/SPARK-29710 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.1 > Environment: Windows 10, eclipse neos >Reporter: Shyam >Priority: Major > > > even after setting *"auto.offset.reset" to "latest"* I am getting the > error below > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of > range with no configured reset policy for partitions: > \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > \{COMPANY_TRANSACTIONS_INBOUND-16=168} at > org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348) > at > org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234) > > [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
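For context, a commonly suggested configuration, sketched with a placeholder broker address (the topic name is taken from the error above): Spark's Kafka source manages offsets itself and does not honor the consumer's {{auto.offset.reset}}; the source-level options below are the usual knobs. Note that {{startingOffsets}} only applies when a query starts fresh; a restarted query takes its offsets from the checkpoint.
{code:java}
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
  .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")
  .option("startingOffsets", "latest")   // used instead of auto.offset.reset
  .option("failOnDataLoss", "false")     // tolerate offsets aged out of retention
  .load()
{code}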
[jira] [Commented] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964655#comment-16964655 ] Hu Fuwang commented on SPARK-29707: --- I am working on this. > Make PartitionFilters and PushedFilters abbreviate configurable in metadata > --- > > Key: SPARK-29707 > URL: https://issues.apache.org/jira/browse/SPARK-29707 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! > It loses some key information. > Related code: > https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
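A hedged sketch of the kind of configurable truncation involved (an illustrative helper and parameter name, not the actual patch):
{code:java}
// Truncate long filter strings in plan metadata to a configurable width,
// keeping a marker so readers know information was dropped.
def abbreviate(s: String, maxLen: Int): String =
  if (s.length <= maxLen) s else s.take(maxLen - 3) + "..."

// e.g. abbreviate(pushedFilters.mkString("[", ", ", "]"), maxMetadataLen)
{code}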
[jira] [Created] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly
Shyam created SPARK-29710: - Summary: Seeing offsets not resetting even when reset policy is configured explicitly Key: SPARK-29710 URL: https://issues.apache.org/jira/browse/SPARK-29710 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.1 Environment: Windows 10, eclipse neos Reporter: Shyam even after setting *"auto.offset.reset" to "latest"* I am getting the error below org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: \{COMPANY_TRANSACTIONS_INBOUND-16=168} at org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348) at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470) at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361) at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251) at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209) at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234) [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29709) structured streaming The offset in the checkpoint is suddenly reset to the earliest
test created SPARK-29709: Summary: structured streaming The offset in the checkpoint is suddenly reset to the earliest Key: SPARK-29709 URL: https://issues.apache.org/jira/browse/SPARK-29709 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: test In Structured Streaming, the offset in the checkpoint is suddenly reset to the earliest. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org