[jira] [Updated] (SPARK-29611) Sort Kafka metadata by the number of messages

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29611:
--
Affects Version/s: 2.4.4

> Sort Kafka metadata by the number of messages
> -
>
> Key: SPARK-29611
> URL: https://issues.apache.org/jira/browse/SPARK-29611
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: kafka-skew.jpeg
>
>
> We are suffering from data skew in Kafka and need to analyze it through the UI.
> As the attached image shows, the streaming UI is confusing; it would be better
> to add a message-count metric for each Kafka partition and sort by that count.
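
As a rough illustration of the per-partition count being asked for here (not the actual Web UI change), the sketch below derives a message count for each Kafka topic-partition from the offset ranges of a direct-stream batch and sorts by it. It assumes a DStream named {{stream}} obtained from KafkaUtils.createDirectStream with the kafka-0-10 connector.

{code:java}
// Illustration only: per-partition message counts from a direct stream's
// offset ranges, sorted so the most heavily loaded partitions come first.
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges
    .map(r => (r.topic, r.partition, r.untilOffset - r.fromOffset))
    .sortBy { case (_, _, count) => -count }
    .foreach { case (topic, partition, count) =>
      println(s"$topic-$partition: $count messages")
    }
}
{code}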



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29611) Optimize the display of Kafka metadata and sort by the number of messages

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29611.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26266
[https://github.com/apache/spark/pull/26266]

> Optimize the display of Kafka metadata and sort by the number of messages
> -
>
> Key: SPARK-29611
> URL: https://issues.apache.org/jira/browse/SPARK-29611
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0, 3.0.0
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: kafka-skew.jpeg
>
>
> We are suffering from data skew in Kafka and need to analyze it through the UI.
> As the attached image shows, the streaming UI is confusing; it would be better
> to add a message-count metric for each Kafka partition and sort by that count.






[jira] [Updated] (SPARK-29611) Sort Kafka metadata by the number of messages

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29611:
--
Summary: Sort Kafka metadata by the number of messages  (was: Optimize the 
display of Kafka metadata and sort by the number of messages)

> Sort Kafka metadata by the number of messages
> -
>
> Key: SPARK-29611
> URL: https://issues.apache.org/jira/browse/SPARK-29611
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0, 3.0.0
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: kafka-skew.jpeg
>
>
> We are suffering from data skew in Kafka and need to analyze it through the UI.
> As the attached image shows, the streaming UI is confusing; it would be better
> to add a message-count metric for each Kafka partition and sort by that count.






[jira] [Assigned] (SPARK-29611) Optimize the display of Kafka metadata and sort by the number of messages

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29611:
-

Assignee: dengziming

> Optimize the display of Kafka metadata and sort by the number of messages
> -
>
> Key: SPARK-29611
> URL: https://issues.apache.org/jira/browse/SPARK-29611
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0, 3.0.0
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
> Attachments: kafka-skew.jpeg
>
>
> We are suffering from data skew in Kafka and need to analyze it through the UI.
> As the attached image shows, the streaming UI is confusing; it would be better
> to add a message-count metric for each Kafka partition and sort by that count.






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// pruned, only loading itemId
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
read.select($"items.itemId").explain(true)

// not pruned, loading both itemId and itemData
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
read.select(explode($"items.itemId")).explain(true)
{noformat}
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
{noformat}
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  
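
For convenience, the two parts above combine into a single self-contained sketch. It assumes only a spark-shell style session where a SparkSession named {{spark}} is in scope; the imports are the only additions to the snippets above.

{code:java}
// Self-contained reproduction of the pruning behaviour described above.
import org.apache.spark.sql.functions.explode
import spark.implicits._

val jsonStr = """{"items": [{"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"}]}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// Pruned: the physical plan's ReadSchema should list only itemId.
read.select($"items.itemId").explain(true)

// Not pruned: the ReadSchema still lists both itemId and itemData.
read.select(explode($"items.itemId")).explain(true)
{code}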






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
{noformat}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
{noformat}
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{noformat}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{noformat}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
> {noformat}
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

bq. val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {quote}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {quote}
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

bq. val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{quote}val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") {quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> bq. val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{quote}val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") {quote}
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {quote}val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") {quote}
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
 
Part 2: reading it back and explaining the queries
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 

 

Part 2: reading it back and explaining the queries

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 

 

Part 2: reading it back and explaining the queries

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData

 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

 

 

Part 2: reading it back and explaining the queries

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted") 
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
 {{val df = spark.read.json(Seq(jsonStr).toDS)}}
 {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
 spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
 read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
{quote}
 

 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
{{val jsonStr = """{}}
{{ "items": [}}
{{  {}}
{{    "itemId": 1,}}
{{    "itemData": "a"}}
{{  },}}
{{  {}}
{{    "itemId": 1,}}
{{    "itemData": "b"}}
{{  }}}
{{ ]}"""}}
{{val df = spark.read.json(Seq(jsonStr).toDS)}}
{{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
{quote}
 

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {quote}{{import spark.implicits._}}
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
>  {{val df = spark.read.json(Seq(jsonStr).toDS)}}
>  {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
> {quote}
> Part 2: reading it back and explaining the queries
> {quote}val read = spark.table("persisted")
>  spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
>  read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading 
> both itemId and itemData
> {quote}
>  
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")

 

 

Part 2: reading it back and explaining the queries

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData

 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{{val jsonStr = """{}}
{{ "items": [}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "a"}}
{{ },}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "b"}}
{{ }}}
{{ ]}}
{{}"""}}

Part 2: reading it back and explaining the queries

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> val jsonStr = """{
>  "items": [
>  {
>  "itemId": 1,
>  "itemData": "a"
>  },
>  {
>  "itemId": 1,
>  "itemData": "b"
>  }
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
>  
>  
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Kang updated SPARK-29721:
-
Description: 
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data

{{val jsonStr = """{}}
{{ "items": [}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "a"}}
{{ },}}
{{ {}}
{{ "itemId": 1,}}
{{ "itemData": "b"}}
{{ }}}
{{ ]}}
{{}"""}}

Part 2: reading it back and explaining the queries

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData

 

  was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
val jsonStr = """{
 "items": [
 {
 "itemId": 1,
 "itemData": "a"
 },
 {
 "itemId": 1,
 "itemData": "b"
 }
 ]
}"""
 {{val df = spark.read.json(Seq(jsonStr).toDS)}}
 {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
 spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
 read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
{quote}
 

 


> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kai Kang
>Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table between two tables that have a one-to-many relationship, and this is a
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {{val jsonStr = """{}}
> {{ "items": [}}
> {{ {}}
> {{ "itemId": 1,}}
> {{ "itemData": "a"}}
> {{ },}}
> {{ {}}
> {{ "itemId": 1,}}
> {{ "itemData": "b"}}
> {{ }}}
> {{ ]}}
> {{}"""}}
> Part 2: reading it back and explaining the queries
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
>  






[jira] [Created] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2019-11-01 Thread Kai Kang (Jira)
Kai Kang created SPARK-29721:


 Summary: Spark SQL reads unnecessary nested fields from Parquet 
after using explode
 Key: SPARK-29721
 URL: https://issues.apache.org/jira/browse/SPARK-29721
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Kai Kang


This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
pruning for nested structures. However, when explode() is called on a nested
field, all columns of that nested structure are still fetched from the data
source.

We are working on a project to create a Parquet store for a big pre-joined
table between two tables that have a one-to-many relationship, and this is a
blocking issue for us.

 

The following code illustrates the issue. 

Part 1: loading some nested data
{quote}{{import spark.implicits._}}
{{val jsonStr = """{}}
{{ "items": [}}
{{  {}}
{{    "itemId": 1,}}
{{    "itemData": "a"}}
{{  },}}
{{  {}}
{{    "itemId": 1,}}
{{    "itemData": "b"}}
{{  }}}
{{ ]}"""}}
{{val df = spark.read.json(Seq(jsonStr).toDS)}}
{{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId

read.select(explode($"items.itemId")).explain(true) // not pruned, loading both 
itemId and itemData
{quote}
 

 






[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-11-01 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965216#comment-16965216
 ] 

sandeshyapuram commented on SPARK-29682:


[~imback82] & [~cloud_fan] For now I have worked around this by renaming every
column in the dataframes before performing the joins, and that works (see the
sketch below).

Let me know if you have a better workaround to deal with it.
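
A minimal sketch of that rename-before-join workaround, based on the sample code quoted below; the {{g0_}}/{{g1_}} prefixes and the renamed dataframe names are only illustrative, not the exact code used.

{code:java}
// Rename every column of the child dataframes before joining them back to
// the parent, so the self-join no longer sees conflicting attribute references.
import org.apache.spark.sql.functions.col

val group0r = group0.toDF(group0.columns.map("g0_" + _): _*)
val group1r = group1.toDF(group1.columns.map("g1_" + _): _*)

cubeDF.select("nums").distinct
  .join(group0r, col("nums") === col("g0_nums"), "inner")
  .join(group1r, col("nums") === col("g1_nums"), "inner")
  .show
{code}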

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> When I try to self-join a parentDf with multiple childDfs (say childDf1 ... ...),
> where the childDfs are derived from a cube or rollup and are filtered based on
> the group-bys, I get an error:
> {{Failure when resolving conflicting references in Join: }}
> followed by a long error message which is quite unreadable. On the other hand,
> if I replace the cube or rollup with a plain groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> 

[jira] [Created] (SPARK-29720) Add linux condition to make ProcfsMetricsGetter more complete

2019-11-01 Thread ulysses you (Jira)
ulysses you created SPARK-29720:
---

 Summary: Add linux condition to make ProcfsMetricsGetter more 
complete
 Key: SPARK-29720
 URL: https://issues.apache.org/jira/browse/SPARK-29720
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: ulysses you


Currently, the decision of whether proc stats can be gathered into executor metrics is:
{code:java}
procDirExists.get && shouldLogStageExecutorProcessTreeMetrics && 
shouldLogStageExecutorMetrics
{code}
But procfs is only supported on Linux, so an isLinux condition should be added, as 
sketched below.
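
A minimal sketch of the proposed check, assuming Commons Lang's SystemUtils (which Spark 
already depends on) is used for the OS test and reusing the names from the snippet above:
{code:java}
import org.apache.commons.lang3.SystemUtils

// Gather procfs-based metrics only when /proc actually exists, the relevant
// metrics flags are enabled, and the executor is running on Linux.
val isProcfsMetricsAvailable =
  SystemUtils.IS_OS_LINUX &&
    procDirExists.get &&
    shouldLogStageExecutorProcessTreeMetrics &&
    shouldLogStageExecutorMetrics
{code}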



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29719) Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex

2019-11-01 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965183#comment-16965183
 ] 

Yuming Wang commented on SPARK-29719:
-

You should refresh {{my_table}}. A similar issue: 
https://github.com/apache/spark/pull/22721
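
For example, refreshing the table between the two reads forces the cached file listing 
to be rebuilt (standard Spark API; sketched against the code in the description below):
{code:java}
// Invalidate the cached metadata and file listing for the table before re-reading it.
spark.catalog.refreshTable("my_table")
// or, equivalently, in SQL:
// spark.sql("REFRESH TABLE my_table")
val df2 = spark.table("my_table").filter("date=20191101")
{code}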

> Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex
> --
>
> Key: SPARK-29719
> URL: https://issues.apache.org/jira/browse/SPARK-29719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alexander Bessonov
>Priority: Major
>
> Spark attempts to convert Hive tables backed by Parquet and ORC into 
> internal logical relations that cache file locations for the underlying 
> data. That cache isn't invalidated when the partitioned table is re-read 
> later on. The table might have new files by the time it is re-read, and 
> those files might be ignored.
>  
>  
> {code:java}
> val spark = SparkSession.builder()
> .master("yarn")
> .enableHiveSupport
> .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
> .getOrCreate()
> val df1 = spark.table("my_table").filter("date=20191101")
> // Do something with `df1`
> // External process writes to the partition
> val df2 = spark.table("my_table").filter("date=20191101")
> // Do something with `df2`. Data in `df1` and `df2` should be different, but 
> is equal.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-01 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128
 ] 

Huaxin Gao edited comment on SPARK-29691 at 11/1/19 10:33 PM:
--

I checked the doc and implementation. The Estimator fits the model using the 
passed-in optional params instead of the embedded params, but it doesn't 
overwrite the estimator's embedded param values. In your case, the estimator 
uses 0.75 to fit the model, but it still keeps 0.8 for its own 
elasticNetParam. If you get the model's parameters, it should have 0.75 for 
elasticNetParam. This seems to work as designed. 

 
{code:java}
# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={lr.elasticNetParam: 0.75})
print("After:", lrModel.getOrDefault("elasticNetParam"))  # prints 0.75
{code}
 


was (Author: huaxingao):
I checked the doc and implementation. The Estimator fits the model using the 
passed-in optional params instead of the embedded params, but it doesn't 
overwrite the estimator's embedded param values. In your case, the estimator 
uses 0.75 to fit the model, but it still keeps 0.8 for its own 
elasticNetParam. If you get the model's parameters, it should have 0.75 for 
elasticNetParam. This seems to work as designed. 

 
{code:java}
# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={lr.elasticNetParam: 0.75})
print("After:", lrModel.getOrDefault("elasticNetParam"))  # prints 0.75
{code}
 

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-01 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128
 ] 

Huaxin Gao edited comment on SPARK-29691 at 11/1/19 10:32 PM:
--

I checked the doc and implementation. The Estimator fits the model using the 
passed-in optional params instead of the embedded params, but it doesn't 
overwrite the estimator's embedded param values. In your case, the estimator 
uses 0.75 to fit the model, but it still keeps 0.8 for its own 
elasticNetParam. If you get the model's parameters, it should have 0.75 for 
elasticNetParam. This seems to work as designed. 

 
{code:java}
# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={lr.elasticNetParam: 0.75})
print("After:", lrModel.getOrDefault("elasticNetParam"))  # prints 0.75
{code}
 


was (Author: huaxingao):
I checked the doc and implementation. The Estimator fits the model using the 
passed-in optional params instead of the embedded params, but it doesn't 
overwrite the estimator's embedded param values. In your case, the estimator 
uses 0.75 to fit the model, but it still keeps 0.8 for its own 
elasticNetParam. If you get the model's parameters, it should have 0.75 for 
elasticNetParam. This seems to work as designed. 
# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={lr.elasticNetParam: 0.75})
print("After:", lrModel.getOrDefault("elasticNetParam"))  # prints 0.75

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-01 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965128#comment-16965128
 ] 

Huaxin Gao commented on SPARK-29691:


I checked the doc and implementation. The Estimator fits the model using the 
passed-in optional params instead of the embedded params, but it doesn't 
overwrite the estimator's embedded param values. In your case, the estimator 
uses 0.75 to fit the model, but it still keeps 0.8 for its own 
elasticNetParam. If you get the model's parameters, it should have 0.75 for 
elasticNetParam. This seems to work as designed. 
# Fit the model, but with an updated parameter setting:
lrModel = lr.fit(training, params={lr.elasticNetParam: 0.75})
print("After:", lrModel.getOrDefault("elasticNetParam"))  # prints 0.75

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29719) Converted Metastore relations (ORC, Parquet) wouldn't update InMemoryFileIndex

2019-11-01 Thread Alexander Bessonov (Jira)
Alexander Bessonov created SPARK-29719:
--

 Summary: Converted Metastore relations (ORC, Parquet) wouldn't 
update InMemoryFileIndex
 Key: SPARK-29719
 URL: https://issues.apache.org/jira/browse/SPARK-29719
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Alexander Bessonov


Spark attempts to convert Hive tables backed by Parquet and ORC into 
internal logical relations that cache file locations for the underlying data. 
That cache isn't invalidated when the partitioned table is re-read 
later on. The table might have new files by the time it is re-read, and those 
files might be ignored.

 

 
{code:java}
val spark = SparkSession.builder()
.master("yarn")
.enableHiveSupport
.config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
.getOrCreate()

val df1 = spark.table("my_table").filter("date=20191101")
// Do something with `df1`
// External process writes to the partition
val df2 = spark.table("my_table").filter("date=20191101")
// Do something with `df2`. Data in `df1` and `df2` should be different, but is 
equal.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18910) Can't use UDF that jar file in hdfs

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-18910.
-

> Can't use UDF that jar file in hdfs
> ---
>
> Key: SPARK-18910
> URL: https://issues.apache.org/jira/browse/SPARK-18910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Hong Shen
>Priority: Major
>
> When I create a UDF whose jar file is in HDFS, I can't use the UDF. 
> {code}
> spark-sql> create function trans_array as 'com.test.udf.TransArray'  using 
> jar 
> 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
> spark-sql> describe function trans_array;
> Function: test_db.trans_array
> Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
> Usage: N/A.
> Time taken: 0.127 seconds, Fetched 3 row(s)
> spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) 
> from test_spark limit 10;
> Error in query: Undefined function: 'trans_array'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'test_db'.; line 1 pos 7
> {code}
> The reason is that when 
> org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource runs,
>  uri.toURL throws an exception with "failed unknown protocol: hdfs".
> {code}
>   def addJar(path: String): Unit = {
> sparkSession.sparkContext.addJar(path)
> val uri = new Path(path).toUri
> val jarURL = if (uri.getScheme == null) {
>   // `path` is a local file path without a URL scheme
>   new File(path).toURI.toURL
> } else {
>   // `path` is a URL with a scheme
>   {color:red}uri.toURL{color}
> }
> jarClassLoader.addURL(jarURL)
> Thread.currentThread().setContextClassLoader(jarClassLoader)
>   }
> {code}
> I think we should call the setURLStreamHandlerFactory method on URL with an 
> instance of FsUrlStreamHandlerFactory, just like:
> {code}
> static {
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-21697.
-

> NPE & ExceptionInInitializerError trying to load UDF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>  Labels: bulk-closed
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS triggers an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} is 
> happening on the classpath that includes the HDFS JAR; this triggers a download of 
> the JAR, which needs to force-load commons-logging, and, as that's not initialized 
> yet, it NPEs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-21697.
---
Resolution: Duplicate

> NPE & ExceptionInInitializerError trying to load UDF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>  Labels: bulk-closed
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS triggers an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} is 
> happening on the classpath that includes the HDFS JAR; this triggers a download of 
> the JAR, which needs to force-load commons-logging, and, as that's not initialized 
> yet, it NPEs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-21697:
---

> NPE & ExceptionInInitializerError trying to load UTF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>  Labels: bulk-closed
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS triggers an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} is 
> happening on the classpath that includes the HDFS JAR; this triggers a download of 
> the JAR, which needs to force-load commons-logging, and, as that's not initialized 
> yet, it NPEs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21697:
--
Summary: NPE & ExceptionInInitializerError trying to load UDF from HDFS  
(was: NPE & ExceptionInInitializerError trying to load UTF from HDFS)

> NPE & ExceptionInInitializerError trying to load UDF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>  Labels: bulk-closed
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS triggers an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} is 
> happening on the classpath that includes the HDFS JAR; this triggers a download of 
> the JAR, which needs to force-load commons-logging, and, as that's not initialized 
> yet, it NPEs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21697:
--
Labels:   (was: bulk-closed)

> NPE & ExceptionInInitializerError trying to load UDF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS triggers an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} is 
> happening on the classpath that includes the HDFS JAR; this triggers a download of 
> the JAR, which needs to force-load commons-logging, and, as that's not initialized 
> yet, it NPEs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue

2019-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25694:
--
Affects Version/s: 3.0.0
   2.4.4

> URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
> ---
>
> Key: SPARK-25694
> URL: https://issues.apache.org/jira/browse/SPARK-25694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0
>Reporter: Bo Yang
>Priority: Minor
>
> URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() 
> to return an FsUrlConnection object, which is not compatible with 
> HttpURLConnection. This will cause an exception when using some third-party HTTP 
> libraries (e.g. scalaj.http).
> The following code in Spark 2.3.0 introduced the issue: 
> sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala:
> {code}
> object SharedState extends Logging  {   ...   
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())   ...
> }
> {code}
> Here is the example exception when using scalaj.http in Spark:
> {code}
>  StackTrace: scala.MatchError: 
> org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/]
>  (of class org.apache.hadoop.fs.FsUrlConnection)
>  at 
> scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343)
>  at scalaj.http.HttpRequest.exec(Http.scala:335)
>  at scalaj.http.HttpRequest.asString(Http.scala:455)
> {code}
>   
> One option to fix the issue is to return null in 
> URLStreamHandlerFactory.createURLStreamHandler when the protocol is 
> http/https, so it will use the default behavior and be compatible with 
> scalaj.http. Following is the code example:
> {code}
> class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with 
> Logging {
>   private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory()
>   override def createURLStreamHandler(protocol: String): URLStreamHandler = {
> val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol)
> if (handler == null) {
>   return null
> }
> if (protocol != null &&
>   (protocol.equalsIgnoreCase("http")
>   || protocol.equalsIgnoreCase("https"))) {
>   // return null to use system default URLStreamHandler
>   null
> } else {
>   handler
> }
>   }
> }
> {code}
> I would like to get some discussion here before submitting a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue

2019-11-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965036#comment-16965036
 ] 

Dongjoon Hyun commented on SPARK-25694:
---

Although SPARK-12868 adds `setURLStreamHandlerFactory` for `ADD JARS` commands, 
this method can be called at most once in a given Java Virtual Machine. So, 
there is another issue with this.
- 
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html#setURLStreamHandlerFactory(java.net.URLStreamHandlerFactory)
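
A minimal sketch of that JVM restriction (plain Scala, illustrative only):
{code:java}
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// The first registration wins for the lifetime of the JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())

// Any further call in the same JVM fails, typically with
// "java.lang.Error: factory already defined".
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
{code}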

> URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
> ---
>
> Key: SPARK-25694
> URL: https://issues.apache.org/jira/browse/SPARK-25694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Bo Yang
>Priority: Minor
>
> URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() 
> to return an FsUrlConnection object, which is not compatible with 
> HttpURLConnection. This will cause an exception when using some third-party HTTP 
> libraries (e.g. scalaj.http).
> The following code in Spark 2.3.0 introduced the issue: 
> sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala:
> {code}
> object SharedState extends Logging  {   ...   
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())   ...
> }
> {code}
> Here is the example exception when using scalaj.http in Spark:
> {code}
>  StackTrace: scala.MatchError: 
> org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/]
>  (of class org.apache.hadoop.fs.FsUrlConnection)
>  at 
> scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343)
>  at scalaj.http.HttpRequest.exec(Http.scala:335)
>  at scalaj.http.HttpRequest.asString(Http.scala:455)
> {code}
>   
> One option to fix the issue is to return null in 
> URLStreamHandlerFactory.createURLStreamHandler when the protocol is 
> http/https, so it will use the default behavior and be compatible with 
> scalaj.http. Following is the code example:
> {code}
> class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with 
> Logging {
>   private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory()
>   override def createURLStreamHandler(protocol: String): URLStreamHandler = {
> val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol)
> if (handler == null) {
>   return null
> }
> if (protocol != null &&
>   (protocol.equalsIgnoreCase("http")
>   || protocol.equalsIgnoreCase("https"))) {
>   // return null to use system default URLStreamHandler
>   null
> } else {
>   handler
> }
>   }
> }
> {code}
> I would like to get some discussion here before submitting a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29452) Improve tooltip information for storage tab

2019-11-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29452:


Assignee: Rakesh Raushan

> Improve tooltip information for storage tab
> --
>
> Key: SPARK-29452
> URL: https://issues.apache.org/jira/browse/SPARK-29452
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29651) Incorrect parsing of interval seconds fraction

2019-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29651:

Fix Version/s: 2.4.5

> Incorrect parsing of interval seconds fraction
> --
>
> Key: SPARK-29651
> URL: https://issues.apache.org/jira/browse/SPARK-29651
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> * The fractional part of the interval seconds unit is incorrectly parsed if the 
> number of digits is less than 9, for example:
> {code}
> spark-sql> select interval '10.123456 seconds';
> interval 10 seconds 123 microseconds
> {code}
> The result must be *interval 10 seconds 123 milliseconds 456 microseconds*
> * If the seconds unit of an interval is negative, it is incorrectly converted 
> to `CalendarInterval`, for example:
> {code}
> spark-sql> select interval '-10.123456789 seconds';
> interval -9 seconds -876 milliseconds -544 microseconds
> {code}
> Taking into account truncation to microseconds, the result must be *interval 
> -10 seconds -123 milliseconds -456 microseconds*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29718) Support PARTITION BY [RANGE|LIST|HASH] and PARTITION OF in CREATE TABLE

2019-11-01 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29718:


 Summary: Support PARTITION BY [RANGE|LIST|HASH] and PARTITION OF 
in CREATE TABLE
 Key: SPARK-29718
 URL: https://issues.apache.org/jira/browse/SPARK-29718
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


5.10. Table Partitioning: 
https://www.postgresql.org/docs/current/ddl-partitioning.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29717) Support [CREATE|DROP] RULE - define a new plan rewrite rule

2019-11-01 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29717:


 Summary: Support [CREATE|DROP] RULE - define a new plan rewrite 
rule
 Key: SPARK-29717
 URL: https://issues.apache.org/jira/browse/SPARK-29717
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


https://www.postgresql.org/docs/current/sql-createrule.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29716) Support [CREATE|DROP] TYPE

2019-11-01 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29716:


 Summary: Support [CREATE|DROP] TYPE
 Key: SPARK-29716
 URL: https://issues.apache.org/jira/browse/SPARK-29716
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


https://www.postgresql.org/docs/current/sql-createtype.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29715) Support SELECT statements in VALUES of INSERT INTO

2019-11-01 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29715:


 Summary: Support SELECT statements in VALUES of INSERT INTO
 Key: SPARK-29715
 URL: https://issues.apache.org/jira/browse/SPARK-29715
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


In PgSQL, we can use SELECT statements in the VALUES clause of INSERT INTO:
{code}
postgres=# create table t (c0 int, c1 int);
CREATE TABLE
postgres=# insert into t values (3, (select 1));
INSERT 0 1
postgres=# select * from t;
 c0 | c1 
+
  3 |  1
(1 row)
{code}
{code}
scala> sql("""create table t (c0 int, c1 int) using parquet""")

scala> sql("""insert into t values (3, (select 1))""")
org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
[unresolvedalias(1, None)];;
'InsertIntoStatement 'UnresolvedRelation [t], false, false
+- 'UnresolvedInlineTable [col1, col2], [List(3, scalar-subquery#0 [])]
  +- 'Project [unresolvedalias(1, None)]
 +- OneRowRelation

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$36(CheckAnalysis.scala:540)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$36$adapted(CheckAnalysis.scala:538)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:154)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29452) Improve tooltip information for storage tab

2019-11-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29452.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26226
[https://github.com/apache/spark/pull/26226]

> Improve tooltip information for storage tab
> --
>
> Key: SPARK-29452
> URL: https://issues.apache.org/jira/browse/SPARK-29452
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29714) Add insert.sql

2019-11-01 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29714:


 Summary: Add insert.sql
 Key: SPARK-29714
 URL: https://issues.apache.org/jira/browse/SPARK-29714
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29109) Add window.sql - Part 3

2019-11-01 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-29109.
--
Fix Version/s: 3.0.0
 Assignee: Dylan Guedes
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/26274

> Add window.sql - Part 3
> ---
>
> Key: SPARK-29109
> URL: https://issues.apache.org/jira/browse/SPARK-29109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29713) Support Interval Unit Abbreviations in Interval Literals

2019-11-01 Thread Kent Yao (Jira)
Kent Yao created SPARK-29713:


 Summary: Support Interval Unit Abbreviations in Interval Literals
 Key: SPARK-29713
 URL: https://issues.apache.org/jira/browse/SPARK-29713
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


"year" | "years" | "y" | "yr" | "yrs" => YEAR
"month" | "months" | "mon" | "mons" => MONTH
"week" | "weeks" | "w" => WEEK
"day" | "days" | "d" => DAY
"hour" | "hours" | "h" | "hr" | "hrs" => HOUR
"minute" | "minutes" | "m" | "min" | "mins" => MINUTE
"second" | "seconds" | "s" | "sec" | "secs" => SECOND
"millisecond" | "milliseconds" | "ms" | "msec" | "msecs" | "mseconds" => 
MILLISECOND
"microsecond" | "microseconds" | "us" | "usec" | "usecs" | "useconds" => 
MICROSECOND



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29712) fromDayTimeString() does not take into account the left bound

2019-11-01 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29712:
--

 Summary: fromDayTimeString() does not take into account the left 
bound
 Key: SPARK-29712
 URL: https://issues.apache.org/jira/browse/SPARK-29712
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: Maxim Gekk


Currently, fromDayTimeString() takes into account the right bound but not the 
left one. For example:
{code}
spark-sql> SELECT interval '1 2:03:04' hour to minute;
interval 1 days 2 hours 3 minutes
{code}
The result should be *interval 2 hours 3 minutes*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands

2019-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29643:
---

Assignee: Huaxin Gao

> ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29643
> URL: https://issues.apache.org/jira/browse/SPARK-29643
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands

2019-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29643.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26303
[https://github.com/apache/spark/pull/26303]

> ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29643
> URL: https://issues.apache.org/jira/browse/SPARK-29643
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds

2019-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29486.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26134
[https://github.com/apache/spark/pull/26134]

> CalendarInterval should have 3 fields: months, days and microseconds
> 
>
> Key: SPARK-29486
> URL: https://issues.apache.org/jira/browse/SPARK-29486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Liu, Linhong
>Assignee: Liu, Linhong
>Priority: Major
> Fix For: 3.0.0
>
>
> The current CalendarInterval has 2 fields: months and microseconds. This PR tries 
> to change it
> to 3 fields: months, days and microseconds. This is because one logical day 
> interval may
> have a different number of microseconds (daylight saving).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds

2019-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29486:
---

Assignee: Liu, Linhong

> CalendarInterval should have 3 fields: months, days and microseconds
> 
>
> Key: SPARK-29486
> URL: https://issues.apache.org/jira/browse/SPARK-29486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Liu, Linhong
>Assignee: Liu, Linhong
>Priority: Major
>
> The current CalendarInterval has 2 fields: months and microseconds. This PR tries 
> to change it
> to 3 fields: months, days and microseconds. This is because one logical day 
> interval may
> have a different number of microseconds (daylight saving).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29711) Dynamic adjust spark sql class log level in beeline

2019-11-01 Thread deshanxiao (Jira)
deshanxiao created SPARK-29711:
--

 Summary: Dynamic adjust spark sql class log level in beeline
 Key: SPARK-29711
 URL: https://issues.apache.org/jira/browse/SPARK-29711
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: deshanxiao


We could change the log level of Spark SQL classes in beeline with something like: set 
spark.log.level=debug. It would not be a big change, but it would be useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly

2019-11-01 Thread Shyam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964696#comment-16964696
 ] 

Shyam commented on SPARK-29710:
---

@Gabor Somogyi Can you please help me? What is wrong here?

> Seeing offsets not resetting even when reset policy is configured explicitly
> 
>
> Key: SPARK-29710
> URL: https://issues.apache.org/jira/browse/SPARK-29710
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: Window10 , eclipse neos
>Reporter: Shyam
>Priority: Major
>
>  
>  Even after setting *"auto.offset.reset" to "latest"*, I am getting the 
> error below.
>  
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
> range with no configured reset policy for partitions: 
> \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
>  Offsets out of range with no configured reset policy for partitions: 
> \{COMPANY_TRANSACTIONS_INBOUND-16=168} at 
> org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348)
>  at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) 
> at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
>  at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
>  
> [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata

2019-11-01 Thread Hu Fuwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964655#comment-16964655
 ] 

Hu Fuwang commented on SPARK-29707:
---

I am working on this.

> Make PartitionFilters and PushedFilters abbreviate configurable in metadata
> ---
>
> Key: SPARK-29707
> URL: https://issues.apache.org/jira/browse/SPARK-29707
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 
> It loses some key information.
> Related code:
> https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly

2019-11-01 Thread Shyam (Jira)
Shyam created SPARK-29710:
-

 Summary: Seeing offsets not resetting even when reset policy is 
configured explicitly
 Key: SPARK-29710
 URL: https://issues.apache.org/jira/browse/SPARK-29710
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.1
 Environment: Window10 , eclipse neos
Reporter: Shyam


 

 Even after setting *"auto.offset.reset" to "latest"*, I am getting the error below.

 

org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: 
\{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
 Offsets out of range with no configured reset policy for partitions: 
\{COMPANY_TRANSACTIONS_INBOUND-16=168} at 
org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348)
 at 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
 at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
 at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
 at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
 at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
 at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
 at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)

 

[https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29709) structured streaming The offset in the checkpoint is suddenly reset to the earliest

2019-11-01 Thread test (Jira)
test created SPARK-29709:


 Summary: structured streaming The offset in the checkpoint is 
suddenly reset to the earliest
 Key: SPARK-29709
 URL: https://issues.apache.org/jira/browse/SPARK-29709
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: test


In Structured Streaming, the offset in the checkpoint is suddenly reset to the 
earliest offset.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org