[jira] [Updated] (SPARK-48091) Using `explode` together with `transform` in the same select statement causes aliases in the transformed column to be ignored

2024-06-06 Thread Ron Serruya (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Serruya updated SPARK-48091:

Description: 
When using the `explode` function and the `transform` function in the same select
statement, aliases used inside the transformed column are ignored.

This behavior occurs with both the PySpark and Scala APIs, but not with the SQL API.

 
{code:python}
from pyspark.sql import functions as F

# Create the df
df = spark.createDataFrame([
    {"id": 1, "array1": ['a', 'b'], 'array2': [2, 3, 4]}
]){code}
Good case, where all aliases are preserved:

 
{code:python}
df.select(
    F.transform(
        'array2',
        lambda x: F.struct(x.alias("some_alias"),
                           F.col("id").alias("second_alias"))
    ).alias("new_array2")
).printSchema()

root
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true){code}
Bad case: when using explode, the aliases inside the transformed column are
ignored; `id` is kept instead of `second_alias`, and `x_17` is used instead of
`some_alias`:
{code:python}
df.select(
    F.explode("array1").alias("exploded"),
    F.transform(
        'array2',
        lambda x: F.struct(x.alias("some_alias"),
                           F.col("id").alias("second_alias"))
    ).alias("new_array2")
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x_17: long (nullable = true)
 |    |    |-- id: long (nullable = true){code}
 

The same behavior reproduces with the Scala API:
{code:scala}
import org.apache.spark.sql.functions._

val df2 = df.select(
  array(lit(1), lit(2), lit(3)).as("my_array"),
  array(lit(1), lit(2), lit(3)).as("my_array2")
)

df2.select(
  explode($"my_array").as("exploded"),
  transform($"my_array2", x => struct(x.as("data"))).as("my_struct")
).printSchema
{code}
{noformat}
root
 |-- exploded: integer (nullable = false)
 |-- my_struct: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x_2: integer (nullable = false)
{noformat}


 

When using the SQL API instead, the aliases are preserved:
{code:python}
spark.sql(
    """
    select explode(array1) as exploded,
           transform(array2, x -> struct(x as some_alias, id as second_alias)) as array2
    from {df}
    """, df=df
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true){code}
 

Workaround: for now, `F.named_struct` (which takes the field names as explicit literals) can be used instead of `F.struct`.
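A minimal sketch of that workaround, reusing the field names from the examples above. It assumes `F.named_struct` (available in PySpark 3.5+), which takes alternating literal-name and value columns, so the field names cannot be rewritten by the analyzer; the local-mode session setup is only for a self-contained run:
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([{"id": 1, "array1": ["a", "b"], "array2": [2, 3, 4]}])

# named_struct pins the field names via literals, so they survive even when
# explode() appears in the same select.
result = df.select(
    F.explode("array1").alias("exploded"),
    F.transform(
        "array2",
        lambda x: F.named_struct(
            F.lit("some_alias"), x,
            F.lit("second_alias"), F.col("id"),
        ),
    ).alias("new_array2"),
)
names = result.schema["new_array2"].dataType.elementType.fieldNames()
print(names)
{code}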


Ron Serruya updated SPARK-48091:

Environment: Scala 2.12.15, Python 3.10, 3.12, OSX 14.4 and Databricks DBR 
13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1   (was: Python 3.10, 3.12, OSX 14.4 and 
Databricks DBR 13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1)

> Using `explode` together with `transform` in the same select statement causes 
> aliases in the transformed column to be ignored
> -
>
> Key: SPARK-48091
> URL: https://issues.apache.org/jira/browse/SPARK-48091
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0, 3.5.1
> Environment: Scala 2.12.15, Python 3.10, 3.12, OSX 14.4 and 
> Databricks DBR 13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1 
>Reporter: Ron Serruya
>Priority: Minor
>  Labels: alias
>


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Ron Serruya updated SPARK-48091:

Component/s: Spark Core
 (was: PySpark)



Ron Serruya updated SPARK-48091:

Labels: alias  (was: PySpark alias)




2024-05-02 Thread Ron Serruya (Jira)

Ron Serruya updated SPARK-48091: