[jira] [Created] (SPARK-48690) SPJ: Support auto-shuffle one side + less join keys than partition keys

2024-06-21 Thread Szehon Ho (Jira)
Szehon Ho created SPARK-48690:
-

 Summary: SPJ: Support auto-shuffle one side + less join keys than 
partition keys
 Key: SPARK-48690
 URL: https://issues.apache.org/jira/browse/SPARK-48690
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Szehon Ho






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-21 Thread Yuxiang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxiang Wei updated SPARK-48689:

Description: 
When reading a DataFrame from a JSON file that includes a very long string, Spark 
will incorrectly mark it as a corrupted record even though the format is correct. 
Here is a minimal example with PySpark:


{code:python}
import json
import tempfile
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}

 

 

  was:
When reading a DataFrame from a JSON file that includes a very long string, Spark 
will incorrectly mark it as a corrupted record even though the format is correct. 
Here is a minimal example with PySpark:


{code:python}
import json
import tempfile
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}

 


> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
>
> When reading a DataFrame from a JSON file that includes a very long string, 
> Spark will incorrectly mark it as a corrupted record even though the format is 
> correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = SparkSession.builder \
>     .appName("PySpark JSON Example") \
>     .getOrCreate()
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
>
>     # Print the schema
>     print(df)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-21 Thread Yuxiang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxiang Wei updated SPARK-48689:

Description: 
When reading a DataFrame from a JSON file that includes a very long string, Spark 
will incorrectly mark it as a corrupted record even though the format is correct. 
Here is a minimal example with PySpark:


{code:python}
import json
import tempfile
from pyspark.sql import SparkSession

# Create a Spark session
spark = (SparkSession.builder
    .appName("PySpark JSON Example")
    .getOrCreate()
)

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}

  was:
When reading a DataFrame from a JSON file that includes a very long string, Spark 
will incorrectly mark it as a corrupted record even though the format is correct. 
Here is a minimal example with PySpark:


{code:python}
import json
import tempfile
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}

 

 


> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
>
> When reading a DataFrame from a JSON file that includes a very long string, 
> Spark will incorrectly mark it as a corrupted record even though the format is 
> correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
>
>     # Print the schema
>     print(df)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-21 Thread Yuxiang Wei (Jira)
Yuxiang Wei created SPARK-48689:
---

 Summary: Reading lengthy JSON results in a corrupted record.
 Key: SPARK-48689
 URL: https://issues.apache.org/jira/browse/SPARK-48689
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.1
 Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
Reporter: Yuxiang Wei


When reading a DataFrame from a JSON file that includes a very long string, Spark 
will incorrectly mark it as a corrupted record even though the format is correct. 
Here is a minimal example with PySpark:


{code:python}
import json
import tempfile
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}
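
For context, a minimal Scala sketch (an editorial addition, not part of the original 
report) of how such a parse failure typically surfaces: in the default PERMISSIVE 
mode, a record the JSON reader cannot parse lands in the {{_corrupt_record}} column 
and the expected field is missing. The file path below is a placeholder.

{code:java}
// Hedged sketch: check whether the JSON reader flagged the record as corrupt.
// In PERMISSIVE mode (the default), records that fail to parse are stored in the
// column configured by spark.sql.columnNameOfCorruptRecord ("_corrupt_record").
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Corrupt record check").getOrCreate()

// Placeholder path standing in for the temporary file from the report.
val df = spark.read.json("/tmp/long-string.json")
df.printSchema()

// If the bug reproduces, the schema contains only _corrupt_record instead of "text".
if (df.columns.contains("_corrupt_record")) {
  // Cache before selecting the corrupt-record column: Spark disallows queries that
  // reference only the internal corrupt record column on the raw file source.
  df.cache()
  df.select("_corrupt_record").show(truncate = false)
}
{code}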

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48688) Return reasonable error when calling SQL to_avro and from_avro functions but Avro is not loaded by default

2024-06-21 Thread Daniel (Jira)
Daniel created SPARK-48688:
--

 Summary: Return reasonable error when calling SQL to_avro and 
from_avro functions but Avro is not loaded by default
 Key: SPARK-48688
 URL: https://issues.apache.org/jira/browse/SPARK-48688
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48687) Add changes to implement state schema validation in planning phase on driver for stateful streaming queries

2024-06-21 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856883#comment-17856883
 ] 

Anish Shrigondekar commented on SPARK-48687:


PR here - [https://github.com/apache/spark/pull/47035]

 

[~kabhwan] - PTAL thx !

> Add changes to implement state schema validation in planning phase on driver 
> for stateful streaming queries
> ---
>
> Key: SPARK-48687
> URL: https://issues.apache.org/jira/browse/SPARK-48687
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
> Add changes to implement state schema validation in planning phase on driver 
> for stateful streaming queries



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48687) Add changes to implement state schema validation in planning phase on driver for stateful streaming queries

2024-06-21 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-48687:
--

 Summary: Add changes to implement state schema validation in 
planning phase on driver for stateful streaming queries
 Key: SPARK-48687
 URL: https://issues.apache.org/jira/browse/SPARK-48687
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar


Add changes to implement state schema validation in planning phase on driver 
for stateful streaming queries



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48686) Improve performance of ParserUtils.unescapeSQLString

2024-06-21 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-48686:
--

 Summary: Improve performance of ParserUtils.unescapeSQLString
 Key: SPARK-48686
 URL: https://issues.apache.org/jira/browse/SPARK-48686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


The `ParserUtils.unescapeSQLString` method is currently implemented using 
regexes for part of the parsing, but this slows down the common case where 
escaping is not needed and may be prone to O(n^2) behavior for certain extreme 
inputs.

I think that we should optimize this to remove the use of regexes.

I will submit a PR for this.
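
For illustration only, a hedged sketch of the regex-free direction described above: a 
single forward scan that copies plain characters directly and interprets escapes only 
when a backslash is encountered. This is not the actual patch and handles only a few 
escape forms.

{code:java}
// Sketch of a single-pass, regex-free unescape loop (illustrative; the real
// ParserUtils covers more cases, such as octal and unicode escapes).
def unescapeSketch(s: String): String = {
  val sb = new StringBuilder(s.length)
  var i = 0
  while (i < s.length) {
    val c = s.charAt(i)
    if (c == '\\' && i + 1 < s.length) {
      s.charAt(i + 1) match {
        case 'n'   => sb.append('\n')
        case 't'   => sb.append('\t')
        case '\\'  => sb.append('\\')
        case other => sb.append(other) // unknown escape: keep the escaped character
      }
      i += 2
    } else {
      sb.append(c) // common case: plain character, no regex machinery involved
      i += 1
    }
  }
  sb.toString
}
{code}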



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48685) PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements

2024-06-21 Thread Etienne Soulard-Geoffrion (Jira)
Etienne Soulard-Geoffrion created SPARK-48685:
-

 Summary: PySpark MinHashLSH when used with CountVectorizer doesn't 
meet requirements
 Key: SPARK-48685
 URL: https://issues.apache.org/jira/browse/SPARK-48685
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.5.1
Reporter: Etienne Soulard-Geoffrion
 Fix For: 3.5.1


I'm facing an issue when trying to use the MinHashLSH model: it complains that some 
rows contain only zero values, although I apply a filter to remove such rows before 
fitting the model. Here is sample PySpark code to demonstrate:


```python
from pyspark.ml.feature import CountVectorizer, MinHashLSH, NGram, Tokenizer
from pyspark.ml.linalg import SparseVector
from pyspark.sql import functions as F, types


@F.udf(returnType=types.BooleanType())
def is_non_zero_vector(vector: SparseVector) -> bool:
    """Returns True if the vector has at least one non-zero element."""
    return vector.numNonzeros() > 0


df_text = df.select("id", "text")

tokenizer = Tokenizer(inputCol="text", outputCol="text_tokens")
df_text = tokenizer.transform(df_text).select("id", "text_tokens")

# self.min_hash_lsh_* are configuration attributes of the surrounding class
ngram = NGram(inputCol="text_tokens", outputCol="text_ngrams",
              n=self.min_hash_lsh_ngram_size)
df_text = ngram.transform(df_text).select("id", "text_ngrams")

count_vectorizer = CountVectorizer(inputCol="text_ngrams",
                                   outputCol="text_count_vector").fit(df_text)
df_text = count_vectorizer.transform(df_text).select("id", "text_count_vector")

# Keep only the non-zero vectors
df_text = df_text.filter(is_non_zero_vector(F.col("text_count_vector")))

min_hash_lsh = MinHashLSH(
    inputCol="text_count_vector",
    outputCol="text_hashes",
    seed=self.min_hash_lsh_num_hash_tables,
    numHashTables=self.min_hash_lsh_num_hash_tables,
).fit(df_text)
df_text = min_hash_lsh.transform(df_text)

# Calculate the distance between all pairs of vectors and keep only the pairs
# with a distance > 0 (that are duplicates)
pairs_df = min_hash_lsh.approxSimilarityJoin(df_text, df_text, 0.6,
                                             distCol="vector_distance")
pairs_df = pairs_df.filter("vector_distance != 0")

```

I've also analyzed the dataframe, and there are in fact no rows without at least 
one non-zero index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48545) Create to_avro and from_avro SQL functions to match PySpark equivalent

2024-06-21 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-48545.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46977
[https://github.com/apache/spark/pull/46977]

> Create to_avro and from_avro SQL functions to match PySpark equivalent
> --
>
> Key: SPARK-48545
> URL: https://issues.apache.org/jira/browse/SPARK-48545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The PySpark API is here: 
> https://github.com/apache/spark/blob/d5c33c6bfb5757b243fc8e1734daeaa4fe3b9b32/python/pyspark/sql/avro/functions.py#L35



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48545) Create to_avro and from_avro SQL functions to match PySpark equivalent

2024-06-21 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-48545:
--

Assignee: Daniel

> Create to_avro and from_avro SQL functions to match PySpark equivalent
> --
>
> Key: SPARK-48545
> URL: https://issues.apache.org/jira/browse/SPARK-48545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
>
> The PySpark API is here: 
> https://github.com/apache/spark/blob/d5c33c6bfb5757b243fc8e1734daeaa4fe3b9b32/python/pyspark/sql/avro/functions.py#L35



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries

2024-06-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48655:


Assignee: Szehon Ho

> SPJ: Add tests for shuffle skipping for aggregate queries
> -
>
> Key: SPARK-48655
> URL: https://issues.apache.org/jira/browse/SPARK-48655
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data

2024-06-21 Thread Chhavi Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856832#comment-17856832
 ] 

Chhavi Bansal commented on SPARK-48463:
---

Thank you for the update.

> MLLib function unable to handle nested data
> ---
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 3.5.1
>Reporter: Chhavi Bansal
>Assignee: Weichen Xu
>Priority: Major
>  Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening, but 
> it fails.
>  
> {code:java}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row( Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
> .add("longitude", IntegerType)
> .add("latitude", IntegerType))
>   .add("salary", IntegerType) 
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), 
> structureSchema) 
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: 
> String = null):
> Array[Column] = {
>   schema.fields.flatMap(f => {
> val colName = if (prefix == null) f.name else (prefix + "." + f.name)
> val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + 
> f.name)
> f.dataType match {
>   case st: StructType => flattenSchema(st, colName, colnameSelect)
>   case _ =>
> Array(col(colName).as(colnameSelect))
> }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
> Now using the string indexer on the DOT notation.
>  
> {code:java}
> val si = new 
> StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
> The above code fails 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve column name "location.longitude" among (location.longitude, 
> location.latitude, salary); did you mean to quote the `location.longitude` 
> column?
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
> . {code}
> This points to the same failure as when we try to select dot notation columns 
> in a spark dataframe, which is solved using BACKTICKS *`column.name`.* 
> [https://stackoverflow.com/a/51430335/11688337]
>  
> *so next*
> I use the back ticks while defining stringIndexer
> {code:java}
> val si = new 
> StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
>  {code}
> In this case *it again fails* (with a different reason) in the StringIndexer 
> code itself
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column 
> `location.longitude` does not exist.
>     at 
> org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> {code}
>  
> This blocks me from using feature transformation functions on nested columns. 
> Any help in solving this problem would be highly appreciated.
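
A possible workaround (an editorial sketch, not from the reporter): alias the 
flattened columns to dot-free names before building the pipeline, so StringIndexer 
never has to resolve a column name containing a dot. It assumes the {{flattenedDf}} 
produced by the code above; the output column name is a placeholder.

{code:java}
// Hedged workaround sketch: rename flattened columns to use "_" instead of ".".
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val safeDf = flattenedDf.toDF(flattenedDf.columns.map(_.replace(".", "_")): _*)

val si = new StringIndexer()
  .setInputCol("location_longitude")   // was "location.longitude"
  .setOutputCol("longitude_indexed")
val pipeline = new Pipeline().setStages(Array(si))
pipeline.fit(safeDf).transform(safeDf).show()
{code}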



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48675) Cache table doesn't work with collated column

2024-06-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48675:
---

Assignee: Nikola Mandic

> Cache table doesn't work with collated column
> -
>
> Key: SPARK-48675
> URL: https://issues.apache.org/jira/browse/SPARK-48675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>
> Following sequence of queries produces the error:
> {code:java}
> >  cache lazy table t as select col from values ('a' collate utf8_lcase) as 
> > (col);
> > select col from t;
> org.apache.spark.SparkException: not support type: 
> org.apache.spark.sql.types.StringType@1.
>         at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.notSupportTypeError(QueryExecutionErrors.scala:1069)
>         at 
> org.apache.spark.sql.execution.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:200)
>         at 
> org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.$anonfun$next$1(InMemoryRelation.scala:85)
>         at scala.collection.immutable.List.map(List.scala:247)
>         at scala.collection.immutable.List.map(List.scala:79)
>         at 
> org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:84)
>         at 
> org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:82)
>         at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$2.next(InMemoryRelation.scala:296)
>         at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$2.next(InMemoryRelation.scala:293)
> ... {code}
> This is also the problem on non-lazy cached tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48675) Cache table doesn't work with collated column

2024-06-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48675.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47045
[https://github.com/apache/spark/pull/47045]

> Cache table doesn't work with collated column
> -
>
> Key: SPARK-48675
> URL: https://issues.apache.org/jira/browse/SPARK-48675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> Following sequence of queries produces the error:
> {code:java}
> >  cache lazy table t as select col from values ('a' collate utf8_lcase) as 
> > (col);
> > select col from t;
> org.apache.spark.SparkException: not support type: 
> org.apache.spark.sql.types.StringType@1.
>         at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.notSupportTypeError(QueryExecutionErrors.scala:1069)
>         at 
> org.apache.spark.sql.execution.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:200)
>         at 
> org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.$anonfun$next$1(InMemoryRelation.scala:85)
>         at scala.collection.immutable.List.map(List.scala:247)
>         at scala.collection.immutable.List.map(List.scala:79)
>         at 
> org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:84)
>         at 
> org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:82)
>         at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$2.next(InMemoryRelation.scala:296)
>         at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$2.next(InMemoryRelation.scala:293)
> ... {code}
> This is also the problem on non-lazy cached tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48680) Add char/varchar doc to language specific tables

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48680:
---
Labels: pull-request-available  (was: )

> Add char/varchar doc to language specific tables
> 
>
> Key: SPARK-48680
> URL: https://issues.apache.org/jira/browse/SPARK-48680
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48664) Add official image Dockerfile for Apache Spark 4.0.0-preview1

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48664:
---
Labels: pull-request-available  (was: )

> Add official image Dockerfile for Apache Spark 4.0.0-preview1
> -
>
> Key: SPARK-48664
> URL: https://issues.apache.org/jira/browse/SPARK-48664
> Project: Spark
>  Issue Type: Task
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48656) ArrayIndexOutOfBoundsException in CartesianRDD getPartitions

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48656:
---
Labels: pull-request-available  (was: )

> ArrayIndexOutOfBoundsException in CartesianRDD getPartitions
> 
>
> Key: SPARK-48656
> URL: https://issues.apache.org/jira/browse/SPARK-48656
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Nick Young
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ```
> val rdd1 = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 65536)
> val rdd2 = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 65536)
> rdd2.cartesian(rdd1).partitions
> ```
> Throws `ArrayIndexOutOfBoundsException: 0` at CartesianRDD.scala:69 because 
> `s1.index * numPartitionsInRdd2 + s2.index` overflows and wraps to 0. We 
> should provide a better error message which indicates that the number of 
> partitions overflowed, so it's easier for the user to debug.
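
To make the failure mode concrete, a small hedged Scala sketch (editorial, not from 
the ticket) of the 32-bit wrap-around and the kind of defensive check the ticket 
asks for:

{code:java}
// With 65536 partitions on each side, the Int product 2^32 wraps to 0, so the
// partitions array is allocated with length 0 and the first lookup fails.
val numPartitionsInRdd1 = 65536
val numPartitionsInRdd2 = 65536
println(numPartitionsInRdd1 * numPartitionsInRdd2)          // prints 0

// A 64-bit check that could back the clearer error message suggested above
// (hypothetical, not the actual patch).
val total = numPartitionsInRdd1.toLong * numPartitionsInRdd2
require(total <= Int.MaxValue,
  s"Cartesian product needs $total partitions, which exceeds Int.MaxValue")
{code}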



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48655:
---
Labels: pull-request-available  (was: )

> SPJ: Add tests for shuffle skipping for aggregate queries
> -
>
> Key: SPARK-48655
> URL: https://issues.apache.org/jira/browse/SPARK-48655
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48684) Print related JIRA summary before proceeding merge

2024-06-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48684.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47057
[https://github.com/apache/spark/pull/47057]

> Print related JIRA summary before proceeding merge
> --
>
> Key: SPARK-48684
> URL: https://issues.apache.org/jira/browse/SPARK-48684
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48684) Print related JIRA summary before proceeding merge

2024-06-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48684:
-
Component/s: Project Infra
 (was: SQL)

> Print related JIRA summary before proceeding merge
> --
>
> Key: SPARK-48684
> URL: https://issues.apache.org/jira/browse/SPARK-48684
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48684) Print related JIRA summary before proceeding merge

2024-06-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48684:
-
Priority: Minor  (was: Major)

> Print related JIRA summary before proceeding merge
> --
>
> Key: SPARK-48684
> URL: https://issues.apache.org/jira/browse/SPARK-48684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48684) Print related JIRA summary before proceeding merge

2024-06-21 Thread Kent Yao (Jira)
Kent Yao created SPARK-48684:


 Summary: Print related JIRA summary before proceeding merge
 Key: SPARK-48684
 URL: https://issues.apache.org/jira/browse/SPARK-48684
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48662) Fix StructsToXml expression with collations

2024-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48662:


Assignee: Mihailo Milosevic

> Fix StructsToXml expression with collations
> ---
>
> Key: SPARK-48662
> URL: https://issues.apache.org/jira/browse/SPARK-48662
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48662) Fix StructsToXml expression with collations

2024-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48662.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47053
[https://github.com/apache/spark/pull/47053]

> Fix StructsToXml expression with collations
> ---
>
> Key: SPARK-48662
> URL: https://issues.apache.org/jira/browse/SPARK-48662
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data

2024-06-21 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856713#comment-17856713
 ] 

Weichen Xu commented on SPARK-48463:


I will try to do it this sprint. (and then cherrypick it to databricks runtime)

> MLLib function unable to handle nested data
> ---
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 3.5.1
>Reporter: Chhavi Bansal
>Assignee: Weichen Xu
>Priority: Major
>  Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening, but 
> it fails.
>  
> {code:java}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row( Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
> .add("longitude", IntegerType)
> .add("latitude", IntegerType))
>   .add("salary", IntegerType) 
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), 
> structureSchema) 
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: 
> String = null):
> Array[Column] = {
>   schema.fields.flatMap(f => {
> val colName = if (prefix == null) f.name else (prefix + "." + f.name)
> val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + 
> f.name)
> f.dataType match {
>   case st: StructType => flattenSchema(st, colName, colnameSelect)
>   case _ =>
> Array(col(colName).as(colnameSelect))
> }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
> Now using the string indexer on the DOT notation.
>  
> {code:java}
> val si = new 
> StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
> The above code fails 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve column name "location.longitude" among (location.longitude, 
> location.latitude, salary); did you mean to quote the `location.longitude` 
> column?
>     at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
> . {code}
> This points to the same failure as when we try to select dot notation columns 
> in a spark dataframe, which is solved using BACKTICKS *`column.name`.* 
> [https://stackoverflow.com/a/51430335/11688337]
>  
> *so next*
> I use the back ticks while defining stringIndexer
> {code:java}
> val si = new 
> StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
>  {code}
> In this case *it again fails* (with a different reason) in the StringIndexer 
> code itself
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column 
> `location.longitude` does not exist.
>     at 
> org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> {code}
>  
> This blocks me from using feature transformation functions on nested columns. 
> Any help in solving this problem would be highly appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47258:
--

Assignee: Apache Spark

> Assign error classes to SHOW CREATE TABLE errors
> 
>
> Key: SPARK-47258
> URL: https://issues.apache.org/jira/browse/SPARK-47258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_127[0-5]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose a solution to users for how to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
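
A hedged sketch of the kind of test described above, assuming the usual 
{{checkError()}} helper available in Spark's test suites (e.g. inside a suite that 
mixes in QueryTest); the error class name and parameters below are placeholders, not 
the names this ticket will choose.

{code:java}
// Illustrative only: trigger the error from user code and assert on the error
// class and message parameters rather than on the full error text.
test("SHOW CREATE TABLE error is reported with a named error class") {
  checkError(
    exception = intercept[AnalysisException] {
      sql("SHOW CREATE TABLE some_unsupported_table")
    },
    errorClass = "PLACEHOLDER_NAME_FOR_LEGACY_ERROR_TEMP_127X",
    parameters = Map("tableName" -> "`some_unsupported_table`"))
}
{code}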



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors

2024-06-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47258:
--

Assignee: (was: Apache Spark)

> Assign error classes to SHOW CREATE TABLE errors
> 
>
> Key: SPARK-47258
> URL: https://issues.apache.org/jira/browse/SPARK-47258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_127[0-5]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48666) A filter should not be pushed down if it contains Unevaluable expression

2024-06-21 Thread Yokesh NK (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856433#comment-17856433
 ] 

Yokesh NK edited comment on SPARK-48666 at 6/21/24 9:11 AM:


During `PruneFileSourcePartitions` optimization, the expression 
`isnotnull(getdata(cast(snapshot_date#2 as string))#30)` is converted into 
`isnotnull(getdata(cast(input[1, int, true] as string))#30)`. In this case, 
`getdata` is a Python user-defined function (PythonUDF). However, when 
attempting to evaluate the transformed expression `getdata(cast(input[1, int, 
true] as string))`, the function fails to execute correctly. As a test, 
excluding the rule `PruneFileSourcePartitions` lets this execution complete 
with no issue. So this bug is expected to be fixed in Spark.


was (Author: JIRAUSER302587):
During `PruneFileSourcePartitions` optimization, it converts the expression 
`isnotnull(getdata(cast(snapshot_date#2 as string))#30)` into 
`isnotnull(getdata(cast(input[1, int, true] as string))#30)`. In this case, 
`getdata` is a Python User-Defined Function (PythonUDF). However, when 
attempting to evaluate the transformed expression `getdata(cast(input[1, int, 
true] as string))`, the function fails to execute correctly. Just to test, 
excluding the rule `PruneFileSourcePartitions`, let this execution complete 
with no issue. So, this bug will be fixed in Spark.

> A filter should not be pushed down if it contains Unevaluable expression
> 
>
> Key: SPARK-48666
> URL: https://issues.apache.org/jira/browse/SPARK-48666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Zheng
>Priority: Major
>
> We should avoid pushing down Unevaluable expression as it can cause 
> unexpected failures. For example, the code snippet below (assuming there is a 
> table {{t}} with a partition column {{p}})
> {code:java}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType
> import pyspark.sql.functions as f
> def getdata(p: str) -> str:
>     return "data"
> NEW_COLUMN = 'new_column'
> P_COLUMN = 'p'
> f_getdata = f.udf(getdata, StringType())
> rows = spark.sql("select * from default.t")
> table = rows.withColumn(NEW_COLUMN, f_getdata(f.col(P_COLUMN)))
> df = table.alias('t1').join(table.alias('t2'), (f.col(f"t1.{NEW_COLUMN}") == 
> f.col(f"t2.{NEW_COLUMN}")), how='inner')
> df.show(){code}
> will cause an error like:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: 
> getdata(input[0, string, true])#16
>     at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>     at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:66)
>     at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:391)
>     at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:390)
>     at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:71)
>     at 
> org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:384)
>     at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>     at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:166)
>     at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:165)
>  {code}
>  
>  
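
One hedged way to express the guard the summary asks for, sketched against Catalyst's 
expression API (an editorial assumption about where such a check could live, not the 
actual fix):

{code:java}
// A predicate is only safe to evaluate during partition pruning / pushdown if it
// contains no Unevaluable sub-expression (PythonUDF is Unevaluable, so the filter
// from the report above would be kept out of the pushed-down set).
import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}

def isSafeToPushDown(predicate: Expression): Boolean =
  predicate.find(_.isInstanceOf[Unevaluable]).isEmpty
{code}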



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48683) Schema evolution with `df.mergeInto` losing `when` clauses

2024-06-21 Thread Pengfei Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengfei Xu updated SPARK-48683:
---
Component/s: SQL
 (was: Spark Core)

> Schema evolution with `df.mergeInto` losing `when` clauses
> --
>
> Key: SPARK-48683
> URL: https://issues.apache.org/jira/browse/SPARK-48683
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Pengfei Xu
>Priority: Major
>
> When calling {{df.mergeInto(...).when...(...).withSchemaEvolution()}} all 
> {{when}} clauses will be lost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48683) Schema evolution with `df.mergeInto` losing `when` clauses

2024-06-21 Thread Pengfei Xu (Jira)
Pengfei Xu created SPARK-48683:
--

 Summary: Schema evolution with `df.mergeInto` losing `when` clauses
 Key: SPARK-48683
 URL: https://issues.apache.org/jira/browse/SPARK-48683
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Pengfei Xu


When calling {{df.mergeInto(...).when...(...).withSchemaEvolution()}} all 
{{when}} clauses will be lost.
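
For illustration, a hedged Scala sketch of the call pattern being described; the API 
names follow my reading of the Spark 4.0 {{MergeIntoWriter}} interface, and the table 
and column names are placeholders, so treat all of it as assumptions.

{code:java}
// Sketch of the reported pattern: several when* clauses followed by
// withSchemaEvolution(); the report says the preceding clauses are lost.
import org.apache.spark.sql.functions.col

source.mergeInto("target_table", col("target_table.id") === col("id"))
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .withSchemaEvolution()
  .merge()
{code}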



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48659) Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests

2024-06-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48659.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47018
[https://github.com/apache/spark/pull/47018]

> Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
> --
>
> Key: SPARK-48659
> URL: https://issues.apache.org/jira/browse/SPARK-48659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48659) Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests

2024-06-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48659:
---

Assignee: BingKun Pan

> Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
> --
>
> Key: SPARK-48659
> URL: https://issues.apache.org/jira/browse/SPARK-48659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48682) Use ICU in InitCap expression (UTF8_BINARY collation)

2024-06-21 Thread Jira
Uroš Bojanić created SPARK-48682:


 Summary: Use ICU in InitCap expression (UTF8_BINARY collation)
 Key: SPARK-48682
 URL: https://issues.apache.org/jira/browse/SPARK-48682
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48681) Use ICU in Lower/Upper expressions (UTF8_BINARY collation)

2024-06-21 Thread Jira
Uroš Bojanić created SPARK-48681:


 Summary: Use ICU in Lower/Upper expressions (UTF8_BINARY collation)
 Key: SPARK-48681
 URL: https://issues.apache.org/jira/browse/SPARK-48681
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48680) Add char/varchar doc to language specific tables

2024-06-21 Thread Kent Yao (Jira)
Kent Yao created SPARK-48680:


 Summary: Add char/varchar doc to language specific tables
 Key: SPARK-48680
 URL: https://issues.apache.org/jira/browse/SPARK-48680
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org