[jira] [Created] (SPARK-48690) SPJ: Support auto-shuffle one side + less join keys than partition keys
Szehon Ho created SPARK-48690:
---------------------------------

Summary: SPJ: Support auto-shuffle one side + less join keys than partition keys
Key: SPARK-48690
URL: https://issues.apache.org/jira/browse/SPARK-48690
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Szehon Ho
[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuxiang Wei updated SPARK-48689:
--------------------------------

Description:
When reading a data frame from a JSON file that includes a very long string, Spark will incorrectly turn it into a corrupted record even though the format is correct. Here is a minimal example with PySpark:

{code:python}
import json
import tempfile

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
> Reporter: Yuxiang Wei
> Priority: Major
> Labels: Reader
[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuxiang Wei updated SPARK-48689:
--------------------------------

Description:
When reading a data frame from a JSON file that includes a very long string, Spark will incorrectly turn it into a corrupted record even though the format is correct. Here is a minimal example with PySpark:

{code:python}
import json
import tempfile

from pyspark.sql import SparkSession

# Create a Spark session
spark = (SparkSession.builder
    .appName("PySpark JSON Example")
    .getOrCreate()
)

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}

was: the same description, with the session builder written as:

{code:python}
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()
{code}

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
> Reporter: Yuxiang Wei
> Priority: Major
> Labels: Reader
[jira] [Created] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
Yuxiang Wei created SPARK-48689:
-----------------------------------

Summary: Reading lengthy JSON results in a corrupted record.
Key: SPARK-48689
URL: https://issues.apache.org/jira/browse/SPARK-48689
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.1
Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
Reporter: Yuxiang Wei

When reading a data frame from a JSON file that includes a very long string, Spark will incorrectly turn it into a corrupted record even though the format is correct. Here is a minimal example with PySpark:

{code:python}
import json
import tempfile

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark JSON Example") \
    .getOrCreate()

# Define the JSON content
data = {
    "text": "a" * 1
}

# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
    # Write the JSON content to the temporary file
    tmp_file.write(json.dumps(data) + "\n")
    tmp_file_path = tmp_file.name

    # Load the JSON file into a PySpark DataFrame
    df = spark.read.json(tmp_file_path)

    # Print the schema
    print(df)
{code}
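For reference, a quick way to see whether such a record parsed or was flagged as corrupt (a minimal sketch, assuming the default corrupt-record column name, which is configurable via spark.sql.columnNameOfCorruptRecord):

{code:python}
df = spark.read.json(tmp_file_path)
df.printSchema()

# A successful parse yields a `text` column; a failed parse instead yields a
# single `_corrupt_record` column holding the raw line.
if "_corrupt_record" in df.columns:
    df.select("_corrupt_record").show(1, truncate=80)
{code}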
[jira] [Created] (SPARK-48688) Return reasonable error when calling SQL to_avro and from_avro functions but Avro is not loaded by default
Daniel created SPARK-48688:
------------------------------

Summary: Return reasonable error when calling SQL to_avro and from_avro functions but Avro is not loaded by default
Key: SPARK-48688
URL: https://issues.apache.org/jira/browse/SPARK-48688
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Daniel
[jira] [Commented] (SPARK-48687) Add changes to implement state schema validation in planning phase on driver for stateful streaming queries
[ https://issues.apache.org/jira/browse/SPARK-48687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856883#comment-17856883 ]

Anish Shrigondekar commented on SPARK-48687:
--------------------------------------------

PR here - [https://github.com/apache/spark/pull/47035]

[~kabhwan] - PTAL thx !

> Add changes to implement state schema validation in planning phase on driver for stateful streaming queries
> ------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-48687
> URL: https://issues.apache.org/jira/browse/SPARK-48687
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Anish Shrigondekar
> Priority: Major
>
> Add changes to implement state schema validation in planning phase on driver for stateful streaming queries
[jira] [Created] (SPARK-48687) Add changes to implement state schema validation in planning phase on driver for stateful streaming queries
Anish Shrigondekar created SPARK-48687:
------------------------------------------

Summary: Add changes to implement state schema validation in planning phase on driver for stateful streaming queries
Key: SPARK-48687
URL: https://issues.apache.org/jira/browse/SPARK-48687
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar

Add changes to implement state schema validation in planning phase on driver for stateful streaming queries
[jira] [Created] (SPARK-48686) Improve performance of ParserUtils.unescapeSQLString
Josh Rosen created SPARK-48686:
----------------------------------

Summary: Improve performance of ParserUtils.unescapeSQLString
Key: SPARK-48686
URL: https://issues.apache.org/jira/browse/SPARK-48686
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen

The `ParserUtils.unescapeSQLString` method is currently implemented using regexes for part of the parsing, but this slows down the common case where escaping is not needed and may be prone to O(n^2) behavior for certain extreme inputs. I think that we should optimize this to remove the use of regexes. I will submit a PR for this.
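To illustrate the idea, here is a minimal Python sketch (an assumed approach, not the actual Scala patch) of replacing a regex-driven scan with a single pass that fast-paths strings containing no escape character:

{code:python}
def unescape_sql_string(s: str) -> str:
    """Single-pass unescape; the common no-escape case costs one scan."""
    if "\\" not in s:  # fast path: nothing to unescape
        return s
    # Handle a few simple escapes; a full implementation would also cover
    # octal and unicode escape sequences.
    simple = {"n": "\n", "t": "\t", "r": "\r", "'": "'", '"': '"', "\\": "\\"}
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(simple.get(s[i + 1], s[i + 1]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(unescape_sql_string(r"tab:\there"))  # prints "tab:" then a real tab
{code}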
[jira] [Created] (SPARK-48685) PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements
Etienne Soulard-Geoffrion created SPARK-48685:
-------------------------------------------------

Summary: PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements
Key: SPARK-48685
URL: https://issues.apache.org/jira/browse/SPARK-48685
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 3.5.1
Reporter: Etienne Soulard-Geoffrion
Fix For: 3.5.1

I'm facing an issue when trying to use the MinHashLSH model: it complains about some rows having only zero values even though I apply a filter before using the model. Here is sample code to demonstrate, using PySpark:

{code:python}
from pyspark.ml.feature import CountVectorizer, MinHashLSH, NGram, Tokenizer
from pyspark.ml.linalg import SparseVector
from pyspark.sql import functions as F, types


@F.udf(returnType=types.BooleanType())
def is_non_zero_vector(vector: SparseVector) -> bool:
    """Returns True if the vector has at least one non-zero element"""
    return vector.numNonzeros() > 0


df_text = df.select("id", "text")

tokenizer = Tokenizer(inputCol="text", outputCol="text_tokens")
df_text = tokenizer.transform(df_text).select("id", "text_tokens")

ngram = NGram(inputCol="text_tokens", outputCol="text_ngrams", n=self.min_hash_lsh_ngram_size)
df_text = ngram.transform(df_text).select("id", "text_ngrams")

count_vectorizer = CountVectorizer(inputCol="text_ngrams", outputCol="text_count_vector").fit(df_text)
df_text = count_vectorizer.transform(df_text).select("id", "text_count_vector")

# Keep only the non-zero vectors
df_text = df_text.filter(is_non_zero_vector(F.col("text_count_vector")))

min_hash_lsh = MinHashLSH(
    inputCol="text_count_vector",
    outputCol="text_hashes",
    seed=self.min_hash_lsh_num_hash_tables,
    numHashTables=self.min_hash_lsh_num_hash_tables,
).fit(df_text)
df_text = min_hash_lsh.transform(df_text)

# Calculate the distance between all pairs of vectors and keep only the
# pairs with a distance > 0 (that are duplicates)
pairs_df = min_hash_lsh.approxSimilarityJoin(df_text, df_text, 0.6, distCol="vector_distance")
pairs_df = pairs_df.filter("vector_distance != 0")
{code}

I've also analyzed the dataframe, and there are in fact no rows without at least one non-zero entry.
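For context, a minimal standalone illustration of the constraint the report refers to (a sketch, not from the report; MinHashLSH requires every input vector to have at least one non-zero entry):

{code:python}
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, Vectors.sparse(3, [0], [1.0])),   # has a non-zero entry: valid
     (1, Vectors.sparse(3, [], []))],      # all-zero vector: invalid input
    ["id", "features"],
)

model = MinHashLSH(inputCol="features", outputCol="hashes", seed=42).fit(df)

# Hashing the all-zero vector (id=1) fails at evaluation time with
# "Must have at least 1 non zero entry.", the requirement the report says
# is violated despite filtering beforehand.
model.transform(df).show()
{code}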
[jira] [Resolved] (SPARK-48545) Create to_avro and from_avro SQL functions to match PySpark equivalent
[ https://issues.apache.org/jira/browse/SPARK-48545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-48545.
------------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46977
[https://github.com/apache/spark/pull/46977]

> Create to_avro and from_avro SQL functions to match PySpark equivalent
> ----------------------------------------------------------------------
>
> Key: SPARK-48545
> URL: https://issues.apache.org/jira/browse/SPARK-48545
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> The PySpark API is here:
> https://github.com/apache/spark/blob/d5c33c6bfb5757b243fc8e1734daeaa4fe3b9b32/python/pyspark/sql/avro/functions.py#L35
[jira] [Assigned] (SPARK-48545) Create to_avro and from_avro SQL functions to match PySpark equivalent
[ https://issues.apache.org/jira/browse/SPARK-48545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reassigned SPARK-48545:
--------------------------------------
Assignee: Daniel

> Create to_avro and from_avro SQL functions to match PySpark equivalent
> ----------------------------------------------------------------------
>
> Key: SPARK-48545
> URL: https://issues.apache.org/jira/browse/SPARK-48545
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Labels: pull-request-available
>
> The PySpark API is here:
> https://github.com/apache/spark/blob/d5c33c6bfb5757b243fc8e1734daeaa4fe3b9b32/python/pyspark/sql/avro/functions.py#L35
[jira] [Assigned] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries
[ https://issues.apache.org/jira/browse/SPARK-48655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-48655:
--------------------------------
Assignee: Szehon Ho

> SPJ: Add tests for shuffle skipping for aggregate queries
> ----------------------------------------------------------
>
> Key: SPARK-48655
> URL: https://issues.apache.org/jira/browse/SPARK-48655
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Szehon Ho
> Assignee: Szehon Ho
> Priority: Major
> Labels: pull-request-available
[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data
[ https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856832#comment-17856832 ]

Chhavi Bansal commented on SPARK-48463:
---------------------------------------

Thank you for the update.

> MLLib function unable to handle nested data
> -------------------------------------------
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 3.5.1
> Reporter: Chhavi Bansal
> Assignee: Weichen Xu
> Priority: Major
> Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening, but it fails.
>
> {code:scala}
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row(Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
>     .add("longitude", IntegerType)
>     .add("latitude", IntegerType))
>   .add("salary", IntegerType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), structureSchema)
>
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: String = null): Array[Column] = {
>   schema.fields.flatMap(f => {
>     val colName = if (prefix == null) f.name else (prefix + "." + f.name)
>     val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + f.name)
>     f.dataType match {
>       case st: StructType => flattenSchema(st, colName, colnameSelect)
>       case _ => Array(col(colName).as(colnameSelect))
>     }
>   })
> }
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*)
> {code}
> Now using the string indexer on the DOT notation:
>
> {code:scala}
> val si = new StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show()
> {code}
> The above code fails:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "location.longitude" among (location.longitude, location.latitude, salary); did you mean to quote the `location.longitude` column?
>   at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>   at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
>   ...
> {code}
> This points to the same failure as when we try to select dot-notation columns in a Spark dataframe, which is solved using BACKTICKS `column.name`.
> https://stackoverflow.com/a/51430335/11688337
>
> So next, I use the backticks while defining the StringIndexer:
> {code:scala}
> val si = new StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
> {code}
> In this case it again fails (with a different reason) in the StringIndexer code itself:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column `location.longitude` does not exist.
>   at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>   at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> {code}
>
> This blocks me from using feature transformation functions on nested columns. Any help in solving this problem will be highly appreciated.
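A common workaround while this is open (a sketch, not from the ticket; column names such as location_longitude are illustrative): alias the flattened columns with underscores instead of dots, so ML transformers never see a name that looks like a nested field reference.

{code:python}
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [((10, 12), 1000), ((12, 14), 4300), ((37, 891), 1400)],
    "location struct<longitude:int,latitude:int>, salary int",
)

# Flatten with '_' instead of '.' so the result is a plain top-level column.
flat = df.select(
    F.col("location.longitude").alias("location_longitude"),
    F.col("location.latitude").alias("location_latitude"),
    "salary",
)

si = StringIndexer(inputCol="location_longitude", outputCol="longitude_idx")
si.fit(flat).transform(flat).show()
{code}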
[jira] [Assigned] (SPARK-48675) Cache table doesn't work with collated column
[ https://issues.apache.org/jira/browse/SPARK-48675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-48675:
-----------------------------------
Assignee: Nikola Mandic

> Cache table doesn't work with collated column
> ----------------------------------------------
>
> Key: SPARK-48675
> URL: https://issues.apache.org/jira/browse/SPARK-48675
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Assignee: Nikola Mandic
> Priority: Major
>
> The following sequence of queries produces the error:
> {code:java}
> > cache lazy table t as select col from values ('a' collate utf8_lcase) as (col);
> > select col from t;
> org.apache.spark.SparkException: not support type: org.apache.spark.sql.types.StringType@1.
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.notSupportTypeError(QueryExecutionErrors.scala:1069)
>   at org.apache.spark.sql.execution.columnar.ColumnBuilder$.apply(ColumnBuilder.scala:200)
>   at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.$anonfun$next$1(InMemoryRelation.scala:85)
>   at scala.collection.immutable.List.map(List.scala:247)
>   at scala.collection.immutable.List.map(List.scala:79)
>   at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:84)
>   at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:82)
>   at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$2.next(InMemoryRelation.scala:296)
>   at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$2.next(InMemoryRelation.scala:293)
>   ...
> {code}
> This is also a problem with non-lazy cached tables.
[jira] [Resolved] (SPARK-48675) Cache table doesn't work with collated column
[ https://issues.apache.org/jira/browse/SPARK-48675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-48675.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 47045
[https://github.com/apache/spark/pull/47045]

> Cache table doesn't work with collated column
> ----------------------------------------------
>
> Key: SPARK-48675
> URL: https://issues.apache.org/jira/browse/SPARK-48675
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Assignee: Nikola Mandic
> Priority: Major
> Fix For: 4.0.0
[jira] [Updated] (SPARK-48680) Add char/varchar doc to language specific tables
[ https://issues.apache.org/jira/browse/SPARK-48680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48680:
-----------------------------------
Labels: pull-request-available (was: )

> Add char/varchar doc to language specific tables
> -------------------------------------------------
>
> Key: SPARK-48680
> URL: https://issues.apache.org/jira/browse/SPARK-48680
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-48664) Add official image Dockerfile for Apache Spark 4.0.0-preview1
[ https://issues.apache.org/jira/browse/SPARK-48664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48664:
-----------------------------------
Labels: pull-request-available (was: )

> Add official image Dockerfile for Apache Spark 4.0.0-preview1
> ---------------------------------------------------------------
>
> Key: SPARK-48664
> URL: https://issues.apache.org/jira/browse/SPARK-48664
> Project: Spark
> Issue Type: Task
> Components: Spark Docker
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Wenchen Fan
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-48656) ArrayIndexOutOfBoundsException in CartesianRDD getPartitions
[ https://issues.apache.org/jira/browse/SPARK-48656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48656:
-----------------------------------
Labels: pull-request-available (was: )

> ArrayIndexOutOfBoundsException in CartesianRDD getPartitions
> -------------------------------------------------------------
>
> Key: SPARK-48656
> URL: https://issues.apache.org/jira/browse/SPARK-48656
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Nick Young
> Assignee: Wei Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code:scala}
> val rdd1 = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 65536)
> val rdd2 = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 65536)
> rdd2.cartesian(rdd1).partitions
> {code}
> Throws `ArrayIndexOutOfBoundsException: 0` at CartesianRDD.scala:69 because `s1.index * numPartitionsInRdd2 + s2.index` overflows and wraps to 0. We should provide a better error message which indicates that the number of partitions overflows, so it's easier for the user to debug.
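To see the arithmetic: 65536 * 65536 = 2^32, which wraps to 0 in a signed 32-bit integer, so the partitions array ends up with length 0. A quick standalone illustration of the wraparound (a Python sketch, not Spark code):

{code:python}
import ctypes

num_slices = 65536
# 65536 * 65536 == 2**32, which a signed 32-bit Int cannot hold:
total_partitions = ctypes.c_int32(num_slices * num_slices).value
print(total_partitions)  # 0 -> a zero-length partitions array,
                         # so even index 0 is out of bounds
{code}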
[jira] [Updated] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries
[ https://issues.apache.org/jira/browse/SPARK-48655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48655:
-----------------------------------
Labels: pull-request-available (was: )

> SPJ: Add tests for shuffle skipping for aggregate queries
> ----------------------------------------------------------
>
> Key: SPARK-48655
> URL: https://issues.apache.org/jira/browse/SPARK-48655
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Szehon Ho
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-48684) Print related JIRA summary before proceeding merge
[ https://issues.apache.org/jira/browse/SPARK-48684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao resolved SPARK-48684.
------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 47057
[https://github.com/apache/spark/pull/47057]

> Print related JIRA summary before proceeding merge
> --------------------------------------------------
>
> Key: SPARK-48684
> URL: https://issues.apache.org/jira/browse/SPARK-48684
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
> Fix For: 4.0.0
[jira] [Updated] (SPARK-48684) Print related JIRA summary before proceeding merge
[ https://issues.apache.org/jira/browse/SPARK-48684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-48684:
-----------------------------
Component/s: Project Infra
(was: SQL)

> Print related JIRA summary before proceeding merge
> --------------------------------------------------
>
> Key: SPARK-48684
> URL: https://issues.apache.org/jira/browse/SPARK-48684
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Minor
[jira] [Updated] (SPARK-48684) Print related JIRA summary before proceeding merge
[ https://issues.apache.org/jira/browse/SPARK-48684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-48684:
-----------------------------
Priority: Minor
(was: Major)

> Print related JIRA summary before proceeding merge
> --------------------------------------------------
>
> Key: SPARK-48684
> URL: https://issues.apache.org/jira/browse/SPARK-48684
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Minor
[jira] [Created] (SPARK-48684) Print related JIRA summary before proceeding merge
Kent Yao created SPARK-48684:
--------------------------------

Summary: Print related JIRA summary before proceeding merge
Key: SPARK-48684
URL: https://issues.apache.org/jira/browse/SPARK-48684
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao
[jira] [Assigned] (SPARK-48662) Fix StructsToXml expression with collations
[ https://issues.apache.org/jira/browse/SPARK-48662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-48662:
------------------------------------
Assignee: Mihailo Milosevic

> Fix StructsToXml expression with collations
> --------------------------------------------
>
> Key: SPARK-48662
> URL: https://issues.apache.org/jira/browse/SPARK-48662
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Mihailo Milosevic
> Assignee: Mihailo Milosevic
> Priority: Major
[jira] [Resolved] (SPARK-48662) Fix StructsToXml expression with collations
[ https://issues.apache.org/jira/browse/SPARK-48662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48662.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 47053
[https://github.com/apache/spark/pull/47053]

> Fix StructsToXml expression with collations
> --------------------------------------------
>
> Key: SPARK-48662
> URL: https://issues.apache.org/jira/browse/SPARK-48662
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Mihailo Milosevic
> Assignee: Mihailo Milosevic
> Priority: Major
> Fix For: 4.0.0
[jira] [Commented] (SPARK-48463) MLLib function unable to handle nested data
[ https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856713#comment-17856713 ]

Weichen Xu commented on SPARK-48463:
------------------------------------

I will try to do it this sprint. (and then cherrypick it to databricks runtime)

> MLLib function unable to handle nested data
> -------------------------------------------
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 3.5.1
> Reporter: Chhavi Bansal
> Assignee: Weichen Xu
> Priority: Major
> Labels: ML, MLPipelines, mllib, nested
[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors
[ https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot reassigned SPARK-47258:
--------------------------------------
Assignee: Apache Spark

> Assign error classes to SHOW CREATE TABLE errors
> -------------------------------------------------
>
> Key: SPARK-47258
> URL: https://issues.apache.org/jira/browse/SPARK-47258
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Apache Spark
> Priority: Minor
> Labels: pull-request-available, starter
>
> Choose a proper name for the error class _LEGACY_ERROR_TEMP_127[0-5] defined in core/src/main/resources/error/error-classes.json. The name should be short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't exist yet. Check exception fields by using checkError(). That function checks valuable error fields only, and avoids dependencies on the error text message. In this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace the error by an internal error, see SparkException.internalError().
> Improve the error message format in error-classes.json if the current one is not clear. Propose a solution to users for how to avoid and fix such kinds of errors.
> Please look at the PRs below as examples:
> * https://github.com/apache/spark/pull/38685
> * https://github.com/apache/spark/pull/38656
> * https://github.com/apache/spark/pull/38490
[jira] [Assigned] (SPARK-47258) Assign error classes to SHOW CREATE TABLE errors
[ https://issues.apache.org/jira/browse/SPARK-47258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot reassigned SPARK-47258:
--------------------------------------
Assignee: (was: Apache Spark)

> Assign error classes to SHOW CREATE TABLE errors
> -------------------------------------------------
>
> Key: SPARK-47258
> URL: https://issues.apache.org/jira/browse/SPARK-47258
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: pull-request-available, starter
[jira] [Comment Edited] (SPARK-48666) A filter should not be pushed down if it contains Unevaluable expression
[ https://issues.apache.org/jira/browse/SPARK-48666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856433#comment-17856433 ]

Yokesh NK edited comment on SPARK-48666 at 6/21/24 9:11 AM:
------------------------------------------------------------

During `PruneFileSourcePartitions` optimization, the expression `isnotnull(getdata(cast(snapshot_date#2 as string))#30)` is converted into `isnotnull(getdata(cast(input[1, int, true] as string))#30)`. In this case, `getdata` is a Python user-defined function (PythonUDF). However, attempting to evaluate the transformed expression `getdata(cast(input[1, int, true] as string))` fails. As a test, excluding the rule `PruneFileSourcePartitions` lets the execution complete with no issue. So this bug is expected to be fixed in Spark.

was (Author: JIRAUSER302587): the same comment, ending with "So, this bug will be fixed in Spark."

> A filter should not be pushed down if it contains Unevaluable expression
> --------------------------------------------------------------------------
>
> Key: SPARK-48666
> URL: https://issues.apache.org/jira/browse/SPARK-48666
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Zheng
> Priority: Major
>
> We should avoid pushing down Unevaluable expressions, as they can cause unexpected failures. For example, the code snippet below fails (assuming there is a table _t_ with a partition column _p_):
> {code:python}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType
> import pyspark.sql.functions as f
>
> def getdata(p: str) -> str:
>     return "data"
>
> NEW_COLUMN = 'new_column'
> P_COLUMN = 'p'
>
> f_getdata = f.udf(getdata, StringType())
> rows = spark.sql("select * from default.t")
> table = rows.withColumn(NEW_COLUMN, f_getdata(f.col(P_COLUMN)))
> df = table.alias('t1').join(table.alias('t2'), (f.col(f"t1.{NEW_COLUMN}") == f.col(f"t2.{NEW_COLUMN}")), how='inner')
> df.show()
> {code}
> It causes an error like:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: getdata(input[0, string, true])#16
>   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:66)
>   at org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:391)
>   at org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:390)
>   at org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:71)
>   at org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:384)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:166)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:165)
> {code}
[jira] [Updated] (SPARK-48683) Schema evolution with `df.mergeInto` losing `when` clauses
[ https://issues.apache.org/jira/browse/SPARK-48683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pengfei Xu updated SPARK-48683:
-------------------------------
Component/s: SQL
(was: Spark Core)

> Schema evolution with `df.mergeInto` losing `when` clauses
> ------------------------------------------------------------
>
> Key: SPARK-48683
> URL: https://issues.apache.org/jira/browse/SPARK-48683
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Pengfei Xu
> Priority: Major
>
> When calling {{df.mergeInto(...).when...(...).withSchemaEvolution()}}, all {{when}} clauses will be lost.
[jira] [Created] (SPARK-48683) Schema evolution with `df.mergeInto` losing `when` clauses
Pengfei Xu created SPARK-48683:
----------------------------------

Summary: Schema evolution with `df.mergeInto` losing `when` clauses
Key: SPARK-48683
URL: https://issues.apache.org/jira/browse/SPARK-48683
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Pengfei Xu

When calling {{df.mergeInto(...).when...(...).withSchemaEvolution()}}, all {{when}} clauses will be lost.
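For context, a sketch of the call pattern in question (an illustrative example, not from the ticket; it assumes a Spark 4.0 session, a source DataFrame `source_df`, and an existing target table named `target`):

{code:python}
from pyspark.sql import functions as F

# Attach explicit merge clauses, then request schema evolution.
# The reported bug: after withSchemaEvolution(), the previously attached
# whenMatched/whenNotMatched clauses are dropped from the merge.
(source_df.alias("source")
    .mergeInto("target", F.expr("target.id = source.id"))
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .withSchemaEvolution()
    .merge())
{code}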
[jira] [Resolved] (SPARK-48659) Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-48659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-48659.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 47018
[https://github.com/apache/spark/pull/47018]

> Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
> -------------------------------------------------------
>
> Key: SPARK-48659
> URL: https://issues.apache.org/jira/browse/SPARK-48659
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Tests
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-48659) Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
[ https://issues.apache.org/jira/browse/SPARK-48659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-48659:
-----------------------------------
Assignee: BingKun Pan

> Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
> -------------------------------------------------------
>
> Key: SPARK-48659
> URL: https://issues.apache.org/jira/browse/SPARK-48659
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Tests
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
[jira] [Created] (SPARK-48682) Use ICU in InitCap expression (UTF8_BINARY collation)
Uroš Bojanić created SPARK-48682:
------------------------------------

Summary: Use ICU in InitCap expression (UTF8_BINARY collation)
Key: SPARK-48682
URL: https://issues.apache.org/jira/browse/SPARK-48682
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
[jira] [Created] (SPARK-48681) Use ICU in Lower/Upper expressions (UTF8_BINARY collation)
Uroš Bojanić created SPARK-48681:
------------------------------------

Summary: Use ICU in Lower/Upper expressions (UTF8_BINARY collation)
Key: SPARK-48681
URL: https://issues.apache.org/jira/browse/SPARK-48681
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
[jira] [Created] (SPARK-48680) Add char/varchar doc to language specific tables
Kent Yao created SPARK-48680:
--------------------------------

Summary: Add char/varchar doc to language specific tables
Key: SPARK-48680
URL: https://issues.apache.org/jira/browse/SPARK-48680
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao