[jira] [Commented] (SPARK-48992) applyInPandas does not respect streaming watermark
[ https://issues.apache.org/jira/browse/SPARK-48992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868515#comment-17868515 ] Jungtaek Lim commented on SPARK-48992: -- I understand the confusion - but you'll need to use applyInPandasWithState, not applyInPandas. applyInPandas is intended for batch use (in streaming, it is applied independently to each microbatch, with no state carried between batches). > applyInPandas does not respect streaming watermark > -- > > Key: SPARK-48992 > URL: https://issues.apache.org/jira/browse/SPARK-48992 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 > Environment: Azure Databricks runtime 14.3 LTS >Reporter: Richard Swinbank >Priority: Minor > > When I use GroupedData.applyInPandas to implement aggregation in a streaming > query, it fails to respect a watermark specified using > DataFrame.withWatermark. > This query reproduces the behaviour I'm seeing: > > {code:python} > from pyspark.sql.functions import window > from typing import Tuple > import pandas as pd > df_source_stream = ( > spark.readStream > .format("rate") > .option("rowsPerSecond", 3) > .load() > .withColumn("bucket", window("timestamp", "10 seconds").end) > ) > def my_function( > key: Tuple[str], df: pd.DataFrame > ) -> pd.DataFrame: > return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]}) > df = ( > df_source_stream > .withWatermark("bucket", "10 seconds") > .groupBy("bucket") > .applyInPandas(my_function, "bucket TIMESTAMP, count INT") > ) > display(df) > {code} > I expect the output of the query to contain one row per {{bucket}} value, but > a new row is emitted for each incoming microbatch. > In contrast, an out of the box aggregate behaves as expected. For example: > {code:python} > df = ( > df_source_stream > .withWatermark("bucket", "10 seconds") > .groupBy("bucket") > .count() # standard aggregate in place of applyInPandas > ) > {code} > The output of this query contains *one* row per {{bucket}} value. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
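For concreteness, here is a minimal, untested sketch of the suggested applyInPandasWithState approach, reusing the {{df_source_stream}} rate source from the report. The state schema, output mode, and event-time timeout policy (timing out as soon as the watermark next advances) are simplifying assumptions, not a confirmed recipe:

{code:python}
from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_per_bucket(
    key: Tuple, pdfs: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    if state.hasTimedOut:
        # The watermark has passed this bucket's timeout: emit once,
        # then drop the state so the group is finalized.
        (count,) = state.get
        state.remove()
        yield pd.DataFrame({"bucket": [key[0]], "count": [count]})
    else:
        count = state.get[0] if state.exists else 0
        for pdf in pdfs:
            count += len(pdf)
        state.update((count,))
        # Simplification: request a timeout as soon as the watermark advances.
        state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 1)

df = (
    df_source_stream
    .withWatermark("bucket", "10 seconds")
    .groupBy("bucket")
    .applyInPandasWithState(
        count_per_bucket,
        outputStructType="bucket TIMESTAMP, count INT",
        stateStructType="count INT",
        outputMode="append",
        timeoutConf=GroupStateTimeout.EventTimeTimeout,
    )
)
{code}

Unlike plain applyInPandas, the per-group state here persists across microbatches, which is what makes a single final row per bucket possible.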
[jira] [Updated] (SPARK-48996) Allow bare literals for __and__ and __or__ of Column
[ https://issues.apache.org/jira/browse/SPARK-48996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48996: --- Labels: pull-request-available (was: ) > Allow bare literals for __and__ and __or__ of Column > > > Key: SPARK-48996 > URL: https://issues.apache.org/jira/browse/SPARK-48996 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48997) Maintenance thread pool error should not cause the entire executor to crash
Neil Ramaswamy created SPARK-48997: -- Summary: Maintenance thread pool error should not cause the entire executor to crash Key: SPARK-48997 URL: https://issues.apache.org/jira/browse/SPARK-48997 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Neil Ramaswamy Today, it's possible for an exception within a thread in the maintenance pool to cause the entire executor to crash. Here's how: # An error occurs in a maintenance pool thread # It gets passed to the maintenance task thread, which `throw`s it # That gets caught by `onError`, which `.stop()`s the maintenance thread pool # If any of the maintenance pool threads are waiting on a lock, they will receive an `InterruptedException` (this happens if they are verifying whether their state store instance is active) # This `InterruptedException`, which is not `NonFatal`, is not caught # The uncaught exception bubbles all the way to the `SparkUncaughtExceptionHandler`, causing the executor to exit A better fix is to modify the maintenance thread pool to only `unload` providers that experience errors, rather than stopping the entire thread pool. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
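The proposed isolation is easiest to see in a small sketch. This is not Spark's code (the real change belongs in the JVM-side state store maintenance path); it is a Python illustration with invented names, showing a failing task unloading only its own provider instead of stopping the pool:

{code:python}
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for state store provider objects.
class Provider:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def do_maintenance(self):
        if not self.healthy:
            raise RuntimeError(f"maintenance failed for {self.name}")

loaded = {p.name: p for p in [Provider("a"), Provider("b", healthy=False)]}

def maintain(provider):
    try:
        provider.do_maintenance()
    except Exception:
        # Isolate the failure: unload only this provider, rather than
        # stopping the whole pool, which interrupts sibling threads and
        # can escalate into a fatal uncaught InterruptedException.
        loaded.pop(provider.name, None)

with ThreadPoolExecutor(max_workers=2) as pool:
    for p in list(loaded.values()):
        pool.submit(maintain, p)

print(sorted(loaded))  # ['a'] -- only the healthy provider stays loaded
{code}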
[jira] [Created] (SPARK-48996) Allow bare literals for __and__ and __or__ of Column
Takuya Ueshin created SPARK-48996: - Summary: Allow bare literals for __and__ and __or__ of Column Key: SPARK-48996 URL: https://issues.apache.org/jira/browse/SPARK-48996 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
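The summary suggests that bare Python boolean literals should combine directly with {{Column}} via {{&}} and {{|}}. A short before/after sketch of the presumed usage (assuming an active {{spark}} session; the second form is what the ticket appears to propose, so it may fail on builds without the change):

{code:python}
from pyspark.sql import functions as F

df = spark.range(5).withColumn("flag", F.col("id") % 2 == 0)

# Works today: the literal is wrapped explicitly.
df.filter(df.flag & F.lit(True)).show()

# What the ticket appears to propose: accept the bare literal directly.
df.filter(df.flag & True).show()
{code}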
[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868501#comment-17868501 ] psyren99 commented on SPARK-48937: -- [~uros-db] still working, should have it done in a couple days > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour is for this > function when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result, under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in SparkSQL, and feel > free to use your chosen Spark SQL Editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use-cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
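A quick way to probe the behaviour described above from a SQL editor (the {{COLLATE}} expression syntax is per Spark 4.0; the second query is the one expected to return the corrected result once the fix lands):

{code:python}
# Default (binary) collation: delimiters must match exactly.
spark.sql("SELECT str_to_map('a:1,b:2,c:3', ',', ':')").show(truncate=False)

# Under UTF8_LCASE, the uppercase delimiters 'X'/'Y' should match the
# lowercase 'x'/'y' in the input once StringToMap is collation-aware.
spark.sql(
    "SELECT str_to_map('ay1xby2xcy3' COLLATE UTF8_LCASE, 'X', 'Y')"
).show(truncate=False)
{code}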
[jira] [Commented] (SPARK-48995) Column.endswith(None) occasionally causes NPE
[ https://issues.apache.org/jira/browse/SPARK-48995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868494#comment-17868494 ] Mithun Radhakrishnan commented on SPARK-48995: -- To the untrained eye, it would appear that the {{None}} argument to {{Column.endswith()}} turns up as a null-Column-reference in {{Column.fn}}. > Column.endswith(None) occasionally causes NPE > - > > Key: SPARK-48995 > URL: https://issues.apache.org/jira/browse/SPARK-48995 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 > Environment: Tested from {{pyspark}} shell, on Apache Spark 4.0. >Reporter: Mithun Radhakrishnan >Priority: Major > > This one is pretty hard to repro, since it only seems to happen occasionally. > Invoking `Column.endswith()` seems to result in an NPE, with Spark 4.0: > {code:python} > from pyspark.sql.types import * > import pyspark.sql.functions as f > schema = StructType([StructField("s", StringType(), True)]) > strings = [Row("abc"), Row("bcd"), Row(None)] > df = sc.parallelize(strings).toDF(schema) > df.select( f.col('s').endswith(None) ).collect() > {code} > Here is the resulting stack trace: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o205.endsWith. > : java.lang.NullPointerException: Cannot invoke > "org.apache.spark.sql.Column.expr()" because "x$1" is null > at org.apache.spark.sql.Column$.$anonfun$fn$2(Column.scala:77) > at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75) > at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35) > at org.apache.spark.sql.Column$.$anonfun$fn$1(Column.scala:77) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) > at org.apache.spark.sql.package$.withOrigin(package.scala:111) > at org.apache.spark.sql.Column$.fn(Column.scala:76) > at org.apache.spark.sql.Column$.fn(Column.scala:64) > at org.apache.spark.sql.Column.fn(Column.scala:169) > at org.apache.spark.sql.Column.endsWith(Column.scala:1078) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:840) > {code} > This seems to point to {{Column::fn}}, which looks new to Spark 4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48995) Column.endswith(None) occasionally causes NPE
Mithun Radhakrishnan created SPARK-48995: Summary: Column.endswith(None) occasionally causes NPE Key: SPARK-48995 URL: https://issues.apache.org/jira/browse/SPARK-48995 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 4.0.0 Environment: Tested from {{pyspark}} shell, on Apache Spark 4.0. Reporter: Mithun Radhakrishnan This one is pretty hard to repro, since it only seems to happen occasionally. Invoking `Column.endswith()` seems to result in an NPE, with Spark 4.0: {code:python} from pyspark.sql.types import * import pyspark.sql.functions as f schema = StructType([StructField("s", StringType(), True)]) strings = [Row("abc"), Row("bcd"), Row(None)] df = sc.parallelize(strings).toDF(schema) df.select( f.col('s').endswith(None) ).collect() {code} Here is the resulting stack trace: {code} py4j.protocol.Py4JJavaError: An error occurred while calling o205.endsWith. : java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.Column.expr()" because "x$1" is null at org.apache.spark.sql.Column$.$anonfun$fn$2(Column.scala:77) at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75) at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35) at org.apache.spark.sql.Column$.$anonfun$fn$1(Column.scala:77) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) at org.apache.spark.sql.package$.withOrigin(package.scala:111) at org.apache.spark.sql.Column$.fn(Column.scala:76) at org.apache.spark.sql.Column$.fn(Column.scala:64) at org.apache.spark.sql.Column.fn(Column.scala:169) at org.apache.spark.sql.Column.endsWith(Column.scala:1078) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Thread.java:840) {code} This seems to point to {{Column::fn}}, which looks new to Spark 4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
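Until the root cause is fixed, one plausible (untested) caller-side workaround is to route a Python {{None}} through {{lit()}} so the JVM receives a null string literal rather than a null Column reference; {{endswith_safe}} below is a hypothetical helper, not a PySpark API:

{code:python}
import pyspark.sql.functions as f
from pyspark.sql import Column

def endswith_safe(col: Column, suffix) -> Column:
    # Hypothetical workaround: replace a bare Python None with an explicit
    # null string literal before it crosses into the JVM.
    if suffix is None:
        suffix = f.lit(None).cast("string")
    return col.endswith(suffix)

# df.select(endswith_safe(f.col("s"), None)).collect()
{code}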
[jira] [Commented] (SPARK-48993) Maximum number of maxRecursiveFieldDepth should be a spark conf
[ https://issues.apache.org/jira/browse/SPARK-48993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868457#comment-17868457 ] Wei Liu commented on SPARK-48993: - I'll follow up on this > Maximum number of maxRecursiveFieldDepth should be a spark conf > --- > > Key: SPARK-48993 > URL: https://issues.apache.org/jira/browse/SPARK-48993 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major > > [https://github.com/apache/spark/pull/38922#discussion_r1051294998] > > There is no reason to hard code a 10 here -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48993) Maximum number of maxRecursiveFieldDepth should be a spark conf
Wei Liu created SPARK-48993: --- Summary: Maximum number of maxRecursiveFieldDepth should be a spark conf Key: SPARK-48993 URL: https://issues.apache.org/jira/browse/SPARK-48993 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wei Liu [https://github.com/apache/spark/pull/38922#discussion_r1051294998] There is no reason to hard code a 10 here -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48992) applyInPandas does not respect streaming watermark
[ https://issues.apache.org/jira/browse/SPARK-48992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Swinbank updated SPARK-48992: - Description: When I use GroupedData.applyInPandas to implement aggregation in a streaming query, it fails to respect a watermark specified using DataFrame.withWatermark. This query reproduces the behaviour I'm seeing: {code:python} from pyspark.sql.functions import window from typing import Tuple import pandas as pd df_source_stream = ( spark.readStream .format("rate") .option("rowsPerSecond", 3) .load() .withColumn("bucket", window("timestamp", "10 seconds").end) ) def my_function( key: Tuple[str], df: pd.DataFrame ) -> pd.DataFrame: return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]}) df = ( df_source_stream .withWatermark("bucket", "10 seconds") .groupBy("bucket") .applyInPandas(my_function, "bucket TIMESTAMP, count INT") ) display(df) {code} I expect the output of the query to contain one row per {{bucket}} value, but a new row is emitted for each incoming microbatch. In contrast, an out of the box aggregate behaves as expected. For example: {code:python} df = ( df_source_stream .withWatermark("bucket", "10 seconds") .groupBy("bucket") .count() # standard aggregate in place of applyInPandas ) {code} The output of this query contains *one* row per {{bucket}} value. was: When I use GroupedData.applyInPandas to implement aggregation in a streaming query, it fails to respect a watermark specified using DataFrame.withWatermark. This query reproduces the behvaiour I'm seeing: {code:python} from pyspark.sql.functions import window from typing import Tuple import pandas as pd df_source_stream = ( spark.readStream .format("rate") .option("rowsPerSecond", 3) .load() .withColumn("bucket", window("timestamp", "10 seconds").end) ) def my_function( key: Tuple[str], df: pd.DataFrame ) -> pd.DataFrame: return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]}) df = ( df_source_stream .withWatermark("bucket", "10 seconds") .groupBy("bucket") .applyInPandas(my_function, "bucket TIMESTAMP, count INT") ) display(df) {code} I expect the output of the query to contain one row per {{bucket}} value, but a new row is emitted for each incoming microbatch. In contrast, an out of the box aggregate behaves as expected. For example: {code:python} df = ( df_source_stream .withWatermark("bucket", "10 seconds") .groupBy("bucket") .count() # standard aggregate in place of applyInPandas ) {code} The output of this query contains *one* row per {{bucket}} value. > applyInPandas does not respect streaming watermark > -- > > Key: SPARK-48992 > URL: https://issues.apache.org/jira/browse/SPARK-48992 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 > Environment: Azure Databricks runtime 14.3 LTS >Reporter: Richard Swinbank >Priority: Minor > > When I use GroupedData.applyInPandas to implement aggregation in a streaming > query, it fails to respect a watermark specified using > DataFrame.withWatermark. 
> This query reproduces the behaviour I'm seeing: > > {code:python} > from pyspark.sql.functions import window > from typing import Tuple > import pandas as pd > df_source_stream = ( > spark.readStream > .format("rate") > .option("rowsPerSecond", 3) > .load() > .withColumn("bucket", window("timestamp", "10 seconds").end) > ) > def my_function( > key: Tuple[str], df: pd.DataFrame > ) -> pd.DataFrame: > return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]}) > df = ( > df_source_stream > .withWatermark("bucket", "10 seconds") > .groupBy("bucket") > .applyInPandas(my_function, "bucket TIMESTAMP, count INT") > ) > display(df) > {code} > I expect the output of the query to contain one row per {{bucket}} value, but > a new row is emitted for each incoming microbatch. > In contrast, an out of the box aggregate behaves as expected. For example: > {code:python} > df = ( > df_source_stream > .withWatermark("bucket", "10 seconds") > .groupBy("bucket") > .count() # standard aggregate in place of applyInPandas > ) > {code} > The output of this query contains *one* row per {{bucket}} value. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48992) applyInPandas does not respect streaming watermark
Richard Swinbank created SPARK-48992: Summary: applyInPandas does not respect streaming watermark Key: SPARK-48992 URL: https://issues.apache.org/jira/browse/SPARK-48992 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Environment: Azure Databricks runtime 14.3 LTS Reporter: Richard Swinbank When I use GroupedData.applyInPandas to implement aggregation in a streaming query, it fails to respect a watermark specified using DataFrame.withWatermark. This query reproduces the behaviour I'm seeing: {code:python} from pyspark.sql.functions import window from typing import Tuple import pandas as pd df_source_stream = ( spark.readStream .format("rate") .option("rowsPerSecond", 3) .load() .withColumn("bucket", window("timestamp", "10 seconds").end) ) def my_function( key: Tuple[str], df: pd.DataFrame ) -> pd.DataFrame: return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]}) df = ( df_source_stream .withWatermark("bucket", "10 seconds") .groupBy("bucket") .applyInPandas(my_function, "bucket TIMESTAMP, count INT") ) display(df) {code} I expect the output of the query to contain one row per {{bucket}} value, but a new row is emitted for each incoming microbatch. In contrast, an out of the box aggregate behaves as expected. For example: {code:python} df = ( df_source_stream .withWatermark("bucket", "10 seconds") .groupBy("bucket") .count() # standard aggregate in place of applyInPandas ) {code} The output of this query contains *one* row per {{bucket}} value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47677) Pandas circular import error in Python 3.10
[ https://issues.apache.org/jira/browse/SPARK-47677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868384#comment-17868384 ] Mikó Szilárd commented on SPARK-47677: -- Hi [~XinrongM], Is it possible that this change fixed the problem? [https://github.com/apache/spark/pull/45832] > Pandas circular import error in Python 3.10 > > > Key: SPARK-47677 > URL: https://issues.apache.org/jira/browse/SPARK-47677 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Xinrong Meng >Priority: Major > > {{AttributeError: partially initialized module 'pandas' has no attribute > '_pandas_datetime_CAPI' (most likely due to a circular import)}} > > The above error appears in multiple tests with Python 3.10. > Python 3.11, 3.12 and pypy3 don't have the issue. > > See [https://github.com/apache/spark/actions/runs/8469356110/job/23208894575] > for details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47865) Deflaky PythonForeachWriterSuite
[ https://issues.apache.org/jira/browse/SPARK-47865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868365#comment-17868365 ] Mikó Szilárd commented on SPARK-47865: -- Hi [~dongjoon] , Is this a duplicate of https://issues.apache.org/jira/browse/SPARK-47866? > Deflaky PythonForeachWriterSuite > > > Key: SPARK-47865 > URL: https://issues.apache.org/jira/browse/SPARK-47865 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path
[ https://issues.apache.org/jira/browse/SPARK-48991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48991: - Fix Version/s: 3.5.3 (was: 3.5.2) > FileStreamSink.hasMetadata handles invalid path > --- > > Key: SPARK-48991 > URL: https://issues.apache.org/jira/browse/SPARK-48991 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session
[ https://issues.apache.org/jira/browse/SPARK-48988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48988. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47467 [https://github.com/apache/spark/pull/47467] > Make DefaultParamsReader/Writer handle metadata with spark session > -- > > Key: SPARK-48988 > URL: https://issues.apache.org/jira/browse/SPARK-48988 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session
[ https://issues.apache.org/jira/browse/SPARK-48988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48988: --- Labels: pull-request-available (was: ) > Make DefaultParamsReader/Writer handle metadata with spark session > -- > > Key: SPARK-48988 > URL: https://issues.apache.org/jira/browse/SPARK-48988 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session
[ https://issues.apache.org/jira/browse/SPARK-48988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48988: Assignee: Ruifeng Zheng > Make DefaultParamsReader/Writer handle metadata with spark session > -- > > Key: SPARK-48988 > URL: https://issues.apache.org/jira/browse/SPARK-48988 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48833) Support variant in `InMemoryTableScan`
[ https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48833. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47252 [https://github.com/apache/spark/pull/47252] > Support variant in `InMemoryTableScan` > -- > > Key: SPARK-48833 > URL: https://issues.apache.org/jira/browse/SPARK-48833 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Richard Chen >Assignee: Richard Chen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, df.cache() does not support tables with variant types. We should > allow for support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
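A minimal check of the behaviour this change enables, assuming a Spark 4.0 build that includes it ({{parse_json}} produces a VARIANT column, and the cached plan then goes through {{InMemoryTableScan}}):

{code:python}
df = spark.sql("""SELECT parse_json('{"a": 1, "b": [2, 3]}') AS v""")
df.cache()          # previously unsupported for variant-typed columns
df.show(truncate=False)
{code}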
[jira] [Assigned] (SPARK-48833) Support variant in `InMemoryTableScan`
[ https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48833: --- Assignee: Richard Chen > Support variant in `InMemoryTableScan` > -- > > Key: SPARK-48833 > URL: https://issues.apache.org/jira/browse/SPARK-48833 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Richard Chen >Assignee: Richard Chen >Priority: Major > Labels: pull-request-available > > Currently, df.cache() does not support tables with variant types. We should > allow for support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress
[ https://issues.apache.org/jira/browse/SPARK-48567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48567: Assignee: Hyukjin Kwon > Pyspark StreamingQuery lastProgress and friend should return actual > StreamingQueryProgress > -- > > Key: SPARK-48567 > URL: https://issues.apache.org/jira/browse/SPARK-48567 > Project: Spark > Issue Type: New Feature > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress
[ https://issues.apache.org/jira/browse/SPARK-48567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48567. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47470 [https://github.com/apache/spark/pull/47470] > Pyspark StreamingQuery lastProgress and friend should return actual > StreamingQueryProgress > -- > > Key: SPARK-48567 > URL: https://issues.apache.org/jira/browse/SPARK-48567 > Project: Spark > Issue Type: New Feature > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path
[ https://issues.apache.org/jira/browse/SPARK-48991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48991: --- Labels: pull-request-available (was: ) > FileStreamSink.hasMetadata handles invalid path > --- > > Key: SPARK-48991 > URL: https://issues.apache.org/jira/browse/SPARK-48991 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path
Kent Yao created SPARK-48991: Summary: FileStreamSink.hasMetadata handles invalid path Key: SPARK-48991 URL: https://issues.apache.org/jira/browse/SPARK-48991 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.3, 3.5.1, 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48338) Sql Scripting support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48338. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47404 [https://github.com/apache/spark/pull/47404] > Sql Scripting support for Spark SQL > --- > > Key: SPARK-48338 > URL: https://issues.apache.org/jira/browse/SPARK-48338 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - > OSS.pdf > > > Design doc for this feature is in attachment. > High level example of Sql Script: > ``` > BEGIN > DECLARE c INT = 10; > WHILE c > 0 DO > INSERT INTO tscript VALUES (c); > SET c = c - 1; > END WHILE; > END > ``` > High level motivation behind this feature: > SQL Scripting gives customers the ability to develop complex ETL and analysis > entirely in SQL. Until now, customers have had to write verbose SQL > statements or combine SQL + Python to efficiently write business logic. > Coming from another system, customers have to choose whether or not they want > to migrate to pyspark. Some customers end up not using Spark because of this > gap. SQL Scripting is a key milestone towards enabling SQL practitioners to > write sophisticated queries, without the need to use pyspark. Further, SQL > Scripting is a necessary step towards support for SQL Stored Procedures, and > along with SQL Variables (released) and Temp Tables (in progress), will allow > for more seamless data warehouse migrations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48338: --- Assignee: Aleksandar Tomic > Sql Scripting support for Spark SQL > --- > > Key: SPARK-48338 > URL: https://issues.apache.org/jira/browse/SPARK-48338 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - > OSS.pdf > > > Design doc for this feature is in attachment. > High level example of Sql Script: > ``` > BEGIN > DECLARE c INT = 10; > WHILE c > 0 DO > INSERT INTO tscript VALUES (c); > SET c = c - 1; > END WHILE; > END > ``` > High level motivation behind this feature: > SQL Scripting gives customers the ability to develop complex ETL and analysis > entirely in SQL. Until now, customers have had to write verbose SQL > statements or combine SQL + Python to efficiently write business logic. > Coming from another system, customers have to choose whether or not they want > to migrate to pyspark. Some customers end up not using Spark because of this > gap. SQL Scripting is a key milestone towards enabling SQL practitioners to > write sophisticated queries, without the need to use pyspark. Further, SQL > Scripting is a necessary step towards support for SQL Stored Procedures, and > along with SQL Variables (released) and Temp Tables (in progress), will allow > for more seamless data warehouse migrations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48990) Unified variable related SQL syntax keywords
[ https://issues.apache.org/jira/browse/SPARK-48990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48990: --- Labels: pull-request-available (was: ) > Unified variable related SQL syntax keywords > > > Key: SPARK-48990 > URL: https://issues.apache.org/jira/browse/SPARK-48990 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48935) Make `checkEvaluation` directly check the `Collation` expression itself in UT
[ https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48935. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47401 [https://github.com/apache/spark/pull/47401] > Make `checkEvaluation` directly check the `Collation` expression itself in UT > -- > > Key: SPARK-48935 > URL: https://issues.apache.org/jira/browse/SPARK-48935 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48990) Unified variable related SQL syntax keywords
BingKun Pan created SPARK-48990: --- Summary: Unified variable related SQL syntax keywords Key: SPARK-48990 URL: https://issues.apache.org/jira/browse/SPARK-48990 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
[ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-48956: --- Affects Version/s: 3.2.3 3.2.2 3.1.3 > Spark Repartition Task Field Retry Cause Data Duplication > - > > Key: SPARK-48956 > URL: https://issues.apache.org/jira/browse/SPARK-48956 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.1.3, 3.2.1, 3.2.2, 3.2.3 >Reporter: xuanzhiang >Priority: Major > Attachments: image-2024-07-21-18-21-33-888.png, > image-2024-07-21-18-22-04-665.png, image-2024-07-22-10-00-45-793.png, > image-2024-07-22-14-47-50-773.png > > > This issue seems similar to > [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
[ https://issues.apache.org/jira/browse/SPARK-48956 ] xuanzhiang deleted comment on SPARK-48956: was (Author: JIRAUSER295364): Metric info error. Actual output was 35351985, but got duplicate data. I will try to reproduce the problem and provide use cases > Spark Repartition Task Field Retry Cause Data Duplication > - > > Key: SPARK-48956 > URL: https://issues.apache.org/jira/browse/SPARK-48956 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.1.3, 3.2.1, 3.2.2, 3.2.3 >Reporter: xuanzhiang >Priority: Major > Attachments: image-2024-07-21-18-21-33-888.png, > image-2024-07-21-18-22-04-665.png, image-2024-07-22-10-00-45-793.png, > image-2024-07-22-14-47-50-773.png > > > This issue seems similar to > [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX
[ https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated SPARK-48989: - Environment: This was tested from the {{spark-shell}}, in local mode. All Spark versions were run with default settings. Spark 4.0 SNAPSHOT: Exception. Spark 4.0 Preview: Exception. Spark 3.5.1: Success. was: This was tested from the {{spark-shell}}, in local mode. All environments were run with default settings. Spark 4.0 SNAPSHOT: Exception. Spark 4.0 Preview: Exception. Spark 3.5.1: Success. > WholeStageCodeGen error resulting in NumberFormatException when calling > SUBSTRING_INDEX > --- > > Key: SPARK-48989 > URL: https://issues.apache.org/jira/browse/SPARK-48989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: This was tested from the {{spark-shell}}, in local mode. > All Spark versions were run with default settings. > Spark 4.0 SNAPSHOT: Exception. > Spark 4.0 Preview: Exception. > Spark 3.5.1: Success. >Reporter: Mithun Radhakrishnan >Priority: Major > > One seems to run into a {{NumberFormatException}}, possibly from an error in > WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus: > {code:scala} > // Create integer table with one null. > sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) > ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable") > // Exercise substring-index. > sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM > PARQUET.`/tmp/mytable` ").show() > {code} > On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following > exception: > {code} > java.lang.NumberFormatException: For input string: "columnartorow_value_0" > at > java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) > at java.base/java.lang.Integer.parseInt(Integer.java:668) > at > org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868) > at > org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) > at > org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) > at > org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162) > at > org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74) > at scala.collection.immutable.List.map(List.scala:247) > at scala.collection.immutable.List.map(List.scala:79) > at > org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74) > at > org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200) > at > org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68) > at > org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99) > {code} > The same query seems to run alright on Spark 3.5.x: > {code} > ++-+ > | num| subs| > ++-+ > | 1|a| > | 2| a_a| > | 3|a_a_a| > |NULL| NULL| > ++-+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX
[ https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated SPARK-48989: - Environment: This was tested from the {{spark-shell}}, in local mode. All environments were run with default settings. Spark 4.0 SNAPSHOT: Exception. Spark 4.0 Preview: Exception. Spark 3.5.1: Success. was: This was tested from the {{spark-shell}}, in local mode. Spark 4.0 SNAPSHOT: Exception. Spark 4.0 Preview: Exception. Spark 3.5.1: Success. > WholeStageCodeGen error resulting in NumberFormatException when calling > SUBSTRING_INDEX > --- > > Key: SPARK-48989 > URL: https://issues.apache.org/jira/browse/SPARK-48989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: This was tested from the {{spark-shell}}, in local mode. > All environments were run with default settings. > Spark 4.0 SNAPSHOT: Exception. > Spark 4.0 Preview: Exception. > Spark 3.5.1: Success. >Reporter: Mithun Radhakrishnan >Priority: Major > > One seems to run into a {{NumberFormatException}}, possibly from an error in > WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus: > {code:scala} > // Create integer table with one null. > sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) > ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable") > // Exercise substring-index. > sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM > PARQUET.`/tmp/mytable` ").show() > {code} > On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following > exception: > {code} > java.lang.NumberFormatException: For input string: "columnartorow_value_0" > at > java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) > at java.base/java.lang.Integer.parseInt(Integer.java:668) > at > org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868) > at > org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) > at > org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) > at > org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162) > at > org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74) > at scala.collection.immutable.List.map(List.scala:247) > at scala.collection.immutable.List.map(List.scala:79) > at > org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085) > at > 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74) > at > org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200) > at > org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68) > at > org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99) > {code} > The same query seems to run alright on Spark 3.5.x: > {code} > ++-+ > | num| subs| > ++-+ > | 1|a| > | 2| a_a| > | 3|a_a_a| > |NULL| NULL| > ++-+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX
[ https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated SPARK-48989: - Description: I seem to be running into a {{NumberFormatException}}, possibly from an error in WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus: {code:scala} // Create integer table with one null. sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable") // Exercise substring-index. sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable` ").show() {code} On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following exception: {code} java.lang.NumberFormatException: For input string: "columnartorow_value_0" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) at java.base/java.lang.Integer.parseInt(Integer.java:668) at org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868) at org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) at org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162) at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74) at scala.collection.immutable.List.map(List.scala:247) at scala.collection.immutable.List.map(List.scala:79) at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74) at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200) at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153) at org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68) at org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193) at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99) {code} The same query seems to run alright on Spark 3.5.x: {code} ++-+ | num| subs| ++-+ | 1|a| | 2| a_a| | 3|a_a_a| |NULL| NULL| ++-+ {code} was: I seem to be running into a `NumberFormatException`, possibly from an error in WholeStageCodeGen, when I exercise `SUBSTRING_INDEX` with a null row, thus: {code:scala} // Create integer table with one null. 
sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable") // Exercise substring-index. sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable` ").show() {code} On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following exception: {code} java.lang.NumberFormatException: For input string: "columnartorow_value_0" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) at java.base/java.lang.Integer.parseInt(Integer.java:668) at org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868) at org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) at scala.Option.getOrElse(Optio
[jira] [Created] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX
Mithun Radhakrishnan created SPARK-48989: Summary: WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX Key: SPARK-48989 URL: https://issues.apache.org/jira/browse/SPARK-48989 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Environment: This was tested from the {{spark-shell}}, in local mode. Spark 4.0 SNAPSHOT: Exception. Spark 4.0 Preview: Exception. Spark 3.5.1: Success. Reporter: Mithun Radhakrishnan I seem to be running into a `NumberFormatException`, possibly from an error in WholeStageCodeGen, when I exercise `SUBSTRING_INDEX` with a null row, thus: {code:scala} // Create integer table with one null. sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable") // Exercise substring-index. sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable` ").show() {code} On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following exception: {code} java.lang.NumberFormatException: For input string: "columnartorow_value_0" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) at java.base/java.lang.Integer.parseInt(Integer.java:668) at org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868) at org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) at org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162) at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74) at scala.collection.immutable.List.map(List.scala:247) at scala.collection.immutable.List.map(List.scala:79) at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74) at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200) at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153) at org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68) at org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193) at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99) {code} The same query seems to run alright on Spark 3.5.x: {code} ++-+ | num| subs| ++-+ | 1|a| | 2| a_a| | 3|a_a_a| |NULL| NULL| ++-+ {code} -- This message was sent by 
Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48961) Make the parameter naming of PySparkException consistent with JVM
[ https://issues.apache.org/jira/browse/SPARK-48961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee resolved SPARK-48961. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47436 [https://github.com/apache/spark/pull/47436] > Make the parameter naming of PySparkException consistent with JVM > - > > Key: SPARK-48961 > URL: https://issues.apache.org/jira/browse/SPARK-48961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Parameter naming of PySparkException <> SparkException is different, so there > is an inconsistency when searching error logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48961) Make the parameter naming of PySparkException consistent with JVM
[ https://issues.apache.org/jira/browse/SPARK-48961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee reassigned SPARK-48961: --- Assignee: Haejoon Lee > Make the parameter naming of PySparkException consistent with JVM > - > > Key: SPARK-48961 > URL: https://issues.apache.org/jira/browse/SPARK-48961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Parameter naming of PySparkException <> SparkException is different, so there > is an inconsistency when searching error logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48931) Reduce Cloud Store List API cost for state store maintenance task
[ https://issues.apache.org/jira/browse/SPARK-48931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48931. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47393 [https://github.com/apache/spark/pull/47393] > Reduce Cloud Store List API cost for state store maintenance task > - > > Key: SPARK-48931 > URL: https://issues.apache.org/jira/browse/SPARK-48931 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.3 >Reporter: Riya Verma >Assignee: Riya Verma >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, during the state store maintenance process, we find which old > version files of the RocksDB state store to delete by listing all existing > snapshotted version files in the checkpoint directory every 1 minute by > default. The frequent list calls in the cloud can result in high costs. To > address this concern and reduce the cost associated with state store > maintenance, we should aim to minimize the frequency of listing object stores > inside the maintenance task. To minimize the frequency, we will try to > accumulate versions to delete and only call list when the number of versions > to delete reaches a configured threshold. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
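To illustrate the accumulate-then-list idea described above, here is a minimal, hypothetical sketch; class, method, and file-naming details are made up, not taken from the actual patch:

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: batch up stale snapshot versions and issue the
// expensive cloud LIST call only once a configured threshold is reached.
class MaintenanceSketch(minVersionsToDelete: Int) {
  private val versionsToDelete = ArrayBuffer.empty[Long]

  def markStale(version: Long): Unit = versionsToDelete += version

  def maybeCleanup(listSnapshotFiles: () => Seq[String],
                   deleteFile: String => Unit): Unit = {
    if (versionsToDelete.size >= minVersionsToDelete) {
      val files = listSnapshotFiles() // one LIST call, not one per minute
      val stale = versionsToDelete.toSet
      files.filter(f => stale.exists(v => f.contains(s"$v.zip"))).foreach(deleteFile)
      versionsToDelete.clear()
    }
  }
}
{code}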
[jira] [Assigned] (SPARK-48931) Reduce Cloud Store List API cost for state store maintenance task
[ https://issues.apache.org/jira/browse/SPARK-48931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48931: Assignee: Riya Verma > Reduce Cloud Store List API cost for state store maintenance task > - > > Key: SPARK-48931 > URL: https://issues.apache.org/jira/browse/SPARK-48931 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.3 >Reporter: Riya Verma >Assignee: Riya Verma >Priority: Major > Labels: pull-request-available > > Currently, during the state store maintenance process, we find which old > version files of the RocksDB state store to delete by listing all existing > snapshotted version files in the checkpoint directory every 1 minute by > default. The frequent list calls in the cloud can result in high costs. To > address this concern and reduce the cost associated with state store > maintenance, we should aim to minimize the frequency of listing object stores > inside the maintenance task. To minimize the frequency, we will try to > accumulate versions to delete and only call list when the number of versions > to delete reaches a configured threshold. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session
Ruifeng Zheng created SPARK-48988: - Summary: Make DefaultParamsReader/Writer handle metadata with spark session Key: SPARK-48988 URL: https://issues.apache.org/jira/browse/SPARK-48988 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
[ https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48975. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47459 [https://github.com/apache/spark/pull/47459] > Remove unnecessary `ScalaReflectionLock` definition from `protobuf` > --- > > Key: SPARK-48975 > URL: https://issues.apache.org/jira/browse/SPARK-48975 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
[ https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48975: - Assignee: Yang Jie > Remove unnecessary `ScalaReflectionLock` definition from `protobuf` > --- > > Key: SPARK-48975 > URL: https://issues.apache.org/jira/browse/SPARK-48975 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48976) Improve the docs related to `variable`
[ https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48976. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47460 [https://github.com/apache/spark/pull/47460] > Improve the docs related to `variable` > -- > > Key: SPARK-48976 > URL: https://issues.apache.org/jira/browse/SPARK-48976 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48981) Fix pyspark simpleString method for collations
[ https://issues.apache.org/jira/browse/SPARK-48981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48981. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47463 [https://github.com/apache/spark/pull/47463] > Fix pyspark simpleString method for collations > -- > > Key: SPARK-48981 > URL: https://issues.apache.org/jira/browse/SPARK-48981 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`
[ https://issues.apache.org/jira/browse/SPARK-48987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48987: Assignee: BingKun Pan > Make `curl` retry 3 times in `bin/mvn` > -- > > Key: SPARK-48987 > URL: https://issues.apache.org/jira/browse/SPARK-48987 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`
[ https://issues.apache.org/jira/browse/SPARK-48987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48987. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47465 [https://github.com/apache/spark/pull/47465] > Make `curl` retry 3 times in `bin/mvn` > -- > > Key: SPARK-48987 > URL: https://issues.apache.org/jira/browse/SPARK-48987 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48986) Introduce a ColumnNode API
[ https://issues.apache.org/jira/browse/SPARK-48986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48986: --- Labels: pull-request-available (was: ) > Introduce a ColumnNode API > -- > > Key: SPARK-48986 > URL: https://issues.apache.org/jira/browse/SPARK-48986 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > > Introduce an intermediate representation (IR) for Column operations. This > will allow us to share the Column API between the classic and connect Scala > API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
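As a rough illustration of what such an intermediate representation can look like, a hypothetical mini-IR follows; this sketch is not the design from SPARK-48986's PR, just the general shape of a column IR:

{code:scala}
// Hypothetical mini-IR for column operations; both the classic and the
// Connect Column implementations could translate to and from such nodes.
sealed trait ColumnNode
case class Literal(value: Any) extends ColumnNode
case class UnresolvedAttribute(name: String) extends ColumnNode
case class UnresolvedFunction(name: String, args: Seq[ColumnNode]) extends ColumnNode

object ColumnNodeDemo extends App {
  // e.g. col("a") + 1 could be represented as:
  val plusOne = UnresolvedFunction("+", Seq(UnresolvedAttribute("a"), Literal(1)))
  println(plusOne)
}
{code}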
[jira] [Updated] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`
[ https://issues.apache.org/jira/browse/SPARK-48987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48987: --- Labels: pull-request-available (was: ) > Make `curl` retry 3 times in `bin/mvn` > -- > > Key: SPARK-48987 > URL: https://issues.apache.org/jira/browse/SPARK-48987 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`
BingKun Pan created SPARK-48987: --- Summary: Make `curl` retry 3 times in `bin/mvn` Key: SPARK-48987 URL: https://issues.apache.org/jira/browse/SPARK-48987 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45201. --- Fix Version/s: 3.5.2 Resolution: Fixed > NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0 > > > Key: SPARK-45201 > URL: https://issues.apache.org/jira/browse/SPARK-45201 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 3.5.1 >Reporter: Sebastian Daberdaku >Priority: Major > Fix For: 3.5.2 > > Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch > > > I am trying to compile Spark 3.5.0 and make a distribution that supports > Spark Connect and Kubernetes. The compilation seems to complete correctly, > but when I try to run the Spark Connect server on kubernetes I get a > "NoClassDefFoundError" as follows: > {code:java} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) > at > 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) > at > org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) > at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) > at > org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) > at > org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) > at > org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) > at > org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536) > at org.apache.spark.SparkContext.(SparkContext.scal
[jira] [Commented] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868221#comment-17868221 ] XiDuo You commented on SPARK-45201: --- This issue has been fixed by https://github.com/apache/spark/pull/45775 > NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0 > > > Key: SPARK-45201 > URL: https://issues.apache.org/jira/browse/SPARK-45201 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 3.5.1 >Reporter: Sebastian Daberdaku >Priority: Major > Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch > > > I am trying to compile Spark 3.5.0 and make a distribution that supports > Spark Connect and Kubernetes. The compilation seems to complete correctly, > but when I try to run the Spark Connect server on kubernetes I get a > "NoClassDefFoundError" as follows: > {code:java} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) > at > 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) > at > org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) > at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) > at > org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) > at > org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) > at > org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) > at > org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536) > at org.apa
[jira] [Resolved] (SPARK-48414) Fix breaking change in python's `fromJson`
[ https://issues.apache.org/jira/browse/SPARK-48414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48414. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46737 [https://github.com/apache/spark/pull/46737] > Fix breaking change in python's `fromJson` > -- > > Key: SPARK-48414 > URL: https://issues.apache.org/jira/browse/SPARK-48414 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48985) Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala
[ https://issues.apache.org/jira/browse/SPARK-48985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48985: --- Labels: pull-request-available (was: ) > Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala > - > > Key: SPARK-48985 > URL: https://issues.apache.org/jira/browse/SPARK-48985 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: pull-request-available > > There are a number of hard coded expressions in the SparkConnectPlanner. Most > of these expressions are hardcoded because they are missing a proper > constructor, or because they are not registered in the FunctionRegistry. > functions.scala has a similar problem. We should try to remove these > exceptions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48985) Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala
Herman van Hövell created SPARK-48985: - Summary: Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala Key: SPARK-48985 URL: https://issues.apache.org/jira/browse/SPARK-48985 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 4.0.0 Reporter: Herman van Hövell Assignee: Herman van Hövell There are a number of hard coded expressions in the SparkConnectPlanner. Most of these expressions are hardcoded because they are missing a proper constructor, or because they are not registered in the FunctionRegistry. functions.scala has a similar problem. We should try to remove these exceptions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48984) Add Controller Metrics System and Utils
[ https://issues.apache.org/jira/browse/SPARK-48984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48984: --- Labels: pull-request-available (was: ) > Add Controller Metrics System and Utils > --- > > Key: SPARK-48984 > URL: https://issues.apache.org/jira/browse/SPARK-48984 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48984) Add Controller Metrics System and Utils
Zhou JIANG created SPARK-48984: -- Summary: Add Controller Metrics System and Utils Key: SPARK-48984 URL: https://issues.apache.org/jira/browse/SPARK-48984 Project: Spark Issue Type: Sub-task Components: k8s Affects Versions: kubernetes-operator-0.1.0 Reporter: Zhou JIANG -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48935) Make `checkEvaluation` directly check the `Collation` expression itself in UT
[ https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-48935: Summary: Make `checkEvaluation` directly check the `Collation` expression itself in UT (was: Restrictions on`collatinId` should be added to the constructor of `StringType`) > Make `checkEvaluation` directly check the `Collation` expression itself in UT > -- > > Key: SPARK-48935 > URL: https://issues.apache.org/jira/browse/SPARK-48935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48982) [GO] Extract Spark Exceptions from GRPC response
[ https://issues.apache.org/jira/browse/SPARK-48982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48982: --- Labels: pull-request-available (was: ) > [GO] Extract Spark Exceptions from GRPC response > > > Key: SPARK-48982 > URL: https://issues.apache.org/jira/browse/SPARK-48982 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.1 >Reporter: Martin Grund >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48982) [GO] Extract Spark Exceptions from GRPC response
Martin Grund created SPARK-48982: Summary: [GO] Extract Spark Exceptions from GRPC response Key: SPARK-48982 URL: https://issues.apache.org/jira/browse/SPARK-48982 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.1 Reporter: Martin Grund Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48967) Improve performance and memory footprint of "INSERT INTO ... VALUES" Statements
[ https://issues.apache.org/jira/browse/SPARK-48967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48967: --- Labels: pull-request-available (was: ) > Improve performance and memory footprint of "INSERT INTO ... VALUES" > Statements > --- > > Key: SPARK-48967 > URL: https://issues.apache.org/jira/browse/SPARK-48967 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.4 >Reporter: Costas Zarifis >Priority: Major > Labels: pull-request-available > > Currently, very large "INSERT INTO ... VALUES" statements result in > disproportionately large parse trees, as each literal needs to remain in > the parse tree until it eventually gets evaluated into a LocalTable, once > the appropriate analyzer/optimizer rules have been applied. > > This results in increased memory pressure on the driver when such large > statements are generated, which can lead to OOMs and GC pauses. It also > results in suboptimal runtime performance, as the time it takes to apply > analyzer/optimizer rules is typically proportional to the size of the parse > tree. > > Both these issues can be resolved by applying the functions that evaluate the > unresolved table into a local table eagerly from the AST Builder, thus > short-circuiting the evaluation of such statements. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
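To make the scale of the problem concrete, a hedged illustration of the kind of statement involved; it assumes a SparkSession named `spark` and an existing table `some_table`, both of which are placeholders:

{code:scala}
// Each literal tuple below becomes parse-tree nodes that survive until the
// analyzer turns the inline table into a local relation; with 100,000 rows,
// every analyzer/optimizer rule has to walk a very large tree.
val values = (1 to 100000).map(i => s"($i, 'row_$i')").mkString(", ")
spark.sql(s"INSERT INTO some_table VALUES $values")
{code}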
[jira] [Updated] (SPARK-48981) Fix pyspark simpleString method for collations
[ https://issues.apache.org/jira/browse/SPARK-48981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Kandic updated SPARK-48981: -- Summary: Fix pyspark simpleString method for collations (was: Fix pyspark simpleString method) > Fix pyspark simpleString method for collations > -- > > Key: SPARK-48981 > URL: https://issues.apache.org/jira/browse/SPARK-48981 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48928) Log Warning for Calling .unpersist() on Locally Checkpointed RDDs
[ https://issues.apache.org/jira/browse/SPARK-48928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-48928. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47391 [https://github.com/apache/spark/pull/47391] > Log Warning for Calling .unpersist() on Locally Checkpointed RDDs > - > > Key: SPARK-48928 > URL: https://issues.apache.org/jira/browse/SPARK-48928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Mingkang Li >Assignee: Mingkang Li >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > *Summary:* > This change proposes to log a warning message when the {{.unpersist()}} > method is called on RDDs that have been locally checkpointed in Apache Spark. > This aims to inform users about the potential risks of unpersisting such RDDs > without altering the existing behavior of the method. > *Background:* > Local checkpointing in Spark truncates the lineage of an RDD, meaning that > the RDD cannot be recomputed from its source. If an RDD that has been locally > checkpointed is unpersisted, it loses its data and cannot be regenerated. > This can lead to job failures if subsequent actions or transformations are > attempted on the unpersisted RDD. > *Proposed Change:* > To mitigate this issue, a warning message will be logged whenever > {{.unpersist()}} is called on a locally checkpointed RDD. This approach > maintains the current functionality while alerting users to the potential > consequences of their actions. This change is intended to be non-disruptive > and is a step towards better user awareness and debugging. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
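A small example of the hazard behind this warning, assuming a SparkContext `sc`; whether the second count actually fails depends on timing and scheduling, which is why the ticket proposes a warning rather than an error:

{code:scala}
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.localCheckpoint() // truncates lineage; data will live only in block storage
rdd.count()           // materializes the local checkpoint
rdd.unpersist()       // drops the checkpointed blocks, but lineage is already gone
rdd.count()           // may now fail: there is nothing left to recompute from
{code}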
[jira] [Updated] (SPARK-48347) [M0] Support for WHILE statement
[ https://issues.apache.org/jira/browse/SPARK-48347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48347: --- Labels: pull-request-available (was: ) > [M0] Support for WHILE statement > > > Key: SPARK-48347 > URL: https://issues.apache.org/jira/browse/SPARK-48347 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > Labels: pull-request-available > > Add support for WHILE statements to the SQL scripting parser & interpreter. > > For more details, the design doc can be found in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48925) Introduce interface to ensure extra strategy do planning of scan plan (with additional filters and projections)
[ https://issues.apache.org/jira/browse/SPARK-48925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48925: --- Labels: pull-request-available (was: ) > Introduce interface to ensure extra strategy do planning of scan plan (with > additional filters and projections) > --- > > Key: SPARK-48925 > URL: https://issues.apache.org/jira/browse/SPARK-48925 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Uros Stankovic >Priority: Minor > Labels: pull-request-available > > If we have a plan that contains a scan with a filter (or project) as its > parent, we may want to plan the scan using extra strategies instead of > DataSourceV2Strategy. > One of the use cases: the Snowflake and BigQuery connectors have their own > strategies, and we want to prohibit DataSourceV2Strategy from planning the > scan node. > Although extra strategies have priority, it can happen that such a strategy > fails to plan Filter->Relation and can plan only the Relation without its > parent. In that case, DataSourceV2Strategy will jump in and plan > Filter->Relation in one pass. > So we want the ability to prevent that for certain Scan classes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
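For context, a hedged sketch of how a connector's extra strategy plugs in and why matching granularity matters; the object name is hypothetical and the body is deliberately left unimplemented:

{code:scala}
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical connector strategy. If a strategy like this only matches the
// bare relation, Spark's built-in DataSourceV2Strategy can still claim the
// whole Filter->Relation pair first, which is the situation the ticket
// wants to be able to prevent for certain Scan classes.
object ConnectorScanStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Filter(condition, relation) =>
      // plan `relation` with `condition` pushed into the connector scan ...
      Nil // returning Nil defers to other strategies in this sketch
    case _ => Nil
  }
}

// Registration (extra strategies are consulted before the built-in ones):
// spark.experimental.extraStrategies = Seq(ConnectorScanStrategy)
{code}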
[jira] [Updated] (SPARK-48980) Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion`
[ https://issues.apache.org/jira/browse/SPARK-48980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48980: --- Labels: pull-request-available (was: ) > Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion` > --- > > Key: SPARK-48980 > URL: https://issues.apache.org/jira/browse/SPARK-48980 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48980) Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion`
Ruifeng Zheng created SPARK-48980: - Summary: Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion` Key: SPARK-48980 URL: https://issues.apache.org/jira/browse/SPARK-48980 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48979) CONV function behaves inconsistently
Dylan He created SPARK-48979: Summary: CONV function behaves inconsistently Key: SPARK-48979 URL: https://issues.apache.org/jira/browse/SPARK-48979 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Dylan He

I'm currently working on the CONV function, and I found something confusing about the implementation in Spark. All code below is from NumberConverter.scala.

h3. negative and signed situation

{code:sql}
spark-sql (default)> select conv('FFFE', 16, -16);
-2
spark-sql (default)> select conv('-FFFE', 16, -16);
-2
{code}

Ideally, these two queries should yield different results, but they both return -2.

{code:java}
if (toBase < 0 && v < 0) {
  v = -v
  negative = true
}
{code}

According to the code above, when toBase < 0 and v < 0, the negative flag is set to true regardless of its original value. This leads to incorrect results, as in the examples above, because the negative sign is ignored in the second case. A potential adjustment is negative = !negative, which would correctly interpret the double negation and yield 2.

h3. ansi mode

{code:java}
if (negative && toBase > 0) {
  if (v < 0) {
    v = -1
  } else {
    v = -v
  }
}
{code}

Here, -1 is used to indicate an overflow condition, but no exception is thrown when ANSI mode is enabled, unlike the overflow handling in the encode method.

h3. overflow check

{code:java}
val bound = java.lang.Long.divideUnsigned(-1 - radix, radix)
if (v >= bound) {...}
{code}

The inclusion of the equality in the overflow check seems unnecessary.

I am still learning the functions in Spark. Please feel free to point out any mistakes I might have made. Some of these questions are also mentioned in [SPARK-44943|https://issues.apache.org/jira/browse/SPARK-44943].

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
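For clarity, the adjustment suggested in the report above, sketched in place; the surrounding lines are paraphrased from NumberConverter.scala as quoted in the ticket, not a merged fix:

{code:scala}
if (toBase < 0 && v < 0) {
  v = -v
  negative = !negative // toggle instead of overwriting, so that
                       // conv('-FFFE', 16, -16) keeps its sign and yields 2
}
{code}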
[jira] [Resolved] (SPARK-48970) Avoid using SparkSession.getActiveSession in spark ML reader/writer
[ https://issues.apache.org/jira/browse/SPARK-48970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu resolved SPARK-48970. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47453 [https://github.com/apache/spark/pull/47453] > Avoid using SparkSession.getActiveSession in spark ML reader/writer > --- > > Key: SPARK-48970 > URL: https://issues.apache.org/jira/browse/SPARK-48970 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 4.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > `SparkSession.getActiveSession` returns a thread-local session, but the Spark > ML reader / writer might be executed in a different thread, which causes > `SparkSession.getActiveSession` to return None. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
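A small illustration of the thread-local behavior described above; this is a hedged sketch, and the pool thread is deliberately created before the session so that it cannot inherit the active session:

{code:scala}
import java.util.concurrent.Executors
import org.apache.spark.sql.SparkSession

// Force the pool to create its worker thread before any session exists:
val pool = Executors.newFixedThreadPool(1)
pool.submit(new Runnable { override def run(): Unit = () }).get()

val spark = SparkSession.builder().master("local[1]").getOrCreate()
println(SparkSession.getActiveSession) // Some(...) on this thread

pool.submit(new Runnable {
  override def run(): Unit =
    println(SparkSession.getActiveSession) // None: thread predates the session
})
{code}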
[jira] [Created] (SPARK-48978) Optimize collation support for ASCII strings (all collations)
Uroš Bojanić created SPARK-48978: Summary: Optimize collation support for ASCII strings (all collations) Key: SPARK-48978 URL: https://issues.apache.org/jira/browse/SPARK-48978 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48977) Optimize collation support for string search (UTF8_LCASE collation)
Uroš Bojanić created SPARK-48977: Summary: Optimize collation support for string search (UTF8_LCASE collation) Key: SPARK-48977 URL: https://issues.apache.org/jira/browse/SPARK-48977 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47988) When the collationId is invalid, throw `COLLATION_INVALID_ID`
[ https://issues.apache.org/jira/browse/SPARK-47988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić resolved SPARK-47988. -- Resolution: Won't Fix > When the collationId is invalid, throw `COLLATION_INVALID_ID` > - > > Key: SPARK-47988 > URL: https://issues.apache.org/jira/browse/SPARK-47988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47988) When the collationId is invalid, throw `COLLATION_INVALID_ID`
[ https://issues.apache.org/jira/browse/SPARK-47988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868008#comment-17868008 ] Uroš Bojanić commented on SPARK-47988: -- Closing this ticket for now, given that it's no longer relevant after the recent CollationFactory rewrite. > When the collationId is invalid, throw `COLLATION_INVALID_ID` > - > > Key: SPARK-47988 > URL: https://issues.apache.org/jira/browse/SPARK-47988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48338: -- Assignee: (was: Apache Spark) > Sql Scripting support for Spark SQL > --- > > Key: SPARK-48338 > URL: https://issues.apache.org/jira/browse/SPARK-48338 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - > OSS.pdf > > > Design doc for this feature is in attachment. > High level example of Sql Script: > ``` > BEGIN > DECLARE c INT = 10; > WHILE c > 0 DO > INSERT INTO tscript VALUES (c); > SET c = c - 1; > END WHILE; > END > ``` > High level motivation behind this feature: > SQL Scripting gives customers the ability to develop complex ETL and analysis > entirely in SQL. Until now, customers have had to write verbose SQL > statements or combine SQL + Python to efficiently write business logic. > Coming from another system, customers have to choose whether or not they want > to migrate to pyspark. Some customers end up not using Spark because of this > gap. SQL Scripting is a key milestone towards enabling SQL practitioners to > write sophisticated queries, without the need to use pyspark. Further, SQL > Scripting is a necessary step towards support for SQL Stored Procedures, and > along with SQL Variables (released) and Temp Tables (in progress), will allow > for more seamless data warehouse migrations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48338: -- Assignee: Apache Spark > Sql Scripting support for Spark SQL > --- > > Key: SPARK-48338 > URL: https://issues.apache.org/jira/browse/SPARK-48338 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - > OSS.pdf > > > Design doc for this feature is in attachment. > High level example of Sql Script: > ``` > BEGIN > DECLARE c INT = 10; > WHILE c > 0 DO > INSERT INTO tscript VALUES (c); > SET c = c - 1; > END WHILE; > END > ``` > High level motivation behind this feature: > SQL Scripting gives customers the ability to develop complex ETL and analysis > entirely in SQL. Until now, customers have had to write verbose SQL > statements or combine SQL + Python to efficiently write business logic. > Coming from another system, customers have to choose whether or not they want > to migrate to pyspark. Some customers end up not using Spark because of this > gap. SQL Scripting is a key milestone towards enabling SQL practitioners to > write sophisticated queries, without the need to use pyspark. Further, SQL > Scripting is a necessary step towards support for SQL Stored Procedures, and > along with SQL Variables (released) and Temp Tables (in progress), will allow > for more seamless data warehouse migrations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`
[ https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48976: -- Assignee: (was: Apache Spark) > Improve the docs related to `variable` > -- > > Key: SPARK-48976 > URL: https://issues.apache.org/jira/browse/SPARK-48976 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`
[ https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48976: -- Assignee: Apache Spark > Improve the docs related to `variable` > -- > > Key: SPARK-48976 > URL: https://issues.apache.org/jira/browse/SPARK-48976 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48910) Slow linear searches in PreprocessTableCreation
[ https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48910: -- Assignee: (was: Apache Spark) > Slow linear searches in PreprocessTableCreation > --- > > Key: SPARK-48910 > URL: https://issues.apache.org/jira/browse/SPARK-48910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vladimir Golubev >Priority: Major > Labels: pull-request-available > > PreprocessTableCreation does Seq.contains over partition columns, which > becomes very slow in case of 1000s of partitions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
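A self-contained illustration of the cost pattern described in SPARK-48910; the numbers are made up, and the real code lives in PreprocessTableCreation:

{code:scala}
// With a Seq, each .contains is a linear scan: checking n columns against
// n partition columns is O(n^2) and degrades badly at 1000s of partitions.
val partitionCols: Seq[String] = (1 to 5000).map(i => s"p$i")
partitionCols.foreach(c => partitionCols.contains(c)) // slow path

// Building a Set once makes each membership check effectively O(1):
val partitionColSet = partitionCols.toSet
partitionCols.foreach(c => partitionColSet.contains(c)) // fast path
{code}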
[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`
[ https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48976: -- Assignee: (was: Apache Spark) > Improve the docs related to `variable` > -- > > Key: SPARK-48976 > URL: https://issues.apache.org/jira/browse/SPARK-48976 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48910) Slow linear searches in PreprocessTableCreation
[ https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48910: -- Assignee: Apache Spark > Slow linear searches in PreprocessTableCreation > --- > > Key: SPARK-48910 > URL: https://issues.apache.org/jira/browse/SPARK-48910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vladimir Golubev >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > PreprocessTableCreation does Seq.contains over partition columns, which > becomes very slow in case of 1000s of partitions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`
[ https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48976: -- Assignee: (was: Apache Spark) > Improve the docs related to `variable` > -- > > Key: SPARK-48976 > URL: https://issues.apache.org/jira/browse/SPARK-48976 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala
[ https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48761: -- Assignee: (was: Apache Spark) > Add clusterBy DataFrameWriter API for Scala > --- > > Key: SPARK-48761 > URL: https://issues.apache.org/jira/browse/SPARK-48761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaheng Tang >Priority: Major > Labels: pull-request-available > > Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to > interact with clustered tables using DataFrameWriter API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
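For reference, a hedged usage sketch of the API shape this ticket proposes; the exact signature is the proposal, not a released API, and `df` is assumed to be an existing DataFrame:

{code:scala}
// Write a table clustered by two columns via the DataFrameWriter:
df.write
  .clusterBy("clusteringColumn1", "clusteringColumn2")
  .saveAsTable("clustered_table")
{code}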
[jira] [Assigned] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala
[ https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48761: -- Assignee: Apache Spark > Add clusterBy DataFrameWriter API for Scala > --- > > Key: SPARK-48761 > URL: https://issues.apache.org/jira/browse/SPARK-48761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaheng Tang >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to > interact with clustered tables using DataFrameWriter API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
[ https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48975: -- Assignee: (was: Apache Spark) > Remove unnecessary `ScalaReflectionLock` definition from `protobuf` > --- > > Key: SPARK-48975 > URL: https://issues.apache.org/jira/browse/SPARK-48975 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
[ https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48975: -- Assignee: Apache Spark > Remove unnecessary `ScalaReflectionLock` definition from `protobuf` > --- > > Key: SPARK-48975 > URL: https://issues.apache.org/jira/browse/SPARK-48975 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48976) Improve the docs related to `variable`
[ https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48976: --- Labels: pull-request-available (was: ) > Improve the docs related to `variable` > -- > > Key: SPARK-48976 > URL: https://issues.apache.org/jira/browse/SPARK-48976 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
[ https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48975: --- Labels: pull-request-available (was: ) > Remove unnecessary `ScalaReflectionLock` definition from `protobuf` > --- > > Key: SPARK-48975 > URL: https://issues.apache.org/jira/browse/SPARK-48975 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
Yang Jie created SPARK-48975: Summary: Remove unnecessary `ScalaReflectionLock` definition from `protobuf` Key: SPARK-48975 URL: https://issues.apache.org/jira/browse/SPARK-48975 Project: Spark Issue Type: Improvement Components: Protobuf Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48957) Return classified store load exception type on load failure
[ https://issues.apache.org/jira/browse/SPARK-48957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48957. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47431 [https://github.com/apache/spark/pull/47431] > Return classified store load exception type on load failure > --- > > Key: SPARK-48957 > URL: https://issues.apache.org/jira/browse/SPARK-48957 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Return classified store load exception type on load failure -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48973) Unexpected behavior when the Spark mask function handles strings containing invalid UTF-8 or wide characters
[ https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-48973: Description: Applying the Spark mask function to a string that contains an invalid UTF-8 byte sequence or a wide (non-BMP) character causes unexpected behavior. Example: using `*` to mask a string that contains the wide character {{}} {code:sql} select mask("", "Y", "y", "n", "*"); {code} the result is ** instead of *. It looks like mask treats {{}} as 2 characters. Example: using the wide character {{}} as the mask character produces garbled output {code:sql} select mask("ABC", ""); {code} the result is `???`. Example: masking a string that contains an invalid UTF-8 character {code:sql} select mask("\xED"); {code} the result is `xXX` instead of `\xED`; it looks like Spark treats it as the four characters `\`, `x`, `E`, `D`. It looks like mask can only handle BMP characters (i.e. single 16-bit code units) and cannot guarantee results for invalid UTF-8 or wide characters. My question is: *is this a limitation/bug of the mask function, or does mask handle only BMP characters by design?* If it is a limitation, could this be noted in the mask function's documentation or comments?
> Unexpected behavior when the Spark mask function handles strings containing
> invalid UTF-8 or wide characters
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
> Issue Type: Question
> Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> Applying the Spark mask function to a string that contains an invalid UTF-8
> byte sequence or a wide (non-BMP) character causes unexpected behavior.
> Example: using `*` to mask a string that contains the wide character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> the result is ** instead of *. It looks like mask treats {{}} as 2 characters.
> Example: using the wide character {{}} as the mask character produces garbled
> output
> {code:sql}
> select mask("ABC", "");
> {code}
> the result is `???`.
> Example: masking a string that contains an invalid UTF-8 character
> {code:sql}
> select mask("\xED");
> {code}
> the result is `xXX` instead of `\xED`; it looks like Spark treats it as the
> four characters `\`, `x`, `E`, `D`.
> It looks like mask can only handle BMP characters (i.e. single 16-bit code
> units) and cannot guarantee results for invalid UTF-8 or wide characters.
> My question is: *is this a limitation/bug of the mask function, or does mask
> handle only BMP characters by design?*
> If it is a limitation, could this be noted in the mask function's
> documentation or comments?
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
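The "treats it as 2 characters" observation above is consistent with counting UTF-16 code units rather than Unicode code points: JVM strings, which Spark uses internally, represent a supplementary-plane (non-BMP) character as a surrogate pair of two code units. Below is a minimal sketch of that distinction; the specific wide character was lost from this report, so an arbitrary non-BMP character stands in:
{code:python}
# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 encodes it
# as a surrogate pair (two 16-bit code units).
s = "\U0001F600"
print(len(s))                           # 1  (Python counts code points)
print(len(s.encode("utf-16-le")) // 2)  # 2  (UTF-16 code units, as the JVM counts)
{code}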
[jira] [Updated] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector
[ https://issues.apache.org/jira/browse/SPARK-48964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48964: --- Labels: pull-request-available (was: )
> Fix the discrepancy between implementation, comment and documentation of
> option recursive.fields.max.depth in ProtoBuf connector
> ----
>
> Key: SPARK-48964
> URL: https://issues.apache.org/jira/browse/SPARK-48964
> Project: Spark
> Issue Type: Documentation
> Components: Connect
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2, 3.5.3
>Reporter: Yuchen Liu
>Priority: Major
> Labels: pull-request-available
>
> After the three PRs ([https://github.com/apache/spark/pull/38922,]
> [https://github.com/apache/spark/pull/40011,]
> [https://github.com/apache/spark/pull/40141]), all working on the same
> option, some legacy comments and documentation were not updated to match the
> latest implementation. This task should consolidate them. Below is the
> correct description of the behavior.
> The `recursive.fields.max.depth` parameter can be specified in the
> from_protobuf options to control the maximum allowed recursion depth for a
> field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields,
> setting it to 2 allows a field to be recursed once, and setting it to 3
> allows it to be recursed twice. Attempting to set it to a value greater than
> 10 is not allowed. If it is set to a value smaller than 1, recursive fields
> are not permitted. The default value of the option is -1. If a protobuf
> record has greater depth for recursive fields than the allowed value, it
> will be truncated and some fields may be discarded. This check is based on
> the fully qualified field type. The SQL schema for the protobuf message
> {code:java}
> message Person { string name = 1; Person bff = 2 }{code}
> will vary based on the value of `recursive.fields.max.depth`:
> {code:java}
> 1: struct<name: string>
> 2: struct<name: string, bff: struct<name: string>>
> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>>
> ...
> {code}
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
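As a usage illustration of the option described above, here is a minimal sketch (not part of the ticket): the DataFrame {{df}}, its binary column {{value}}, and the descriptor path {{/tmp/person.desc}} are assumed placeholders.
{code:python}
from pyspark.sql.protobuf.functions import from_protobuf

# Assumed setup: `df` holds serialized Person messages in a binary column
# `value`, and /tmp/person.desc is a descriptor set produced by
# `protoc --descriptor_set_out`. Both are hypothetical placeholders.
options = {"recursive.fields.max.depth": "2"}  # Person.bff may recurse once

parsed = df.select(
    from_protobuf(df.value, "Person", descFilePath="/tmp/person.desc",
                  options=options).alias("person")
)
parsed.printSchema()
# Per the description above, the expected schema is:
# person: struct<name: string, bff: struct<name: string>>
{code}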
[jira] [Resolved] (SPARK-48972) Unify the literal string handling
[ https://issues.apache.org/jira/browse/SPARK-48972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48972. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47454 [https://github.com/apache/spark/pull/47454] > Unify the literal string handling > - > > Key: SPARK-48972 > URL: https://issues.apache.org/jira/browse/SPARK-48972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48972) Unify the literal string handling
[ https://issues.apache.org/jira/browse/SPARK-48972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48972: Assignee: Ruifeng Zheng > Unify the literal string handling > - > > Key: SPARK-48972 > URL: https://issues.apache.org/jira/browse/SPARK-48972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org