[jira] [Commented] (SPARK-48992) applyInPandas does not respect streaming watermark

2024-07-24 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868515#comment-17868515
 ] 

Jungtaek Lim commented on SPARK-48992:
--

I understand the confusion, but you'll need to use applyInPandasWithState, not 
applyInPandas. applyInPandas is intended for batch use; in a streaming query it 
is applied independently to each microbatch, so it does not respect the watermark.
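
For reference, a minimal (untested) sketch of that direction, adapted from the repro quoted below and assuming the same active SparkSession: the running count per bucket is kept in state and emitted in update mode. Closing a bucket only once the watermark passes would additionally need an event-time timeout, which is omitted here.

{code:python}
import pandas as pd
from pyspark.sql.functions import window
from pyspark.sql.streaming.state import GroupStateTimeout

df_source_stream = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 3)
    .load()
    .withColumn("bucket", window("timestamp", "10 seconds").end)
)

def count_per_bucket(key, pdf_iter, state):
    # Running count per bucket, carried across microbatches in state.
    count = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"bucket": [key[0]], "count": [count]})

df = (
    df_source_stream
    .withWatermark("bucket", "10 seconds")
    .groupBy("bucket")
    .applyInPandasWithState(
        count_per_bucket,
        outputStructType="bucket TIMESTAMP, count INT",
        stateStructType="count INT",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
)
{code}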

> applyInPandas does not respect streaming watermark
> --
>
> Key: SPARK-48992
> URL: https://issues.apache.org/jira/browse/SPARK-48992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
> Environment: Azure Databricks runtime 14.3 LTS
>Reporter: Richard Swinbank
>Priority: Minor
>
> When I use GroupedData.applyInPandas to implement aggregation in a streaming 
> query, it fails to respect a watermark specified using 
> DataFrame.withWatermark.
> This query reproduces the behaviour I'm seeing:
>  
> {code:python}
> from pyspark.sql.functions import window
> from typing import Tuple
> import pandas as pd
> df_source_stream = (
> spark.readStream
> .format("rate")
> .option("rowsPerSecond", 3)
> .load()
> .withColumn("bucket", window("timestamp", "10 seconds").end)
> )
> def my_function(
> key: Tuple[str], df: pd.DataFrame
> ) -> pd.DataFrame:
> return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})
> df = (
> df_source_stream
> .withWatermark("bucket", "10 seconds")
> .groupBy("bucket")
> .applyInPandas(my_function, "bucket TIMESTAMP, count INT")
> )
> display(df)
> {code}
> I expect the output of the query to contain one row per {{bucket}} value, but 
> a new row is emitted for each incoming microbatch.
> In contrast, an out of the box aggregate behaves as expected. For example:
> {code:python}
> df = (
> df_source_stream
> .withWatermark("bucket", "10 seconds")
> .groupBy("bucket")
> .count()  # standard aggregate in place of applyInPandas
> )
> {code}
> The output of this query contains *one* row per {{bucket}} value.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48996) Allow bare literals for __and__ and __or__ of Column

2024-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48996:
---
Labels: pull-request-available  (was: )

> Allow bare literals for __and__ and __or__ of Column
> 
>
> Key: SPARK-48996
> URL: https://issues.apache.org/jira/browse/SPARK-48996
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48997) Maintenance thread pool error should not cause the entire executor to crash

2024-07-24 Thread Neil Ramaswamy (Jira)
Neil Ramaswamy created SPARK-48997:
--

 Summary: Maintenance thread pool error should not cause the entire 
executor to crash
 Key: SPARK-48997
 URL: https://issues.apache.org/jira/browse/SPARK-48997
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Neil Ramaswamy


Today, it's possible for an exception within a thread in the maintenance pool 
to cause the entire executor to crash. Here's how:
 # An error occurs in a maintenance pool thread
 # It gets passed to the maintenance task thread, which `throw`s it
 # That gets caught by `onError`, which `.stop()`s the maintenance thread pool
 # If any of the maintenance pool threads are waiting on a lock, they will 
receive an `InterruptedException` (this happens if they are verifying whether 
their state store instance is active)
 # This `InterruptedException` is not caught, since it is not `NonFatal`
 # This uncaught exception bubbles all the way to the 
`SparkUncaughtExceptionHandler`, causing the executor to exit

A better fix is to modify the maintenance thread pool to `unload` only the 
providers that experience errors, rather than stopping the entire thread pool.
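
This is not Spark's internal API; the following is a generic Python sketch of the proposed isolation pattern, with all names made up for illustration. A failure in one provider's maintenance task unloads only that provider, and the pool keeps running.

{code:python}
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger(__name__)

class Provider:
    """Hypothetical stand-in for a state store provider."""
    def __init__(self, name):
        self.name = name
    def do_maintenance(self):
        pass  # snapshot / cleanup work would happen here

loaded_providers = {p.name: p for p in (Provider("store-0"), Provider("store-1"))}

def run_maintenance(provider):
    try:
        provider.do_maintenance()
    except Exception:
        # Unload only the failing provider; the pool and the process keep running.
        log.exception("Maintenance failed for %s; unloading it", provider.name)
        loaded_providers.pop(provider.name, None)

with ThreadPoolExecutor(max_workers=2) as pool:
    for p in list(loaded_providers.values()):
        pool.submit(run_maintenance, p)
{code}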






[jira] [Created] (SPARK-48996) Allow bare literals for __and__ and __or__ of Column

2024-07-24 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-48996:
-

 Summary: Allow bare literals for __and__ and __or__ of Column
 Key: SPARK-48996
 URL: https://issues.apache.org/jira/browse/SPARK-48996
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin
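
The description is empty; a hypothetical illustration of what the title appears to describe follows, assuming an active SparkSession named spark: bare Python literals accepted as operands of a Column's & and |, equivalent to wrapping them in lit().

{code:python}
from pyspark.sql import functions as F

df = spark.range(3).withColumn("flag", F.col("id") > 1)

# Explicitly wrapped literal:
df.select((F.col("flag") & F.lit(True)).alias("a")).show()

# What the title appears to describe (hypothetical): a bare literal accepted
# directly, equivalent to the lit() form above.
df.select((F.col("flag") & True).alias("b")).show()
{code}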









[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-24 Thread psyren99 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868501#comment-17868501
 ] 

psyren99 commented on SPARK-48937:
--

[~uros-db] Still working on it; I should have it done in a couple of days.

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in Spark 
> ({*}str_to_map{*}). First confirm what the expected behaviour is for this 
> function when given collated strings, and then move on to implementation and 
> testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in Spark SQL, and feel 
> free to use your chosen Spark SQL editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and the 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Commented] (SPARK-48995) Column.endswith(None) occasionally causes NPE

2024-07-24 Thread Mithun Radhakrishnan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868494#comment-17868494
 ] 

Mithun Radhakrishnan commented on SPARK-48995:
--

To the untrained eye, it would appear that the {{None}} argument to 
{{Column.endswith()}} turns up as a null-Column-reference in {{Column.fn}}.
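
If so, an untested workaround sketch while the bug stands would be to pass an explicit null literal instead of a bare Python {{None}}, so that {{endsWith}} receives a non-null Column reference (reusing {{df}} from the repro quoted below; the result is simply NULL for every row):

{code:python}
import pyspark.sql.functions as f

# Untested: wrap the null in lit() so a real Column object reaches endsWith.
df.select(f.col("s").endswith(f.lit(None))).collect()
{code}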

> Column.endswith(None) occasionally causes NPE
> -
>
> Key: SPARK-48995
> URL: https://issues.apache.org/jira/browse/SPARK-48995
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
> Environment: Tested from {{pyspark}} shell, on Apache Spark 4.0.
>Reporter: Mithun Radhakrishnan
>Priority: Major
>
> This one is pretty hard to repro, since it only seems to happen occasionally.
> Invoking `Column.endswith()` seems to result in an NPE, with Spark 4.0:
> {code:python}
> from pyspark.sql.types import *
> import pyspark.sql.functions as f
> schema = StructType([StructField("s", StringType(), True)])
> strings = [Row("abc"), Row("bcd"), Row(None)]
> df = sc.parallelize(strings).toDF(schema)
> df.select( f.col('s').endswith(None) ).collect()
> {code}
> Here is the resulting stack trace:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o205.endsWith.
> : java.lang.NullPointerException: Cannot invoke 
> "org.apache.spark.sql.Column.expr()" because "x$1" is null
>   at org.apache.spark.sql.Column$.$anonfun$fn$2(Column.scala:77)
>   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
>   at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
>   at org.apache.spark.sql.Column$.$anonfun$fn$1(Column.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
>   at org.apache.spark.sql.package$.withOrigin(package.scala:111)
>   at org.apache.spark.sql.Column$.fn(Column.scala:76)
>   at org.apache.spark.sql.Column$.fn(Column.scala:64)
>   at org.apache.spark.sql.Column.fn(Column.scala:169)
>   at org.apache.spark.sql.Column.endsWith(Column.scala:1078)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
> This seems to point to {{Column::fn}}, which looks new to Spark 4.0.






[jira] [Created] (SPARK-48995) Column.endswith(None) occasionally causes NPE

2024-07-24 Thread Mithun Radhakrishnan (Jira)
Mithun Radhakrishnan created SPARK-48995:


 Summary: Column.endswith(None) occasionally causes NPE
 Key: SPARK-48995
 URL: https://issues.apache.org/jira/browse/SPARK-48995
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
 Environment: Tested from {{pyspark}} shell, on Apache Spark 4.0.
Reporter: Mithun Radhakrishnan


This one is pretty hard to repro, since it only seems to happen occasionally.

Invoking `Column.endswith()` seems to result in an NPE, with Spark 4.0:

{code:python}
from pyspark.sql.types import *
import pyspark.sql.functions as f

schema = StructType([StructField("s", StringType(), True)])

strings = [Row("abc"), Row("bcd"), Row(None)]

df = sc.parallelize(strings).toDF(schema)

df.select( f.col('s').endswith(None) ).collect()
{code}

Here is the resulting stack trace:
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o205.endsWith.
: java.lang.NullPointerException: Cannot invoke 
"org.apache.spark.sql.Column.expr()" because "x$1" is null
  at org.apache.spark.sql.Column$.$anonfun$fn$2(Column.scala:77)
  at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
  at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
  at org.apache.spark.sql.Column$.$anonfun$fn$1(Column.scala:77)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
  at org.apache.spark.sql.package$.withOrigin(package.scala:111)
  at org.apache.spark.sql.Column$.fn(Column.scala:76)
  at org.apache.spark.sql.Column$.fn(Column.scala:64)
  at org.apache.spark.sql.Column.fn(Column.scala:169)
  at org.apache.spark.sql.Column.endsWith(Column.scala:1078)
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
  at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
  at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
  at java.base/java.lang.Thread.run(Thread.java:840)
{code}

This seems to point to {{Column::fn}}, which looks new to Spark 4.0.






[jira] [Commented] (SPARK-48993) Maximum number of maxRecursiveFieldDepth should be a spark conf

2024-07-24 Thread Wei Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868457#comment-17868457
 ] 

Wei Liu commented on SPARK-48993:
-

I'll follow up on this.

> Maximum number of maxRecursiveFieldDepth should be a spark conf
> ---
>
> Key: SPARK-48993
> URL: https://issues.apache.org/jira/browse/SPARK-48993
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>
> [https://github.com/apache/spark/pull/38922#discussion_r1051294998]
>  
> There is no reason to hard-code the value 10 here.






[jira] [Created] (SPARK-48993) Maximum number of maxRecursiveFieldDepth should be a spark conf

2024-07-24 Thread Wei Liu (Jira)
Wei Liu created SPARK-48993:
---

 Summary: Maximum number of maxRecursiveFieldDepth should be a 
spark conf
 Key: SPARK-48993
 URL: https://issues.apache.org/jira/browse/SPARK-48993
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Liu


[https://github.com/apache/spark/pull/38922#discussion_r1051294998]

 

There is no reason to hard-code the value 10 here.
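
A hypothetical illustration of the user-facing side of such a change, with a made-up conf name:

{code:python}
# Hypothetical conf name, for illustration only; today the limit of 10 is
# hard-coded instead of being read from a setting like this.
spark.conf.set("spark.sql.maxRecursiveFieldDepth", 15)
print(spark.conf.get("spark.sql.maxRecursiveFieldDepth"))
{code}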






[jira] [Updated] (SPARK-48992) applyInPandas does not respect streaming watermark

2024-07-24 Thread Richard Swinbank (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Swinbank updated SPARK-48992:
-
Description: 
When I use GroupedData.applyInPandas to implement aggregation in a streaming 
query, it fails to respect a watermark specified using DataFrame.withWatermark.

This query reproduces the behaviour I'm seeing:
 
{code:python}
from pyspark.sql.functions import window
from typing import Tuple
import pandas as pd

df_source_stream = (
spark.readStream
.format("rate")
.option("rowsPerSecond", 3)
.load()
.withColumn("bucket", window("timestamp", "10 seconds").end)
)

def my_function(
key: Tuple[str], df: pd.DataFrame
) -> pd.DataFrame:
return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})

df = (
df_source_stream
.withWatermark("bucket", "10 seconds")
.groupBy("bucket")
.applyInPandas(my_function, "bucket TIMESTAMP, count INT")
)
display(df)
{code}
I expect the output of the query to contain one row per {{bucket}} value, but a 
new row is emitted for each incoming microbatch.

In contrast, an out of the box aggregate behaves as expected. For example:
{code:python}
df = (
df_source_stream
.withWatermark("bucket", "10 seconds")
.groupBy("bucket")
.count()  # standard aggregate in place of applyInPandas
)
{code}
The output of this query contains *one* row per {{bucket}} value.
 

  was:
When I use GroupedData.applyInPandas to implement aggregation in a streaming 
query, it fails to respect a watermark specified using DataFrame.withWatermark.

This query reproduces the behvaiour I'm seeing:
 
{code:python}
from pyspark.sql.functions import window
from typing import Tuple
import pandas as pd

df_source_stream = (
spark.readStream
.format("rate")
.option("rowsPerSecond", 3)
.load()
.withColumn("bucket", window("timestamp", "10 seconds").end)
)

def my_function(
key: Tuple[str], df: pd.DataFrame
) -> pd.DataFrame:
return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})

df = (
df_source_stream
.withWatermark("bucket", "10 seconds")
.groupBy("bucket")
.applyInPandas(my_function, "bucket TIMESTAMP, count INT")
)
display(df)
{code}
I expect the output of the query to contain one row per {{bucket}} value, but a 
new row is emitted for each incoming microbatch.

In contrast, an out of the box aggregate behaves as expected. For example:
{code:python}
df = (
df_source_stream
.withWatermark("bucket", "10 seconds")
.groupBy("bucket")
.count()  # standard aggregate in place of applyInPandas
)
{code}
The output of this query contains *one* row per {{bucket}} value.
 


> applyInPandas does not respect streaming watermark
> --
>
> Key: SPARK-48992
> URL: https://issues.apache.org/jira/browse/SPARK-48992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
> Environment: Azure Databricks runtime 14.3 LTS
>Reporter: Richard Swinbank
>Priority: Minor
>
> When I use GroupedData.applyInPandas to implement aggregation in a streaming 
> query, it fails to respect a watermark specified using 
> DataFrame.withWatermark.
> This query reproduces the behaviour I'm seeing:
>  
> {code:python}
> from pyspark.sql.functions import window
> from typing import Tuple
> import pandas as pd
> df_source_stream = (
> spark.readStream
> .format("rate")
> .option("rowsPerSecond", 3)
> .load()
> .withColumn("bucket", window("timestamp", "10 seconds").end)
> )
> def my_function(
> key: Tuple[str], df: pd.DataFrame
> ) -> pd.DataFrame:
> return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})
> df = (
> df_source_stream
> .withWatermark("bucket", "10 seconds")
> .groupBy("bucket")
> .applyInPandas(my_function, "bucket TIMESTAMP, count INT")
> )
> display(df)
> {code}
> I expect the output of the query to contain one row per {{bucket}} value, but 
> a new row is emitted for each incoming microbatch.
> In contrast, an out of the box aggregate behaves as expected. For example:
> {code:python}
> df = (
> df_source_stream
> .withWatermark("bucket", "10 seconds")
> .groupBy("bucket")
> .count()  # standard aggregate in place of applyInPandas
> )
> {code}
> The output of this query contains *one* row per {{bucket}} value.
>  






[jira] [Created] (SPARK-48992) applyInPandas does not respect streaming watermark

2024-07-24 Thread Richard Swinbank (Jira)
Richard Swinbank created SPARK-48992:


 Summary: applyInPandas does not respect streaming watermark
 Key: SPARK-48992
 URL: https://issues.apache.org/jira/browse/SPARK-48992
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
 Environment: Azure Databricks runtime 14.3 LTS
Reporter: Richard Swinbank


When I use GroupedData.applyInPandas to implement aggregation in a streaming 
query, it fails to respect a watermark specified using DataFrame.withWatermark.

This query reproduces the behaviour I'm seeing:
 
{code:python}
from pyspark.sql.functions import window
from typing import Tuple
import pandas as pd

df_source_stream = (
spark.readStream
.format("rate")
.option("rowsPerSecond", 3)
.load()
.withColumn("bucket", window("timestamp", "10 seconds").end)
)

def my_function(
key: Tuple[str], df: pd.DataFrame
) -> pd.DataFrame:
return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})

df = (
df_source_stream
.withWatermark("bucket", "10 seconds")
.groupBy("bucket")
.applyInPandas(my_function, "bucket TIMESTAMP, count INT")
)
display(df)
{code}
I expect the output of the query to contain one row per {{bucket}} value, but a 
new row is emitted for each incoming microbatch.

In contrast, an out of the box aggregate behaves as expected. For example:
{code:python}
df = (
df_source_stream
.withWatermark("bucket", "10 seconds")
.groupBy("bucket")
.count()  # standard aggregate in place of applyInPandas
)
{code}
The output of this query contains *one* row per {{bucket}} value.
 






[jira] [Commented] (SPARK-47677) Pandas circular import error in Python 3.10

2024-07-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-47677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868384#comment-17868384
 ] 

Mikó Szilárd commented on SPARK-47677:
--

Hi [~XinrongM],

Is it possible that this change fixed the problem?
[https://github.com/apache/spark/pull/45832]

> Pandas circular import error in Python 3.10 
> 
>
> Key: SPARK-47677
> URL: https://issues.apache.org/jira/browse/SPARK-47677
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {{AttributeError: partially initialized module 'pandas' has no attribute 
> '_pandas_datetime_CAPI' (most likely due to a circular import)}}
>  
> The above error appears in multiple tests with Python 3.10.
> Python 3.11, 3.12 and pypy3 don't have the issue.
>  
> See [https://github.com/apache/spark/actions/runs/8469356110/job/23208894575] 
> for details.






[jira] [Commented] (SPARK-47865) Deflaky PythonForeachWriterSuite

2024-07-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-47865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868365#comment-17868365
 ] 

Mikó Szilárd commented on SPARK-47865:
--

Hi [~dongjoon] ,

Is this a duplicate of https://issues.apache.org/jira/browse/SPARK-47866?

> Deflaky PythonForeachWriterSuite
> 
>
> Key: SPARK-47865
> URL: https://issues.apache.org/jira/browse/SPARK-47865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Updated] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path

2024-07-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48991:
-
Fix Version/s: 3.5.3
   (was: 3.5.2)

> FileStreamSink.hasMetadata handles invalid path
> ---
>
> Key: SPARK-48991
> URL: https://issues.apache.org/jira/browse/SPARK-48991
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>







[jira] [Resolved] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session

2024-07-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48988.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47467
[https://github.com/apache/spark/pull/47467]

> Make DefaultParamsReader/Writer handle metadata with spark session
> --
>
> Key: SPARK-48988
> URL: https://issues.apache.org/jira/browse/SPARK-48988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session

2024-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48988:
---
Labels: pull-request-available  (was: )

> Make DefaultParamsReader/Writer handle metadata with spark session
> --
>
> Key: SPARK-48988
> URL: https://issues.apache.org/jira/browse/SPARK-48988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session

2024-07-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48988:


Assignee: Ruifeng Zheng

> Make DefaultParamsReader/Writer handle metadata with spark session
> --
>
> Key: SPARK-48988
> URL: https://issues.apache.org/jira/browse/SPARK-48988
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48833) Support variant in `InMemoryTableScan`

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48833.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47252
[https://github.com/apache/spark/pull/47252]

> Support variant in `InMemoryTableScan`
> --
>
> Key: SPARK-48833
> URL: https://issues.apache.org/jira/browse/SPARK-48833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, df.cache() does not support tables with variant types. We should 
> add support for this.
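
A minimal repro sketch of what this enables, assuming Spark 4.0's parse_json for producing a variant value:

{code:python}
df = spark.sql("SELECT parse_json('{\"a\": 1}') AS v")
df.cache()   # previously unsupported for variant columns; enabled by this change
df.show()
{code}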






[jira] [Assigned] (SPARK-48833) Support variant in `InMemoryTableScan`

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48833:
---

Assignee: Richard Chen

> Support variant in `InMemoryTableScan`
> --
>
> Key: SPARK-48833
> URL: https://issues.apache.org/jira/browse/SPARK-48833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
>  Labels: pull-request-available
>
> Currently, df.cache() does not support tables with variant types. We should 
> add support for this.






[jira] [Assigned] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress

2024-07-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48567:


Assignee: Hyukjin Kwon

> Pyspark StreamingQuery lastProgress and friend should return actual 
> StreamingQueryProgress
> --
>
> Key: SPARK-48567
> URL: https://issues.apache.org/jira/browse/SPARK-48567
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress

2024-07-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48567.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47470
[https://github.com/apache/spark/pull/47470]

> Pyspark StreamingQuery lastProgress and friend should return actual 
> StreamingQueryProgress
> --
>
> Key: SPARK-48567
> URL: https://issues.apache.org/jira/browse/SPARK-48567
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path

2024-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48991:
---
Labels: pull-request-available  (was: )

> FileStreamSink.hasMetadata handles invalid path
> ---
>
> Key: SPARK-48991
> URL: https://issues.apache.org/jira/browse/SPARK-48991
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48991) FileStreamSink.hasMetadata handles invalid path

2024-07-24 Thread Kent Yao (Jira)
Kent Yao created SPARK-48991:


 Summary: FileStreamSink.hasMetadata handles invalid path
 Key: SPARK-48991
 URL: https://issues.apache.org/jira/browse/SPARK-48991
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.3, 3.5.1, 4.0.0
Reporter: Kent Yao









[jira] [Resolved] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48338.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47404
[https://github.com/apache/spark/pull/47404]

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> The design doc for this feature is attached.
> High-level example of a SQL script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High-level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to write business logic efficiently. 
> Customers coming from another system have to decide whether or not they want 
> to migrate to PySpark, and some end up not using Spark because of this gap. 
> SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries without the need to use PySpark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures and, 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.






[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48338:
---

Assignee: Aleksandar Tomic

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> The design doc for this feature is attached.
> High-level example of a SQL script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High-level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to write business logic efficiently. 
> Customers coming from another system have to decide whether or not they want 
> to migrate to PySpark, and some end up not using Spark because of this gap. 
> SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries without the need to use PySpark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures and, 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.






[jira] [Updated] (SPARK-48990) Unified variable related SQL syntax keywords

2024-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48990:
---
Labels: pull-request-available  (was: )

> Unified variable related SQL syntax keywords
> 
>
> Key: SPARK-48990
> URL: https://issues.apache.org/jira/browse/SPARK-48990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48935) Make `checkEvaluation` directly check the `Collation` expression itself in UT

2024-07-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48935.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47401
[https://github.com/apache/spark/pull/47401]

> Make `checkEvaluation` directly check the `Collation` expression itself in UT 
> --
>
> Key: SPARK-48935
> URL: https://issues.apache.org/jira/browse/SPARK-48935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-48990) Unified variable related SQL syntax keywords

2024-07-24 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48990:
---

 Summary: Unified variable related SQL syntax keywords
 Key: SPARK-48990
 URL: https://issues.apache.org/jira/browse/SPARK-48990
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-24 Thread xuanzhiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuanzhiang updated SPARK-48956:
---
Affects Version/s: 3.2.3
   3.2.2
   3.1.3

> Spark Repartition Task Field Retry Cause Data Duplication
> -
>
> Key: SPARK-48956
> URL: https://issues.apache.org/jira/browse/SPARK-48956
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.3, 3.2.1, 3.2.2, 3.2.3
>Reporter: xuanzhiang
>Priority: Major
> Attachments: image-2024-07-21-18-21-33-888.png, 
> image-2024-07-21-18-22-04-665.png, image-2024-07-22-10-00-45-793.png, 
> image-2024-07-22-14-47-50-773.png
>
>
> The issue seems similar to 
> [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207]






[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication

2024-07-24 Thread xuanzhiang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48956 ]


xuanzhiang deleted comment on SPARK-48956:


was (Author: JIRAUSER295364):
Metric info error: the actual output was 35351985, but duplicate data was 
produced. I will try to reproduce the problem and provide use cases.

> Spark Repartition Task Field Retry Cause Data Duplication
> -
>
> Key: SPARK-48956
> URL: https://issues.apache.org/jira/browse/SPARK-48956
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.3, 3.2.1, 3.2.2, 3.2.3
>Reporter: xuanzhiang
>Priority: Major
> Attachments: image-2024-07-21-18-21-33-888.png, 
> image-2024-07-21-18-22-04-665.png, image-2024-07-22-10-00-45-793.png, 
> image-2024-07-22-14-47-50-773.png
>
>
> The issue seems similar to 
> [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207]






[jira] [Updated] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX

2024-07-23 Thread Mithun Radhakrishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated SPARK-48989:
-
Environment: 
This was tested from the {{spark-shell}}, in local mode.  All Spark versions 
were run with default settings.

Spark 4.0 SNAPSHOT:  Exception.
Spark 4.0 Preview:  Exception.
Spark 3.5.1:  Success.

  was:
This was tested from the {{spark-shell}}, in local mode.  All environments were 
run with default settings.

Spark 4.0 SNAPSHOT:  Exception.
Spark 4.0 Preview:  Exception.
Spark 3.5.1:  Success.


> WholeStageCodeGen error resulting in NumberFormatException when calling 
> SUBSTRING_INDEX
> ---
>
> Key: SPARK-48989
> URL: https://issues.apache.org/jira/browse/SPARK-48989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: This was tested from the {{spark-shell}}, in local mode. 
>  All Spark versions were run with default settings.
> Spark 4.0 SNAPSHOT:  Exception.
> Spark 4.0 Preview:  Exception.
> Spark 3.5.1:  Success.
>Reporter: Mithun Radhakrishnan
>Priority: Major
>
> I seem to be running into a {{NumberFormatException}}, possibly from an error in 
> WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus:
> {code:scala}
> // Create integer table with one null.
> sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
> ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")
> // Exercise substring-index.
> sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
> PARQUET.`/tmp/mytable` ").show()
> {code}
> On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
> exception:
> {code}
> java.lang.NumberFormatException: For input string: "columnartorow_value_0"
>   at 
> java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
>   at java.base/java.lang.Integer.parseInt(Integer.java:668)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
>   at scala.collection.immutable.List.map(List.scala:247)
>   at scala.collection.immutable.List.map(List.scala:79)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
> {code}
> The same query seems to run alright on Spark 3.5.x:
> {code}
> +----+-----+
> | num| subs|
> +----+-----+
> |   1|    a|
> |   2|  a_a|
> |   3|a_a_a|
> |NULL| NULL|
> +----+-----+
> {code}






[jira] [Updated] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX

2024-07-23 Thread Mithun Radhakrishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated SPARK-48989:
-
Environment: 
This was tested from the {{spark-shell}}, in local mode.  All environments were 
run with default settings.

Spark 4.0 SNAPSHOT:  Exception.
Spark 4.0 Preview:  Exception.
Spark 3.5.1:  Success.

  was:
This was tested from the {{spark-shell}}, in local mode.

Spark 4.0 SNAPSHOT:  Exception.
Spark 4.0 Preview:  Exception.
Spark 3.5.1:  Success.


> WholeStageCodeGen error resulting in NumberFormatException when calling 
> SUBSTRING_INDEX
> ---
>
> Key: SPARK-48989
> URL: https://issues.apache.org/jira/browse/SPARK-48989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: This was tested from the {{spark-shell}}, in local mode. 
>  All environments were run with default settings.
> Spark 4.0 SNAPSHOT:  Exception.
> Spark 4.0 Preview:  Exception.
> Spark 3.5.1:  Success.
>Reporter: Mithun Radhakrishnan
>Priority: Major
>
> I seem to be running into a {{NumberFormatException}}, possibly from an error in 
> WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus:
> {code:scala}
> // Create integer table with one null.
> sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
> ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")
> // Exercise substring-index.
> sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
> PARQUET.`/tmp/mytable` ").show()
> {code}
> On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
> exception:
> {code}
> java.lang.NumberFormatException: For input string: "columnartorow_value_0"
>   at 
> java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
>   at java.base/java.lang.Integer.parseInt(Integer.java:668)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
>   at 
> org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
>   at scala.collection.immutable.List.map(List.scala:247)
>   at scala.collection.immutable.List.map(List.scala:79)
>   at 
> org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
>   at 
> org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
> {code}
> The same query seems to run alright on Spark 3.5.x:
> {code}
> +----+-----+
> | num| subs|
> +----+-----+
> |   1|    a|
> |   2|  a_a|
> |   3|a_a_a|
> |NULL| NULL|
> +----+-----+
> {code}






[jira] [Updated] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX

2024-07-23 Thread Mithun Radhakrishnan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated SPARK-48989:
-
Description: 
I seem to be running into a {{NumberFormatException}}, possibly from an error 
in WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus:

{code:scala}

// Create integer table with one null.
sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")

// Exercise substring-index.
sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
PARQUET.`/tmp/mytable` ").show()


{code}

On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
exception:
{code}
java.lang.NumberFormatException: For input string: "columnartorow_value_0"
  at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  at java.base/java.lang.Integer.parseInt(Integer.java:668)
  at 
org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
  at 
org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
  at scala.Option.getOrElse(Option.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
  at 
org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
  at scala.Option.getOrElse(Option.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
  at 
org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
  at 
org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
  at scala.collection.immutable.List.map(List.scala:247)
  at scala.collection.immutable.List.map(List.scala:79)
  at 
org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
  at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
  at 
org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
  at 
org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
  at org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
  at 
org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
  at 
org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
{code}

The same query seems to run alright on Spark 3.5.x:
{code}
+----+-----+
| num| subs|
+----+-----+
|   1|    a|
|   2|  a_a|
|   3|a_a_a|
|NULL| NULL|
+----+-----+

{code}

  was:
I seem to be running into a `NumberFormatException`, possibly from an error in 
WholeStageCodeGen, when I exercise `SUBSTRING_INDEX` with a null row, thus:

{code:scala}

// Create integer table with one null.
sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")

// Exercise substring-index.
sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
PARQUET.`/tmp/mytable` ").show()


{code}

On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
exception:
{code}
java.lang.NumberFormatException: For input string: "columnartorow_value_0"
  at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  at java.base/java.lang.Integer.parseInt(Integer.java:668)
  at 
org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
  at 
org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
  at scala.Option.getOrElse(Optio

[jira] [Created] (SPARK-48989) WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX

2024-07-23 Thread Mithun Radhakrishnan (Jira)
Mithun Radhakrishnan created SPARK-48989:


 Summary: WholeStageCodeGen error resulting in 
NumberFormatException when calling SUBSTRING_INDEX
 Key: SPARK-48989
 URL: https://issues.apache.org/jira/browse/SPARK-48989
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
 Environment: This was tested from the {{spark-shell}}, in local mode.

Spark 4.0 SNAPSHOT:  Exception.
Spark 4.0 Preview:  Exception.
Spark 3.5.1:  Success.
Reporter: Mithun Radhakrishnan


I seem to be running into a `NumberFormatException`, possibly from an error in 
WholeStageCodeGen, when I exercise `SUBSTRING_INDEX` with a null row, thus:

{code:scala}

// Create integer table with one null.
sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) 
").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")

// Exercise substring-index.
sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM 
PARQUET.`/tmp/mytable` ").show()


{code}

On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following 
exception:
{code}
java.lang.NumberFormatException: For input string: "columnartorow_value_0"
  at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  at java.base/java.lang.Integer.parseInt(Integer.java:668)
  at 
org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
  at 
org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
  at scala.Option.getOrElse(Option.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
  at 
org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
  at scala.Option.getOrElse(Option.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
  at 
org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
  at 
org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
  at scala.collection.immutable.List.map(List.scala:247)
  at scala.collection.immutable.List.map(List.scala:79)
  at 
org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
  at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
  at 
org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
  at 
org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
  at org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
  at 
org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
  at 
org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
{code}

The same query seems to run alright on Spark 3.5.x:
{code}
+----+-----+
| num| subs|
+----+-----+
|   1|    a|
|   2|  a_a|
|   3|a_a_a|
|NULL| NULL|
+----+-----+

{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48961) Make the parameter naming of PySparkException consistent with JVM

2024-07-23 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-48961.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47436
[https://github.com/apache/spark/pull/47436]

> Make the parameter naming of PySparkException consistent with JVM
> -
>
> Key: SPARK-48961
> URL: https://issues.apache.org/jira/browse/SPARK-48961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Parameter naming of PySparkException <> SparkException is different, so there 
> are inconsistencies when searching error logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48961) Make the parameter naming of PySparkException consistent with JVM

2024-07-23 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee reassigned SPARK-48961:
---

Assignee: Haejoon Lee

> Make the parameter naming of PySparkException consistent with JVM
> -
>
> Key: SPARK-48961
> URL: https://issues.apache.org/jira/browse/SPARK-48961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Parameter naming of PySparkException <> SparkException is different, so there 
> are inconsistencies when searching error logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48931) Reduce Cloud Store List API cost for state store maintenance task

2024-07-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48931.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47393
[https://github.com/apache/spark/pull/47393]

> Reduce Cloud Store List API cost for state store maintenance task
> -
>
> Key: SPARK-48931
> URL: https://issues.apache.org/jira/browse/SPARK-48931
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Riya Verma
>Assignee: Riya Verma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, during the state store maintenance process, we find which old 
> version files of the RocksDB state store to delete by listing all existing 
> snapshotted version files in the checkpoint directory every 1 minute by 
> default. The frequent list calls in the cloud can result in high costs. To 
> address this concern and reduce the cost associated with state store 
> maintenance, we should aim to minimize the frequency of listing object stores 
> inside the maintenance task. To minimize the frequency, we will try to 
> accumulate versions to delete and only call list when the number of versions 
> to delete reaches a configured threshold. 
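As a rough illustration of the accumulate-then-list approach described above, a maintenance task could track how many versions are eligible for deletion and only issue the expensive LIST call once a threshold is crossed. This is only a sketch; the class, method, and threshold names below are illustrative and are not the actual Spark internals or configuration keys.
{code:scala}
// Sketch only: names are illustrative, not the actual Spark internals.
class SnapshotCleaner(minVersionsToDelete: Long) {

  // Returns the new "last cleaned" version after one maintenance run.
  def maybeCleanup(currentVersion: Long, lastCleanedVersion: Long): Long = {
    val versionsToDelete = currentVersion - lastCleanedVersion
    if (versionsToDelete < minVersionsToDelete) {
      // Not enough stale versions yet: skip the expensive cloud LIST call.
      lastCleanedVersion
    } else {
      // Single LIST call, then delete everything older than the current version.
      listSnapshotVersions().filter(_ < currentVersion).foreach(deleteVersionFiles)
      currentVersion
    }
  }

  // Placeholders standing in for the real checkpoint-directory operations.
  private def listSnapshotVersions(): Seq[Long] = Seq.empty
  private def deleteVersionFiles(version: Long): Unit = ()
}
{code}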



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48931) Reduce Cloud Store List API cost for state store maintenance task

2024-07-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-48931:


Assignee: Riya Verma

> Reduce Cloud Store List API cost for state store maintenance task
> -
>
> Key: SPARK-48931
> URL: https://issues.apache.org/jira/browse/SPARK-48931
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Riya Verma
>Assignee: Riya Verma
>Priority: Major
>  Labels: pull-request-available
>
> Currently, during the state store maintenance process, we find which old 
> version files of the RocksDB state store to delete by listing all existing 
> snapshotted version files in the checkpoint directory every 1 minute by 
> default. The frequent list calls in the cloud can result in high costs. To 
> address this concern and reduce the cost associated with state store 
> maintenance, we should aim to minimize the frequency of listing object stores 
> inside the maintenance task. To minimize the frequency, we will try to 
> accumulate versions to delete and only call list when the number of versions 
> to delete reaches a configured threshold. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48988) Make DefaultParamsReader/Writer handle metadata with spark session

2024-07-23 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48988:
-

 Summary: Make DefaultParamsReader/Writer handle metadata with 
spark session
 Key: SPARK-48988
 URL: https://issues.apache.org/jira/browse/SPARK-48988
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`

2024-07-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48975.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47459
[https://github.com/apache/spark/pull/47459]

> Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
> ---
>
> Key: SPARK-48975
> URL: https://issues.apache.org/jira/browse/SPARK-48975
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`

2024-07-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48975:
-

Assignee: Yang Jie

> Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
> ---
>
> Key: SPARK-48975
> URL: https://issues.apache.org/jira/browse/SPARK-48975
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48976) Improve the docs related to `variable`

2024-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48976.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47460
[https://github.com/apache/spark/pull/47460]

> Improve the docs related to `variable`
> --
>
> Key: SPARK-48976
> URL: https://issues.apache.org/jira/browse/SPARK-48976
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48981) Fix pyspark simpleString method for collations

2024-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48981.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47463
[https://github.com/apache/spark/pull/47463]

> Fix pyspark simpleString method for collations
> --
>
> Key: SPARK-48981
> URL: https://issues.apache.org/jira/browse/SPARK-48981
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`

2024-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48987:


Assignee: BingKun Pan

> Make `curl` retry 3 times in `bin/mvn`
> --
>
> Key: SPARK-48987
> URL: https://issues.apache.org/jira/browse/SPARK-48987
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`

2024-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48987.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47465
[https://github.com/apache/spark/pull/47465]

> Make `curl` retry 3 times in `bin/mvn`
> --
>
> Key: SPARK-48987
> URL: https://issues.apache.org/jira/browse/SPARK-48987
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48986) Introduce a ColumnNode API

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48986:
---
Labels: pull-request-available  (was: )

> Introduce a ColumnNode API
> --
>
> Key: SPARK-48986
> URL: https://issues.apache.org/jira/browse/SPARK-48986
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: pull-request-available
>
> Introduce an intermediate representation (IR) for Column operations. This 
> will allow us to share the Column API between the classic and connect Scala 
> API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48987:
---
Labels: pull-request-available  (was: )

> Make `curl` retry 3 times in `bin/mvn`
> --
>
> Key: SPARK-48987
> URL: https://issues.apache.org/jira/browse/SPARK-48987
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48987) Make `curl` retry 3 times in `bin/mvn`

2024-07-23 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48987:
---

 Summary: Make `curl` retry 3 times in `bin/mvn`
 Key: SPARK-48987
 URL: https://issues.apache.org/jira/browse/SPARK-48987
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0

2024-07-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45201.
---
Fix Version/s: 3.5.2
   Resolution: Fixed

> NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
> 
>
> Key: SPARK-45201
> URL: https://issues.apache.org/jira/browse/SPARK-45201
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Sebastian Daberdaku
>Priority: Major
> Fix For: 3.5.2
>
> Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch
>
>
> I am trying to compile Spark 3.5.0 and make a distribution that supports 
> Spark Connect and Kubernetes. The compilation seems to complete correctly, 
> but when I try to run the Spark Connect server on kubernetes I get a 
> "NoClassDefFoundError" as follows:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
>     at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
>     at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536)
>     at org.apache.spark.SparkContext.(SparkContext.scal

[jira] [Commented] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0

2024-07-23 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868221#comment-17868221
 ] 

XiDuo You commented on SPARK-45201:
---

This issue has been fixed by https://github.com/apache/spark/pull/45775

> NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
> 
>
> Key: SPARK-45201
> URL: https://issues.apache.org/jira/browse/SPARK-45201
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Sebastian Daberdaku
>Priority: Major
> Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch
>
>
> I am trying to compile Spark 3.5.0 and make a distribution that supports 
> Spark Connect and Kubernetes. The compilation seems to complete correctly, 
> but when I try to run the Spark Connect server on kubernetes I get a 
> "NoClassDefFoundError" as follows:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
>     at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
>     at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536)
>     at org.apa

[jira] [Resolved] (SPARK-48414) Fix breaking change in python's `fromJson`

2024-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48414.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46737
[https://github.com/apache/spark/pull/46737]

> Fix breaking change in python's `fromJson`
> --
>
> Key: SPARK-48414
> URL: https://issues.apache.org/jira/browse/SPARK-48414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48985) Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48985:
---
Labels: pull-request-available  (was: )

> Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala
> -
>
> Key: SPARK-48985
> URL: https://issues.apache.org/jira/browse/SPARK-48985
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: pull-request-available
>
> There are a number of hard coded expressions in the SparkConnectPlanner. Most 
> of these expressions are hardcoded because they are missing a proper 
> constructor, or because they are not registered in the FunctionRegistry. 
> functions.scala has a similar problem. We should try to remove these 
> exceptions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48985) Remove (most) hard coded expressions from SparkConnectPlanner/functions.scala

2024-07-23 Thread Jira
Herman van Hövell created SPARK-48985:
-

 Summary: Remove (most) hard coded expressions from 
SparkConnectPlanner/functions.scala
 Key: SPARK-48985
 URL: https://issues.apache.org/jira/browse/SPARK-48985
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SQL
Affects Versions: 4.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell


There are a number of hard coded expressions in the SparkConnectPlanner. Most 
of these expressions are hardcoded because they are missing a proper 
constructor, or because they are not registered in the FunctionRegistry. 
functions.scala has a similar problem. We should try to remove these exceptions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48984) Add Controller Metrics System and Utils

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48984:
---
Labels: pull-request-available  (was: )

> Add Controller Metrics System and Utils
> ---
>
> Key: SPARK-48984
> URL: https://issues.apache.org/jira/browse/SPARK-48984
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48984) Add Controller Metrics System and Utils

2024-07-23 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-48984:
--

 Summary: Add Controller Metrics System and Utils
 Key: SPARK-48984
 URL: https://issues.apache.org/jira/browse/SPARK-48984
 Project: Spark
  Issue Type: Sub-task
  Components: k8s
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48935) Make `checkEvaluation` directly check the `Collation` expression itself in UT

2024-07-23 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-48935:

Summary: Make `checkEvaluation` directly check the `Collation` expression 
itself in UT   (was: Restrictions on`collatinId` should be added to the 
constructor of `StringType`)

> Make `checkEvaluation` directly check the `Collation` expression itself in UT 
> --
>
> Key: SPARK-48935
> URL: https://issues.apache.org/jira/browse/SPARK-48935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48982) [GO] Extract Spark Exceptions from GRPC response

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48982:
---
Labels: pull-request-available  (was: )

> [GO] Extract Spark Exceptions from GRPC response
> 
>
> Key: SPARK-48982
> URL: https://issues.apache.org/jira/browse/SPARK-48982
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.1
>Reporter: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48982) [GO] Extract Spark Exceptions from GRPC response

2024-07-23 Thread Martin Grund (Jira)
Martin Grund created SPARK-48982:


 Summary: [GO] Extract Spark Exceptions from GRPC response
 Key: SPARK-48982
 URL: https://issues.apache.org/jira/browse/SPARK-48982
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.1
Reporter: Martin Grund
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48967) Improve performance and memory footprint of "INSERT INTO ... VALUES" Statements

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48967:
---
Labels: pull-request-available  (was: )

> Improve performance and memory footprint of "INSERT INTO ... VALUES" 
> Statements
> ---
>
> Key: SPARK-48967
> URL: https://issues.apache.org/jira/browse/SPARK-48967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.4
>Reporter: Costas Zarifis
>Priority: Major
>  Labels: pull-request-available
>
> Currently, very large "INSERT INTO ... VALUES" statements result in 
> disproportionately large parse trees, as each literal needs to remain in 
> the parse tree until it eventually gets evaluated into a LocalTable once 
> the appropriate analyzer/optimizer rules have been applied.
>  
> This results in increased memory pressure on the driver when such large 
> statements are generated, which can lead to OOMs and GC pauses. It also 
> results in suboptimal runtime performance, as the time it takes to apply 
> analyzer/optimizer rules is typically proportional to the size of the parse 
> tree.
>  
> Both of these issues can be resolved by eagerly applying the functions that 
> evaluate the unresolved table into a local table from the AST builder, thus 
> short-circuiting the evaluation of such statements.
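For context, a small illustration of the shape of statement that stresses the parse tree; the table name and row count below are arbitrary, and this is not a benchmark. It assumes an active `spark` session.
{code:scala}
// Illustration only: a very wide INSERT ... VALUES statement whose parse tree
// carries one literal node per value until it is collapsed into a LocalTable.
val numRows = 10000  // arbitrary size, for illustration
val values = (1 to numRows).map(i => s"($i, 'row_$i')").mkString(", ")

spark.sql("CREATE TABLE t (id INT, name STRING) USING parquet")
spark.sql(s"INSERT INTO t VALUES $values")  // roughly 2 * numRows literal nodes in the tree
{code}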



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48981) Fix pyspark simpleString method for collations

2024-07-23 Thread Stefan Kandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Kandic updated SPARK-48981:
--
Summary: Fix pyspark simpleString method for collations  (was: Fix pyspark 
simpleString method)

> Fix pyspark simpleString method for collations
> --
>
> Key: SPARK-48981
> URL: https://issues.apache.org/jira/browse/SPARK-48981
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48928) Log Warning for Calling .unpersist() on Locally Checkpointed RDDs

2024-07-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-48928.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47391
[https://github.com/apache/spark/pull/47391]

> Log Warning for Calling .unpersist() on Locally Checkpointed RDDs
> -
>
> Key: SPARK-48928
> URL: https://issues.apache.org/jira/browse/SPARK-48928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Mingkang Li
>Assignee: Mingkang Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> *Summary:* 
> This change proposes to log a warning message when the {{.unpersist()}} 
> method is called on RDDs that have been locally checkpointed in Apache Spark. 
> This aims to inform users about the potential risks of unpersisting such RDDs 
> without altering the existing behavior of the method.
> *Background:*
>  Local checkpointing in Spark truncates the lineage of an RDD, meaning that 
> the RDD cannot be recomputed from its source. If an RDD that has been locally 
> checkpointed is unpersisted, it loses its data and cannot be regenerated. 
> This can lead to job failures if subsequent actions or transformations are 
> attempted on the unpersisted RDD.
> *Proposed Change:* 
> To mitigate this issue, a warning message will be logged whenever 
> {{.unpersist()}} is called on a locally checkpointed RDD. This approach 
> maintains the current functionality while alerting users to the potential 
> consequences of their actions. This change is intended to be non-disruptive 
> and is a step towards better user awareness and debugging.
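A minimal sketch of the scenario the proposed warning targets, assuming an active SparkContext `sc`:
{code:scala}
// Sketch of the scenario the proposed warning targets (assumes an active SparkContext `sc`).
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.localCheckpoint()   // truncates the lineage; the data now lives only in block storage
rdd.count()             // materializes the local checkpoint

rdd.unpersist()         // frees the only copy of the data; the warning would be logged here
// Any further action on `rdd` may now fail, because the truncated lineage
// cannot be used to recompute the lost blocks:
// rdd.count()
{code}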



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48347) [M0] Support for WHILE statement

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48347:
---
Labels: pull-request-available  (was: )

> [M0] Support for WHILE statement
> 
>
> Key: SPARK-48347
> URL: https://issues.apache.org/jira/browse/SPARK-48347
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Add support for WHILE statements to SQL scripting parser & interpreter.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48925) Introduce interface to ensure extra strategy do planning of scan plan (with additional filters and projections)

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48925:
---
Labels: pull-request-available  (was: )

> Introduce interface to ensure extra strategy do planning of scan plan (with 
> additional filters and projections)
> ---
>
> Key: SPARK-48925
> URL: https://issues.apache.org/jira/browse/SPARK-48925
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
>
> If a plan contains a scan with a filter (or project) as its parent, we may 
> want the scan to be planned by an extra strategy instead of 
> DataSourceV2Strategy.
> One use case:
> The Snowflake and BigQuery connectors have their own strategies, and we want 
> to prohibit DataSourceV2Strategy from planning the scan node.
> Even though extra strategies have priority, it can happen that such a 
> strategy fails to plan 
> Filter->Relation and can only plan the Relation without its parent. In that 
> case, DataSourceV2Strategy will jump in and will be able to plan 
> Filter->Relation in just one pass.
> So we want the ability to prevent that for certain Scan classes.
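For context, extra strategies are registered through the session's experimental methods and are consulted before the built-in strategies. The sketch below only shows that registration mechanism; `MyConnectorStrategy` is a made-up placeholder, and a real connector would match its own scan/relation nodes instead of returning Nil.
{code:scala}
import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Placeholder strategy; a real connector would match its own scan/relation
// nodes here and return a physical plan. Returning Nil lets other strategies
// (including DataSourceV2Strategy) try to plan the node instead.
object MyConnectorStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case _ => Nil
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraStrategies = Seq(MyConnectorStrategy)
{code}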



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48980) Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48980:
---
Labels: pull-request-available  (was: )

> Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion`
> ---
>
> Key: SPARK-48980
> URL: https://issues.apache.org/jira/browse/SPARK-48980
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48980) Avoid per-row param read in `LSH/DCT/NGram/PolynomialExpansion`

2024-07-23 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48980:
-

 Summary: Avoid per-row param read in 
`LSH/DCT/NGram/PolynomialExpansion`
 Key: SPARK-48980
 URL: https://issues.apache.org/jira/browse/SPARK-48980
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48979) CONV function behaves inconsistently

2024-07-23 Thread Dylan He (Jira)
Dylan He created SPARK-48979:


 Summary: CONV function behaves inconsistently
 Key: SPARK-48979
 URL: https://issues.apache.org/jira/browse/SPARK-48979
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1
Reporter: Dylan He


I'm currently working on the CONV function, and I found something confusing about 
the implementation in Spark.
All code below is from NumberConverter.scala.
h3. Negative and signed situation
{code:sql}
spark-sql (default)> select conv('FFFE', 16, -16);
-2
spark-sql (default)> select conv('-FFFE', 16, -16);
-2
{code}
Ideally, these two queries should yield different results, but they both return 
-2.
{code:java}
if (toBase < 0 && v < 0) {
  v = -v
  negative = true
}
{code}
According to the code above, when toBase < 0 and v < 0, the negative sign is set to 
true regardless of the original value. This leads to incorrect results, as in 
the examples above, because the negative sign is ignored in the second case. A 
potential adjustment is negative = !negative, which would correctly interpret 
the double negation and yield 2.
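A standalone sketch of that adjustment follows. This is not the actual NumberConverter code; `parsedValue` and `inputHadMinusSign` are made-up names used only to isolate the sign logic.
{code:scala}
// Standalone sketch of the suggested adjustment, not the actual NumberConverter code.
// `negative` starts out as the sign parsed from the input string; flipping it
// instead of forcing it to true lets conv('-FFFE', 16, -16) come out positive.
def resultIsNegative(parsedValue: Long, inputHadMinusSign: Boolean, toBase: Int): Boolean = {
  var v = if (inputHadMinusSign) -parsedValue else parsedValue
  var negative = inputHadMinusSign
  if (toBase < 0 && v < 0) {
    v = -v
    negative = !negative  // suggested fix; the current code sets `negative = true`
  }
  negative
}
{code}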
h3. ANSI mode
{code:java}
if (negative && toBase > 0) {
  if (v < 0) {
v = -1
  } else {
v = -v
  }
}
{code}
Here, -1 is used to indicate an overflow condition, but no exception is thrown 
when ANSI mode is enabled, unlike the overflow handling in the encode 
method.
h3. Overflow check
{code:java}
val bound = java.lang.Long.divideUnsigned(-1 - radix, radix)
if (v >= bound) {...}
{code}
The inclusion of the equality in the overflow check seems unnecessary.

 

I am still learning Spark functions. Please feel free to point out any 
mistakes I might have made. Some of these questions are also mentioned in 
[SPARK-44943|https://issues.apache.org/jira/browse/SPARK-44943].





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48970) Avoid using SparkSession.getActiveSession in spark ML reader/writer

2024-07-23 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-48970.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47453
[https://github.com/apache/spark/pull/47453]

> Avoid using SparkSession.getActiveSession in spark ML reader/writer
> ---
>
> Key: SPARK-48970
> URL: https://issues.apache.org/jira/browse/SPARK-48970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 4.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> `SparkSession.getActiveSession` is a thread-local session, but the Spark ML 
> reader/writer might be executed in different threads, which causes 
> `SparkSession.getActiveSession` to return None.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48978) Optimize collation support for ASCII strings (all collations)

2024-07-23 Thread Jira
Uroš Bojanić created SPARK-48978:


 Summary: Optimize collation support for ASCII strings (all 
collations)
 Key: SPARK-48978
 URL: https://issues.apache.org/jira/browse/SPARK-48978
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48977) Optimize collation support for string search (UTF8_LCASE collation)

2024-07-23 Thread Jira
Uroš Bojanić created SPARK-48977:


 Summary: Optimize collation support for string search (UTF8_LCASE 
collation)
 Key: SPARK-48977
 URL: https://issues.apache.org/jira/browse/SPARK-48977
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47988) When the collationId is invalid, throw `COLLATION_INVALID_ID`

2024-07-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić resolved SPARK-47988.
--
Resolution: Won't Fix

> When the collationId is invalid, throw `COLLATION_INVALID_ID`
> -
>
> Key: SPARK-47988
> URL: https://issues.apache.org/jira/browse/SPARK-47988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47988) When the collationId is invalid, throw `COLLATION_INVALID_ID`

2024-07-23 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-47988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868008#comment-17868008
 ] 

Uroš Bojanić commented on SPARK-47988:
--

Closing this ticket for now, given that it's no longer relevant after the 
recent CollationFactory rewrite.

> When the collationId is invalid, throw `COLLATION_INVALID_ID`
> -
>
> Key: SPARK-47988
> URL: https://issues.apache.org/jira/browse/SPARK-47988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48338:
--

Assignee: (was: Apache Spark)

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48338:
--

Assignee: Apache Spark

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High level example of Sql Script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48976:
--

Assignee: (was: Apache Spark)

> Improve the docs related to `variable`
> --
>
> Key: SPARK-48976
> URL: https://issues.apache.org/jira/browse/SPARK-48976
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48976:
--

Assignee: Apache Spark

> Improve the docs related to `variable`
> --
>
> Key: SPARK-48976
> URL: https://issues.apache.org/jira/browse/SPARK-48976
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48910) Slow linear searches in PreprocessTableCreation

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48910:
--

Assignee: (was: Apache Spark)

> Slow linear searches in PreprocessTableCreation
> ---
>
> Key: SPARK-48910
> URL: https://issues.apache.org/jira/browse/SPARK-48910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
>
> PreprocessTableCreation does Seq.contains over partition columns, which 
> becomes very slow in case of 1000s of partitions
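A generic illustration of the pattern (not the actual PreprocessTableCreation code): repeated `Seq.contains` is a linear scan per lookup, so with thousands of partition columns it pays to build a set once and do constant-time checks.
{code:scala}
// Generic illustration, not the actual PreprocessTableCreation code.
val partitionColumns: Seq[String] = (1 to 5000).map(i => s"p$i")
val schemaColumns: Seq[String] = (1 to 5000).map(i => s"c$i") ++ partitionColumns

// Slow: one linear scan of `partitionColumns` per schema column (O(n * m)).
val slow = schemaColumns.filter(partitionColumns.contains)

// Faster: build the lookup structure once, then do constant-time checks.
val partitionSet = partitionColumns.toSet
val fast = schemaColumns.filter(partitionSet.contains)

assert(slow == fast)
{code}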



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48976:
--

Assignee: (was: Apache Spark)

> Improve the docs related to `variable`
> --
>
> Key: SPARK-48976
> URL: https://issues.apache.org/jira/browse/SPARK-48976
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48910) Slow linear searches in PreprocessTableCreation

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48910:
--

Assignee: Apache Spark

> Slow linear searches in PreprocessTableCreation
> ---
>
> Key: SPARK-48910
> URL: https://issues.apache.org/jira/browse/SPARK-48910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> PreprocessTableCreation does Seq.contains over partition columns, which 
> becomes very slow in case of 1000s of partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48976) Improve the docs related to `variable`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48976:
--

Assignee: (was: Apache Spark)

> Improve the docs related to `variable`
> --
>
> Key: SPARK-48976
> URL: https://issues.apache.org/jira/browse/SPARK-48976
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48761:
--

Assignee: (was: Apache Spark)

> Add clusterBy DataFrameWriter API for Scala
> ---
>
> Key: SPARK-48761
> URL: https://issues.apache.org/jira/browse/SPARK-48761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaheng Tang
>Priority: Major
>  Labels: pull-request-available
>
> Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to 
> interact with clustered tables using DataFrameWriter API.
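A hypothetical usage sketch of how such an API might look once it lands; the exact method signature is defined by the linked PR, and the table and column names here are invented.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch; compiles only once the proposed clusterBy writer API exists.
val spark = SparkSession.builder().master("local[*]").appName("clusterBy-sketch").getOrCreate()
val df = spark.range(100).selectExpr("id", "id % 10 AS region", "current_date() AS event_date")

df.write
  .clusterBy("region", "event_date") // proposed DataFrameWriter method from this ticket
  .saveAsTable("events_clustered")   // invented table name
{code}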



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48761) Add clusterBy DataFrameWriter API for Scala

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48761:
--

Assignee: Apache Spark

> Add clusterBy DataFrameWriter API for Scala
> ---
>
> Key: SPARK-48761
> URL: https://issues.apache.org/jira/browse/SPARK-48761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaheng Tang
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Add a new `clusterBy` DataFrameWriter API for Scala. This allows users to 
> interact with clustered tables using DataFrameWriter API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48975:
--

Assignee: (was: Apache Spark)

> Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
> ---
>
> Key: SPARK-48975
> URL: https://issues.apache.org/jira/browse/SPARK-48975
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48975:
--

Assignee: Apache Spark

> Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
> ---
>
> Key: SPARK-48975
> URL: https://issues.apache.org/jira/browse/SPARK-48975
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48976) Improve the docs related to `variable`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48976:
---
Labels: pull-request-available  (was: )

> Improve the docs related to `variable`
> --
>
> Key: SPARK-48976
> URL: https://issues.apache.org/jira/browse/SPARK-48976
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48975:
---
Labels: pull-request-available  (was: )

> Remove unnecessary `ScalaReflectionLock` definition from `protobuf`
> ---
>
> Key: SPARK-48975
> URL: https://issues.apache.org/jira/browse/SPARK-48975
> Project: Spark
>  Issue Type: Improvement
>  Components: Protobuf
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48975) Remove unnecessary `ScalaReflectionLock` definition from `protobuf`

2024-07-23 Thread Yang Jie (Jira)
Yang Jie created SPARK-48975:


 Summary: Remove unnecessary `ScalaReflectionLock` definition from 
`protobuf`
 Key: SPARK-48975
 URL: https://issues.apache.org/jira/browse/SPARK-48975
 Project: Spark
  Issue Type: Improvement
  Components: Protobuf
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48957) Return classified store load exception type on load failure

2024-07-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48957.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47431
[https://github.com/apache/spark/pull/47431]

> Return classified store load exception type on load failure
> ---
>
> Key: SPARK-48957
> URL: https://issues.apache.org/jira/browse/SPARK-48957
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Return classified store load exception type on load failure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48973:

Description: 
In Spark, applying the mask function to a string that contains an invalid or 
wide character causes unexpected behavior.

Example: using `*` to mask a string that contains the wide character {{}}
{code:sql}
select mask("", "Y", "y", "n", "*");
{code}
The result is `**` instead of `*`. It looks like mask treats {{}} as 2 
characters.

Example: using the wide character {{}} as the mask character produces garbled 
output
{code:sql}
select mask("ABC", "");
{code}
The result is `???`.

Example: masking a string that contains an invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
The result is `xXX` instead of `\xED`; it looks like Spark treats it as the four 
characters `\`, `x`, `E`, `D`.

It looks like mask can only handle BMP characters (16 bits) and cannot 
guarantee the result for invalid UTF-8 characters or wide characters.

My question is: *is this a limitation / issue of the mask function, or is mask 
by design only meant to handle BMP characters?*

If it is a limitation, could Spark note it in the mask function documentation 
or comments?

 

  was:
In the spark the mask function when apply with a string contains invalid 
character or wide character would cause unexpected behavior.

Example to use `*` mask a stirng contains wide-character {{}}
{code:sql}
select mask("", "Y", "y", "n", "*");
{code}
could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem
{code:sql}
select mask("ABC", "");
{code}
result is `???`.

Example to mask a string contains a invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.

My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In Spark, applying the mask function to a string that contains an invalid or 
> wide character causes unexpected behavior.
> Example: using `*` to mask a string that contains the wide character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> The result is `**` instead of `*`. It looks like mask treats {{}} as 2 
> characters.
> Example: using the wide character {{}} as the mask character produces garbled 
> output
> {code:sql}
> select mask("ABC", "");
> {code}
> The result is `???`.
> Example: masking a string that contains an invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> The result is `xXX` instead of `\xED`; it looks like Spark treats it as the 
> four characters `\`, `x`, `E`, `D`.
> It looks like mask can only handle BMP characters (16 bits) and cannot 
> guarantee the result for invalid UTF-8 characters or wide characters.
> My question is: *is this a limitation / issue of the mask function, or is 
> mask by design only meant to handle BMP characters?*
> If it is a limitation, could Spark note it in the mask function documentation 
> or comments?
>  
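One likely explanation for the `**` result, assuming the character lost from this archive was a supplementary-plane code point such as an emoji: JVM Strings are UTF-16, so such a character occupies two 16-bit code units (a surrogate pair), which is consistent with mask emitting one replacement per code unit. A minimal, Spark-independent sketch:

{code:scala}
// Minimal sketch: a character outside the BMP is stored as a surrogate pair,
// i.e. two UTF-16 code units, even though it is a single code point.
object WideCharSketch {
  def main(args: Array[String]): Unit = {
    val bmp  = "A"            // U+0041, inside the BMP
    val wide = "\uD83D\uDE00" // U+1F600 (grinning face), outside the BMP

    println(bmp.length)                          // 1
    println(wide.length)                         // 2 -> two UTF-16 code units
    println(wide.codePointCount(0, wide.length)) // 1 -> a single code point
  }
}
{code}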



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48973:

Description: 
In the spark the mask function when apply with a string contains invalid 
character or wide character would cause unexpected behavior.

Example to use `*` mask a stirng contains wide-character {{}}
{code:sql}
select mask("", "Y", "y", "n", "*");
{code}
could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem
{code:sql}
select mask("ABC", "");
{code}
result is `???`.

Example to mask a string contains a invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.

My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a string contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
> characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48964) Fix the discrepancy between implementation, comment and documentation of option recursive.fields.max.depth in ProtoBuf connector

2024-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48964:
---
Labels: pull-request-available  (was: )

> Fix the discrepancy between implementation, comment and documentation of 
> option recursive.fields.max.depth in ProtoBuf connector
> 
>
> Key: SPARK-48964
> URL: https://issues.apache.org/jira/browse/SPARK-48964
> Project: Spark
>  Issue Type: Documentation
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2, 3.5.3
>Reporter: Yuchen Liu
>Priority: Major
>  Labels: pull-request-available
>
> After the three PRs ([https://github.com/apache/spark/pull/38922,] 
> [https://github.com/apache/spark/pull/40011,] 
> [https://github.com/apache/spark/pull/40141]) working on the same option, 
> there are some legacy comments and documentation that has not been updated to 
> the latest implementation. This task should consolidate them. Below is the 
> correct description of the behavior.
> The `recursive.fields.max.depth` parameter can be specified in the 
> from_protobuf options to control the maximum allowed recursion depth for a 
> field. Setting `recursive.fields.max.depth` to 1 drops all recursive fields, 
> setting it to 2 allows a field to be recursed once, and setting it to 3 allows 
> it to be recursed twice. Setting `recursive.fields.max.depth` to a value 
> greater than 10 is not allowed. If `recursive.fields.max.depth` is set to a 
> value smaller than 1, recursive fields are not permitted. The default value of 
> the option is -1. If a protobuf record has more depth for recursive fields 
> than the allowed value, it will be truncated and some fields 
> may be discarded. This check is based on the fully qualified field type. SQL 
> Schema for the protobuf message
> {code:java}
> message Person { string name = 1; Person bff = 2 }{code}
> will vary based on the value of `recursive.fields.max.depth`.
> {code:java}
> 1: struct<name: string>
> 2: struct<name: string, bff: struct<name: string>>
> 3: struct<name: string, bff: struct<name: string, bff: struct<name: string>>> ...
> {code}
>  
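A minimal sketch of passing this option, assuming the from_protobuf overload in the spark-protobuf module that takes a descriptor-file path and an options map; the descriptor path, message name, source path, and column names below are placeholders.

{code:scala}
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Sketch only: paths and names are placeholders.
val spark = SparkSession.builder().master("local[*]").appName("protobuf-depth-sketch").getOrCreate()
val input = spark.read.format("binaryFile").load("/tmp/person-records") // placeholder source

val options = Map("recursive.fields.max.depth" -> "2").asJava

// With depth 2, Person.bff is kept one level deep:
// struct<name: string, bff: struct<name: string>>
val parsed = input.select(
  from_protobuf(input("content"), "Person", "/tmp/person.desc", options).alias("person")
)
{code}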



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48972) Unify the literal string handling

2024-07-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48972.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47454
[https://github.com/apache/spark/pull/47454]

> Unify the literal string handling
> -
>
> Key: SPARK-48972
> URL: https://issues.apache.org/jira/browse/SPARK-48972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48972) Unify the literal string handling

2024-07-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48972:


Assignee: Ruifeng Zheng

> Unify the literal string handling
> -
>
> Key: SPARK-48972
> URL: https://issues.apache.org/jira/browse/SPARK-48972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is {code:java}**{code}
 instead of `*`. Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is `**` instead of `*`. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result is {code:java}**{code}
>  instead of `*`. Looks spark mask treat {{}} as 2 characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is {{**}} instead of {{*}}. Looks spark mask treat {{}} as 
2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result `**` instead of `*`. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result is {{**}} instead of {{*}}. Looks spark mask treat {{}} 
> as 2 characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is `**` instead of `*`. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is {{**}} instead of {{*}}. Looks spark mask treat {{}} as 
2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result is `**` instead of `*`. Looks spark mask treat {{}} as 2 
> characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is {code:java}**{code}
 instead of `*`. Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
> characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result `**` instead of `*`. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result `**` instead of `*`. Looks spark mask treat {{}} as 2 
> characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}

```sql
select mask("", "Y", "y", "n", "*");
```

could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem

```sql
select mask("ABC", "");
```
result is `???`.

Example to mask a string contains a invalid UTF-8 character

```sql
select mask("\xED");
```

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> could cause result `**` instead of `*`. 
> Looks spark mask treat {{}} as 2 characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}

```sql
select mask("", "Y", "y", "n", "*");
```

could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem

```sql
select mask("ABC", "");
```
result is `???`.

Example to mask a string contains a invalid UTF-8 character

```sql
select mask("\xED");
```

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}

```sql
select mask("", "Y", "y", "n", "*");
```

could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem

```sql
select mask("ABC", "");
```
result is `???`.

Example to mask a string contains a invalid UTF-8 character

```sql
select mask("\xED");
```

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> ```sql
> select mask("", "Y", "y", "n", "*");
> ```
> could cause result `**` instead of `*`. 
> Looks spark mask treat {{}} as 2 characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> ```sql
> select mask("ABC", "");
> ```
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> ```sql
> select mask("\xED");
> ```
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Yangyang Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
-
Description: 
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}

```sql
select mask("", "Y", "y", "n", "*");
```

could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem

```sql
select mask("ABC", "");
```
result is `???`.

Example to mask a string contains a invalid UTF-8 character

```sql
select mask("\xED");
```

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}

```
select mask("", "Y", "y", "n", "*");
```

could cause result `**` instead of `*`. 
Looks spark mask treat {{}} as 2 characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem

```
select mask("ABC", "");
```
result is `???`.

Example to mask a string contains a invalid UTF-8 character

```
 select mask("\xED");
```

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> ----
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a stirng contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{}}
> ```sql
> select mask("", "Y", "y", "n", "*");
> ```
> could cause result `**` instead of `*`. 
> Looks spark mask treat {{}} as 2 characters.
> Example to use wide-character {{}} do mask would cause wrong garbled code 
> problem
> ```sql
> select mask("ABC", "");
> ```
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> ```sql
> select mask("\xED");
> ```
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


