[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614621#comment-17614621
 ] 

Apache Spark commented on SPARK-9213:
-

User 'lyy-pineapple' has created a pull request for this issue:
https://github.com/apache/spark/pull/38171

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
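
As a rough illustration of the second point (not joni itself, and not Spark code), the sketch below uses Python's standard re module, which accepts bytes patterns, to contrast decoding to a string before matching with matching on the UTF-8 bytes directly. The sample input, pattern, and function names are made up for the example.

```python
import re
from typing import Optional

# The same pattern compiled twice: once for str input, once for bytes input.
ID_PATTERN_STR = re.compile(r"id=(\d+)")
ID_PATTERN_BYTES = re.compile(rb"id=(\d+)")


def extract_via_string(data: bytes) -> Optional[bytes]:
    """Decode UTF-8 bytes to str, match, then encode the result back."""
    text = data.decode("utf-8")          # extra copy: bytes -> str
    match = ID_PATTERN_STR.search(text)
    return match.group(1).encode("utf-8") if match else None  # str -> bytes


def extract_on_bytes(data: bytes) -> Optional[bytes]:
    """Match directly on the UTF-8 bytes, with no conversion in either direction."""
    match = ID_PATTERN_BYTES.search(data)
    return match.group(1) if match else None


# Hypothetical column value stored as UTF-8 encoded bytes.
raw = "id=42;name=spark".encode("utf-8")
```

Both functions return the same bytes for this input; the second simply skips the two conversions that the ticket describes as overhead.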



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40663) Migrate execution errors onto error classes

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614596#comment-17614596
 ] 

Apache Spark commented on SPARK-40663:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38170

> Migrate execution errors onto error classes
> ---
>
> Key: SPARK-40663
> URL: https://issues.apache.org/jira/browse/SPARK-40663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Use temporary error classes in the execution exceptions.






[jira] [Commented] (SPARK-40663) Migrate execution errors onto error classes

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614594#comment-17614594
 ] 

Apache Spark commented on SPARK-40663:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38169

> Migrate execution errors onto error classes
> ---
>
> Key: SPARK-40663
> URL: https://issues.apache.org/jira/browse/SPARK-40663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Use temporary error classes in the execution exceptions.






[jira] [Commented] (SPARK-40663) Migrate execution errors onto error classes

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614593#comment-17614593
 ] 

Apache Spark commented on SPARK-40663:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/38169

> Migrate execution errors onto error classes
> ---
>
> Key: SPARK-40663
> URL: https://issues.apache.org/jira/browse/SPARK-40663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Use temporary error classes in the execution exceptions.






[jira] [Commented] (SPARK-40659) Schema evolution for protobuf (and Avro too?)

2022-10-08 Thread Sandish Kumar HN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614568#comment-17614568
 ] 

Sandish Kumar HN commented on SPARK-40659:
--

[~rangadi] It is possible to add these option settings; just an idea.
 # BACKWARD: Consumers using the latest schema can process data written by 
producers using the latest or oldest schema, e.g. adding fields or deleting 
optional fields.
 # FORWARD: Consumers using the latest or oldest schema can process data 
written by producers using the latest schema, e.g. adding fields or deleting 
optional fields.
 # FULL: Both BACKWARD and FORWARD between the oldest and latest schema.
 # The default option is FULL.
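
A minimal sketch of how such a setting might be evaluated, under a deliberately simplified model that is an assumption of this example: a schema is a mapping from field name to a "required" flag, and a reader can process a writer's data when every field the reader requires is present in the writer's schema. The function names and the model are hypothetical, not an existing Spark or schema-registry API.

```python
from typing import Dict

# Simplified schema model (assumption): field name -> True if the field is required.
Schema = Dict[str, bool]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if every field the reader requires exists in the writer's schema."""
    return all(name in writer for name, required in reader.items() if required)


def is_compatible(old: Schema, new: Schema, mode: str = "FULL") -> bool:
    """Check the transition old -> new under BACKWARD, FORWARD, or FULL."""
    backward = can_read(reader=new, writer=old)  # latest consumer, oldest data
    forward = can_read(reader=old, writer=new)   # oldest consumer, latest data
    if mode == "BACKWARD":
        return backward
    if mode == "FORWARD":
        return forward
    if mode == "FULL":
        return backward and forward
    raise ValueError(f"unknown compatibility mode: {mode}")


old_schema = {"id": True}
added_optional = {"id": True, "note": False}   # adds an optional field
added_required = {"id": True, "ts": True}      # adds a required field
```

In this model, adding an optional field passes FULL, while adding a required field passes FORWARD only, because the latest consumer would demand a field that older data does not carry.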

> Schema evolution for protobuf (and Avro too?)
> -
>
> Key: SPARK-40659
> URL: https://issues.apache.org/jira/browse/SPARK-40659
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Raghu Angadi
>Priority: Major
>
> Protobuf & Avro should support schema evolution in streaming. We need to 
> throw a specific error message when we detect a newer version of the schema 
> in the schema registry.
> A couple of options for detecting a version change at runtime:
>  * How do we detect a newer version from the schema registry? It is 
> currently contacted only during planning.
>  * We could detect the version id in incoming messages.
>  ** What if the id in the incoming message is newer than what our 
> schema-registry reports after the restart?
>  *** This indicates delayed syncs between the customer's schema-registry 
> servers (should be rare). We can keep erroring out until it is fixed.
>  *** Make sure we log the schema id used during planning.
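
One way the "detect the version id in incoming messages" option could look, as a sketch only. It assumes Confluent-style framing (a zero magic byte followed by a 4-byte big-endian schema id) and that a larger id means a newer schema; neither assumption comes from the ticket, and all names are hypothetical.

```python
import struct

MAGIC_BYTE = 0  # assumption: Confluent-style wire framing


class NewerSchemaError(RuntimeError):
    """Raised when a message was written with a schema newer than the planned one."""


def schema_id_of(message: bytes) -> int:
    """Read the schema id from the 5-byte header (magic byte + big-endian uint32)."""
    if len(message) < 5 or message[0] != MAGIC_BYTE:
        raise ValueError("message does not carry a schema id header")
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id


def check_schema_id(message: bytes, planned_schema_id: int) -> None:
    """Fail with a specific error if the message's schema id is newer than planned."""
    seen = schema_id_of(message)
    if seen > planned_schema_id:  # assumption: ids grow with newer versions
        raise NewerSchemaError(
            f"message uses schema id {seen}, "
            f"but the query was planned with schema id {planned_schema_id}"
        )
```

Including the planned id in the error text also covers the last point above: the id used during planning ends up in the logs whenever the check fails.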






[jira] [Commented] (SPARK-40658) Protobuf v2 & v3 support

2022-10-08 Thread Sandish Kumar HN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614565#comment-17614565
 ] 

Sandish Kumar HN commented on SPARK-40658:
--

[~mposdev21] These are the differences I see between proto2 and proto3:
 # The latest proto3 also supports optional fields. The difference is between 
optional fields, which have has_foo() methods, and "singular" fields, which do 
not. I don't see any different treatment needed to handle this. 
 # In contrast to proto3, proto2 allows custom default values and required 
fields.
 # Enums: proto3's default value is the enum value with index 0. proto2 uses 
the first syntactic entry in the enum declaration as the default value if it 
is not specified otherwise.
 # proto2 does not validate that inbound and outbound bytes are encoded in 
UTF-8. proto3 validates during parsing that all string fields are UTF-8 
encoded.
 # proto2 and proto3 are wire compatible: they have the same binary 
representation.

Should we have an optional setting, something like 
PROTO_VERSION_SUPPORT=V3 or V2 or ANY? The default can be ANY. 

> Protobuf v2 & v3 support
> 
>
> Key: SPARK-40658
> URL: https://issues.apache.org/jira/browse/SPARK-40658
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Raghu Angadi
>Priority: Major
>
> We want to ensure Protobuf functions support both Protobuf version 2 and 
> version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3).
>  






[jira] [Updated] (SPARK-40686) Support data masking built-in functions

2022-10-08 Thread Vinod KC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-40686:
-
Summary: Support data masking built-in functions  (was: Support data 
Masking built-in Functions)

> Support data masking built-in functions
> ---
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking functions 






[jira] [Commented] (SPARK-40706) IllegalStateException when querying array values inside a nested struct

2022-10-08 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614550#comment-17614550
 ] 

Bruce Robbins commented on SPARK-40706:
---

Same as SPARK-39854?

At the very least, the suggested workaround also worked for your case:
{noformat}
spark-sql> set spark.sql.optimizer.nestedSchemaPruning.enabled=false;
spark.sql.optimizer.nestedSchemaPruning.enabled false
Time taken: 0.224 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.optimizer.expression.nestedPruning.enabled=false;
spark.sql.optimizer.expression.nestedPruning.enabled false
Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql> SELECT 
response.message as message,
response.timestamp as timestamp,
score as risk_score,
model.value as model_type
FROM tbl
  LATERAL VIEW OUTER explode(response.data.items.attempt)   
  AS Attempt
  LATERAL VIEW OUTER explode(response.data.items.attempt.risk)  
  AS RiskModels
  LATERAL VIEW OUTER explode(RiskModels)
  AS RiskModel
  LATERAL VIEW OUTER explode(RiskModel.indicator)   
  AS Model
  LATERAL VIEW OUTER explode(RiskModel.Score)   
  AS Score;

 >  >  >  >  >  >  >
  >  >  > 
m1  09/07/2022  1   abc
m1  09/07/2022  2   abc
m1  09/07/2022  3   abc
m1  09/07/2022  1   def
m1  09/07/2022  2   def
m1  09/07/2022  3   def
Time taken: 1.213 seconds, Fetched 6 row(s)
spark-sql>  > 
{noformat}
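
The same workaround can be applied from PySpark through the runtime configuration. The helper below is a sketch: the function name is made up, and it only assumes that the session object exposes conf.set(key, value), as a SparkSession does.

```python
# The two settings switched off in the spark-sql session above.
WORKAROUND_SETTINGS = {
    "spark.sql.optimizer.nestedSchemaPruning.enabled": "false",
    "spark.sql.optimizer.expression.nestedPruning.enabled": "false",
}


def disable_nested_pruning(spark) -> None:
    """Apply the workaround settings to a SparkSession-like object."""
    for key, value in WORKAROUND_SETTINGS.items():
        spark.conf.set(key, value)
```

Calling disable_nested_pruning(spark) before spark.sql(query) should mirror the two SET statements shown in the transcript.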
 

> IllegalStateException when querying array values inside a nested struct
> ---
>
> Key: SPARK-40706
> URL: https://issues.apache.org/jira/browse/SPARK-40706
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Rohan Barman
>Priority: Major
>
> We are in the process of migrating our PySpark applications from Spark 
> version 3.1.2 to Spark version 3.2.0. 
> This bug is present in version 3.2.0. We do not see this issue in version 
> 3.1.2.
>  
> *Minimal example to reproduce bug*
> Below is a minimal example that generates hardcoded data and queries. The 
> data has several nested structs and arrays.
> Our real use case reads data from avro files and has more complex queries, 
> but this is sufficient to reproduce the error.
>  
> {code:java}
> # Generate data
> data = [
>   ('1',{
>   'timestamp': '09/07/2022',
>   'message': 'm1',
>   'data':{
> 'items': {
>   'id':1,
>   'attempt':[
> {'risk':[
>   {'score':[1,2,3]},
>   {'indicator':[
> {'code':'c1','value':'abc'},
> {'code':'c2','value':'def'}
>   ]}
> ]}
>   ]
> }
>   }
>   })
> ]
> from pyspark.sql.types import *
> schema = StructType([
> StructField('id', StringType(), True),
> StructField('response', StructType([
>   StructField('timestamp', StringType(), True),
>   StructField('message',StringType(), True),
>   StructField('data', StructType([
> StructField('items', StructType([
>   StructField('id', StringType(), True),
>   StructField("attempt", ArrayType(StructType([
> StructField("risk", ArrayType(StructType([
>   StructField('score', ArrayType(StringType()), True),
>   StructField('indicator', ArrayType(StructType([
> StructField('code', StringType(), True),
> StructField('value', StringType(), True),
>   ])))
>  ])))
>])))
> ]))
>   ]))
> ])),
>  ])
> df = spark.createDataFrame(data=data, schema=schema)
> df.printSchema()
> df.createOrReplaceTempView("tbl")
> # Execute query
> query = """
> SELECT 
> response.message as message,
> response.timestamp as timestamp,
> score as risk_score,
> model.value as model_type
> FROM tbl
>   LATERAL VIEW OUTER explode(response.data.items.attempt) 
> AS Attempt
>   LATERAL VIEW OUTER explode(response.data.items.attempt.risk)
> AS RiskModels
>   LATERAL VIEW OUTER explode(RiskModels)  
> AS RiskModel
>   LATERAL VIEW OUTER explode(RiskModel.indicator) 
> AS Model
>   LATERAL VIEW OUTER explode(RiskModel.Score) 
> AS Score
> """
> result = spark.sql(query)
> print(result.coun

[jira] [Commented] (SPARK-40713) Improve SET operation support in the proto and the server

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614540#comment-17614540
 ] 

Apache Spark commented on SPARK-40713:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38166

> Improve SET operation support in the proto and the server
> -
>
> Key: SPARK-40713
> URL: https://issues.apache.org/jira/browse/SPARK-40713
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-40713) Improve SET operation support in the proto and the server

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40713:


Assignee: (was: Apache Spark)

> Improve SET operation support in the proto and the server
> -
>
> Key: SPARK-40713
> URL: https://issues.apache.org/jira/browse/SPARK-40713
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-40713) Improve SET operation support in the proto and the server

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614539#comment-17614539
 ] 

Apache Spark commented on SPARK-40713:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38166

> Improve SET operation support in the proto and the server
> -
>
> Key: SPARK-40713
> URL: https://issues.apache.org/jira/browse/SPARK-40713
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-40713) Improve SET operation support in the proto and the server

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40713:


Assignee: Apache Spark

> Improve SET operation support in the proto and the server
> -
>
> Key: SPARK-40713
> URL: https://issues.apache.org/jira/browse/SPARK-40713
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-40713) Improve SET operation support in the proto and the server

2022-10-08 Thread Rui Wang (Jira)
Rui Wang created SPARK-40713:


 Summary: Improve SET operation support in the proto and the server
 Key: SPARK-40713
 URL: https://issues.apache.org/jira/browse/SPARK-40713
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang









[jira] [Commented] (SPARK-40691) Support data masking built-in function 'mask_show_last_n'

2022-10-08 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614508#comment-17614508
 ] 

Vinod KC commented on SPARK-40691:
--

I'm working on this sub task

> Support data masking built-in function 'mask_show_last_n'
> -
>
> Key: SPARK-40691
> URL: https://issues.apache.org/jira/browse/SPARK-40691
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support data masking built-in function '{*}mask_show_last_n{*}'
> Return a masked version of str, showing the last n characters unmasked. Upper 
> case letters should be converted to "X", lower case letters should be 
> converted to "x" and numbers should be converted to "n". For example, 
> mask_show_last_n("1234-5678-8765-4321", 4) results in nnnn-nnnn-nnnn-4321.
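
The described behaviour can be sketched in plain Python. This is an illustration of the rule as stated in the ticket, not the Spark implementation; the helper name mask_char and the choice to leave other characters (such as "-") unchanged are assumptions of the sketch.

```python
def mask_char(c: str) -> str:
    """Map upper case to "X", lower case to "x", digits to "n"; keep the rest."""
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c


def mask_show_last_n(s: str, n: int) -> str:
    """Mask everything except the last n characters."""
    cut = max(len(s) - n, 0)  # index where the unmasked tail starts
    return "".join(mask_char(c) for c in s[:cut]) + s[cut:]
```

Under these rules, mask_show_last_n("1234-5678-8765-4321", 4) gives nnnn-nnnn-nnnn-4321.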






[jira] [Commented] (SPARK-40692) Support data masking built-in function 'mask_hash'

2022-10-08 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614509#comment-17614509
 ] 

Vinod KC commented on SPARK-40692:
--

I'm working on this sub task

> Support data masking built-in function 'mask_hash'
> --
>
> Key: SPARK-40692
> URL: https://issues.apache.org/jira/browse/SPARK-40692
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support data masking built-in function '{*}mask_hash{*}'
> Return a hashed value based on str. The hash should be consistent and should 
> be used to join masked string values together across tables.
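
A sketch of the idea in Python, assuming SHA-256 as the hash (the ticket does not name an algorithm, so that choice and the hex output format are assumptions made for illustration only).

```python
import hashlib


def mask_hash(s: str) -> str:
    """Return a deterministic hex digest of the UTF-8 bytes of s."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()
```

Because the digest depends only on the input, equal values hash to the same string in every table, which is what makes joining on the masked column possible.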






[jira] [Commented] (SPARK-40690) Support data masking built-in function 'mask_show_first_n'

2022-10-08 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614507#comment-17614507
 ] 

Vinod KC commented on SPARK-40690:
--

I'm working on this sub task

> Support data masking built-in function 'mask_show_first_n'
> --
>
> Key: SPARK-40690
> URL: https://issues.apache.org/jira/browse/SPARK-40690
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support data masking built-in function '{*}mask_show_first_n{*}'
> Return a masked version of str, showing the first n characters unmasked. 
> Upper case letters should be converted to "X", lower case letters should be 
> converted to "x" and numbers should be converted to "n". For example, 
> mask_show_first_n("1234-5678-8765-4321", 4) results in 1234-nnnn-nnnn-nnnn.
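
A plain-Python sketch of this rule, for illustration only and not the Spark implementation. The helper mask_char and the handling of non-alphanumeric characters (left unchanged) are assumptions.

```python
def mask_char(c: str) -> str:
    """Map upper case to "X", lower case to "x", digits to "n"; keep the rest."""
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c


def mask_show_first_n(s: str, n: int) -> str:
    """Keep the first n characters and mask the remainder."""
    n = max(n, 0)  # treat a negative n as 0 so slicing stays well-defined
    return s[:n] + "".join(mask_char(c) for c in s[n:])
```

With the example input and n = 4, the first group stays readable and the result is 1234-nnnn-nnnn-nnnn.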






[jira] [Commented] (SPARK-40689) Support data masking built-in function 'mask_last_n'

2022-10-08 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614506#comment-17614506
 ] 

Vinod KC commented on SPARK-40689:
--

I'm working on this sub task

> Support data masking built-in function 'mask_last_n'
> 
>
> Key: SPARK-40689
> URL: https://issues.apache.org/jira/browse/SPARK-40689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support data masking built-in function *mask_last_n*
> Return a masked version of str with the last n values masked. Upper case 
> letters should be converted to "X", lower case letters should be converted to 
> "x" and numbers should be converted to "n". For example, 
> mask_last_n("1234-5678-8765-4321", 4) results in 1234-5678-8765-nnnn.
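
Sketched in Python under the same character rules (an illustration, not the Spark implementation; mask_char is a hypothetical helper and other characters are assumed to pass through unchanged):

```python
def mask_char(c: str) -> str:
    """Map upper case to "X", lower case to "x", digits to "n"; keep the rest."""
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c


def mask_last_n(s: str, n: int) -> str:
    """Mask only the last n characters."""
    cut = max(len(s) - n, 0)  # index where the masked tail starts
    return s[:cut] + "".join(mask_char(c) for c in s[cut:])
```

For the example input and n = 4, only the final group is masked: 1234-5678-8765-nnnn.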






[jira] [Commented] (SPARK-40688) Support data masking built-in function 'mask_first_n'

2022-10-08 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614505#comment-17614505
 ] 

Vinod KC commented on SPARK-40688:
--

I'm working on this sub task

> Support data masking built-in function  'mask_first_n'
> --
>
> Key: SPARK-40688
> URL: https://issues.apache.org/jira/browse/SPARK-40688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support data masking built-in function *mask_first_n*
> Return a masked version of str with the first n values masked. Upper case 
> letters should be converted to "X", lower case letters should be converted to 
> "x" and numbers should be converted to "n". For example, 
> mask_first_n("1234-5678-8765-4321", 4) results in nnnn-5678-8765-4321.
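
A short Python sketch of the stated rule (illustrative only, not the Spark implementation; mask_char is a hypothetical helper, and leaving characters such as "-" unchanged is an assumption):

```python
def mask_char(c: str) -> str:
    """Map upper case to "X", lower case to "x", digits to "n"; keep the rest."""
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c


def mask_first_n(s: str, n: int) -> str:
    """Mask only the first n characters."""
    n = max(n, 0)  # treat a negative n as 0 so slicing stays well-defined
    return "".join(mask_char(c) for c in s[:n]) + s[n:]
```

For the example input and n = 4, only the leading group is masked: nnnn-5678-8765-4321.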






[jira] [Commented] (SPARK-40712) upgrade sbt-assembly plugin to 1.2.0

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614488#comment-17614488
 ] 

Apache Spark commented on SPARK-40712:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38164

> upgrade sbt-assembly plugin to 1.2.0
> 
>
> Key: SPARK-40712
> URL: https://issues.apache.org/jira/browse/SPARK-40712
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> * [https://github.com/sbt/sbt-assembly/releases/tag/v1.0.0]
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.1.0
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.2.0






[jira] [Assigned] (SPARK-40712) upgrade sbt-assembly plugin to 1.2.0

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40712:


Assignee: (was: Apache Spark)

> upgrade sbt-assembly plugin to 1.2.0
> 
>
> Key: SPARK-40712
> URL: https://issues.apache.org/jira/browse/SPARK-40712
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> * [https://github.com/sbt/sbt-assembly/releases/tag/v1.0.0]
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.1.0
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.2.0






[jira] [Assigned] (SPARK-40712) upgrade sbt-assembly plugin to 1.2.0

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40712:


Assignee: Apache Spark

> upgrade sbt-assembly plugin to 1.2.0
> 
>
> Key: SPARK-40712
> URL: https://issues.apache.org/jira/browse/SPARK-40712
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> * [https://github.com/sbt/sbt-assembly/releases/tag/v1.0.0]
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.1.0
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.2.0






[jira] [Commented] (SPARK-40712) upgrade sbt-assembly plugin to 1.2.0

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614486#comment-17614486
 ] 

Apache Spark commented on SPARK-40712:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38164

> upgrade sbt-assembly plugin to 1.2.0
> 
>
> Key: SPARK-40712
> URL: https://issues.apache.org/jira/browse/SPARK-40712
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> * [https://github.com/sbt/sbt-assembly/releases/tag/v1.0.0]
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.1.0
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.2.0






[jira] [Updated] (SPARK-40712) upgrade sbt-assembly plugin to 1.2.0

2022-10-08 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40712:
-
Summary: upgrade sbt-assembly plugin to 1.2.0  (was: upgra sbt-assembly 
plugin to 1.2.0)

> upgrade sbt-assembly plugin to 1.2.0
> 
>
> Key: SPARK-40712
> URL: https://issues.apache.org/jira/browse/SPARK-40712
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> * [https://github.com/sbt/sbt-assembly/releases/tag/v1.0.0]
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.1.0
>  * https://github.com/sbt/sbt-assembly/releases/tag/v1.2.0






[jira] [Created] (SPARK-40712) upgra sbt-assembly plugin to 1.2.0

2022-10-08 Thread Yang Jie (Jira)
Yang Jie created SPARK-40712:


 Summary: upgra sbt-assembly plugin to 1.2.0
 Key: SPARK-40712
 URL: https://issues.apache.org/jira/browse/SPARK-40712
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


* [https://github.com/sbt/sbt-assembly/releases/tag/v1.0.0]
 * https://github.com/sbt/sbt-assembly/releases/tag/v1.1.0
 * https://github.com/sbt/sbt-assembly/releases/tag/v1.2.0






[jira] [Assigned] (SPARK-40711) Add spill size metrics for window

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40711:


Assignee: Apache Spark

> Add spill size metrics for window
> -
>
> Key: SPARK-40711
> URL: https://issues.apache.org/jira/browse/SPARK-40711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-40711) Add spill size metrics for window

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614472#comment-17614472
 ] 

Apache Spark commented on SPARK-40711:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38163

> Add spill size metrics for window
> -
>
> Key: SPARK-40711
> URL: https://issues.apache.org/jira/browse/SPARK-40711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>







[jira] [Assigned] (SPARK-40711) Add spill size metrics for window

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40711:


Assignee: (was: Apache Spark)

> Add spill size metrics for window
> -
>
> Key: SPARK-40711
> URL: https://issues.apache.org/jira/browse/SPARK-40711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40677) Shade more dependency to be able to run separately

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614464#comment-17614464
 ] 

Apache Spark commented on SPARK-40677:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38162

> Shade more dependency to be able to run separately
> --
>
> Key: SPARK-40677
> URL: https://issues.apache.org/jira/browse/SPARK-40677
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> https://github.com/apache/spark/pull/38109 separated the component but found 
> out that there were several more jars to be shaded. See also 
> https://github.com/apache/spark/pull/38109#issuecomment-1269836435






[jira] [Commented] (SPARK-40677) Shade more dependency to be able to run separately

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614463#comment-17614463
 ] 

Apache Spark commented on SPARK-40677:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38162

> Shade more dependency to be able to run separately
> --
>
> Key: SPARK-40677
> URL: https://issues.apache.org/jira/browse/SPARK-40677
> Project: Spark
> Issue Type: Sub-task
> Components: Build, Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
>
>
> https://github.com/apache/spark/pull/38109 separated the component, but it 
> turned out that several more jars need to be shaded. See also 
> https://github.com/apache/spark/pull/38109#issuecomment-1269836435






[jira] [Created] (SPARK-40711) Add spill size metrics for window

2022-10-08 Thread XiDuo You (Jira)
XiDuo You created SPARK-40711:
-

Summary: Add spill size metrics for window
Key: SPARK-40711
URL: https://issues.apache.org/jira/browse/SPARK-40711
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You









[jira] [Assigned] (SPARK-40710) Supplement undocumented parquet configurations in documentation

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40710:


Assignee: (was: Apache Spark)

> Supplement undocumented parquet configurations in documentation
> ---
>
> Key: SPARK-40710
> URL: https://issues.apache.org/jira/browse/SPARK-40710
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.3.0
> Reporter: Qian Sun
> Priority: Major
>







[jira] [Commented] (SPARK-40710) Supplement undocumented parquet configurations in documentation

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614428#comment-17614428
 ] 

Apache Spark commented on SPARK-40710:
--

User 'dcoliversun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38160

> Supplement undocumented parquet configurations in documentation
> ---
>
> Key: SPARK-40710
> URL: https://issues.apache.org/jira/browse/SPARK-40710
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.3.0
> Reporter: Qian Sun
> Priority: Major
>







[jira] [Assigned] (SPARK-40710) Supplement undocumented parquet configurations in documentation

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40710:


Assignee: Apache Spark

> Supplement undocumented parquet configurations in documentation
> ---
>
> Key: SPARK-40710
> URL: https://issues.apache.org/jira/browse/SPARK-40710
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.3.0
> Reporter: Qian Sun
> Assignee: Apache Spark
> Priority: Major
>







[jira] [Commented] (SPARK-40594) Eagerly release hashed relation in ShuffledHashJoin

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614425#comment-17614425
 ] 

Apache Spark commented on SPARK-40594:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38159

> Eagerly release hashed relation in ShuffledHashJoin
> ---
>
> Key: SPARK-40594
> URL: https://issues.apache.org/jira/browse/SPARK-40594
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Priority: Major
>
> ShuffledHashJoin releases the built hashed relation at the end of the task 
> using a taskCompletionListener. This is not always good enough for complex 
> SQL queries.
> If a sort-merge join (smj) or a window sits on top of the shuffled hash join 
> (shj), the hashed relation in the shj is effectively leaked: all rows have 
> already been consumed by the sort before the smj or window, yet the buffer 
> cannot allocate the memory that is still held by the hashed relation. This 
> causes unnecessary spill.
> It is a common case in multi-join queries, since AQE supports converting smj 
> to shj at runtime.
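
The effect described above can be illustrated with a toy model. This is a
minimal sketch, not Spark code: MemoryPool, run_task and all sizes are
hypothetical names and numbers chosen for illustration. It only compares
releasing the hashed relation from a task-completion callback with releasing
it as soon as the join output has been fully consumed.

```python
class MemoryPool:
    """Toy stand-in for a per-task memory pool (hypothetical, not a Spark API)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0

    def acquire_up_to(self, wanted: int) -> int:
        # Grant as much as is free; the caller spills whatever is not granted.
        granted = min(wanted, self.capacity - self.used)
        self.used += granted
        return granted

    def release(self, amount: int) -> None:
        self.used -= amount


def run_task(capacity: int, relation_bytes: int, buffer_bytes: int,
             eager_release: bool) -> int:
    """Return how many bytes the downstream buffer has to spill."""
    pool = MemoryPool(capacity)
    completion_listeners = []

    # Build side: the hashed relation takes its memory up front.
    state = {"held": pool.acquire_up_to(relation_bytes)}

    def free_relation() -> None:
        # Idempotent, so it is safe to call eagerly and again at task end.
        pool.release(state["held"])
        state["held"] = 0

    # Assumed baseline: the relation is only freed when the task completes.
    completion_listeners.append(free_relation)

    # ... the join output is fully consumed by the sort at this point ...
    if eager_release:
        free_relation()

    # Downstream operator (sort-merge join or window) asks for buffer memory.
    granted = pool.acquire_up_to(buffer_bytes)
    spilled = buffer_bytes - granted

    # Task end: release the buffer and run the completion callbacks.
    pool.release(granted)
    for listener in completion_listeners:
        listener()
    return spilled


print(run_task(100, 60, 70, eager_release=False))  # 30
print(run_task(100, 60, 70, eager_release=True))   # 0
```

With the relation held until task end, only 40 of the 100 units are free when
the buffer asks for 70, so 30 are spilled; with eager release the full request
fits and nothing is spilled.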






[jira] [Commented] (SPARK-40594) Eagerly release hashed relation in ShuffledHashJoin

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614424#comment-17614424
 ] 

Apache Spark commented on SPARK-40594:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/38159

> Eagerly release hashed relation in ShuffledHashJoin
> ---
>
> Key: SPARK-40594
> URL: https://issues.apache.org/jira/browse/SPARK-40594
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Priority: Major
>
> ShuffledHashJoin releases the built hashed relation at the end of the task 
> using a taskCompletionListener. This is not always good enough for complex 
> SQL queries.
> If a sort-merge join (smj) or a window sits on top of the shuffled hash join 
> (shj), the hashed relation in the shj is effectively leaked: all rows have 
> already been consumed by the sort before the smj or window, yet the buffer 
> cannot allocate the memory that is still held by the hashed relation. This 
> causes unnecessary spill.
> It is a common case in multi-join queries, since AQE supports converting smj 
> to shj at runtime.






[jira] [Assigned] (SPARK-40594) Eagerly release hashed relation in ShuffledHashJoin

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40594:


Assignee: Apache Spark

> Eagerly release hashed relation in ShuffledHashJoin
> ---
>
> Key: SPARK-40594
> URL: https://issues.apache.org/jira/browse/SPARK-40594
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: Apache Spark
> Priority: Major
>
> ShuffledHashJoin releases the built hashed relation at the end of the task 
> using a taskCompletionListener. This is not always good enough for complex 
> SQL queries.
> If a sort-merge join (smj) or a window sits on top of the shuffled hash join 
> (shj), the hashed relation in the shj is effectively leaked: all rows have 
> already been consumed by the sort before the smj or window, yet the buffer 
> cannot allocate the memory that is still held by the hashed relation. This 
> causes unnecessary spill.
> It is a common case in multi-join queries, since AQE supports converting smj 
> to shj at runtime.






[jira] [Assigned] (SPARK-40594) Eagerly release hashed relation in ShuffledHashJoin

2022-10-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40594:


Assignee: (was: Apache Spark)

> Eagerly release hashed relation in ShuffledHashJoin
> ---
>
> Key: SPARK-40594
> URL: https://issues.apache.org/jira/browse/SPARK-40594
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Priority: Major
>
> ShuffledHashJoin releases the built hashed relation at the end of the task 
> using a taskCompletionListener. This is not always good enough for complex 
> SQL queries.
> If a sort-merge join (smj) or a window sits on top of the shuffled hash join 
> (shj), the hashed relation in the shj is effectively leaked: all rows have 
> already been consumed by the sort before the smj or window, yet the buffer 
> cannot allocate the memory that is still held by the hashed relation. This 
> causes unnecessary spill.
> It is a common case in multi-join queries, since AQE supports converting smj 
> to shj at runtime.






[jira] [Created] (SPARK-40710) Supplement undocumented parquet configurations in documentation

2022-10-08 Thread Qian Sun (Jira)
Qian Sun created SPARK-40710:


Summary: Supplement undocumented parquet configurations in documentation
Key: SPARK-40710
URL: https://issues.apache.org/jira/browse/SPARK-40710
Project: Spark
Issue Type: Sub-task
Components: Documentation
Affects Versions: 3.3.0
Reporter: Qian Sun





