[jira] [Created] (SPARK-43828) Add config to control whether to close idle connections

2023-05-26 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-43828:


 Summary: Add config to control whether to close idle connections
 Key: SPARK-43828
 URL: https://issues.apache.org/jira/browse/SPARK-43828
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Zhongwei Zhu









[jira] [Commented] (SPARK-41775) Implement training functions as input

2023-05-26 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726768#comment-17726768
 ] 

Snoot.io commented on SPARK-41775:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/41337

> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Assignee: Rithwik Ediga Lakhamsani
>Priority: Major
> Fix For: 3.4.0
>
>
> Sidenote: make formatting updates described in 
> https://github.com/apache/spark/pull/39188
>  
> Currently, `Distributor().run(...)` takes only files as input. We will now add 
> functionality to accept functions as well. This requires the following process 
> on each task on the executor nodes:
> 1. Take the input function and args and pickle them.
> 2. Create a temp train.py file that looks like:
> {code:java}
> import cloudpickle
> import os
> 
> if __name__ == "__main__":
>     with open(f"{tempdir}/train_input.pkl", "rb") as f:
>         train, args = cloudpickle.load(f)
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0":  # i.e. partitionId == 0
>         with open(f"{tempdir}/train_output.pkl", "wb") as f:
>             cloudpickle.dump(output, f) {code}
> 3. Run that train.py file with `torchrun`.
> 4. Check whether `train_output.pkl` has been created by the process with 
> partitionId == 0; if it has, deserialize it and return that output through 
> `.collect()`.
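
For reference, here is a minimal sketch of the round trip described in steps 1-4: the 
driver pickles the function and its args, and the generated train.py loads and runs 
them, with only rank 0 writing the output. This is only an illustration of the 
mechanism (it assumes `tempdir` is a directory visible to the task), not the actual 
Distributor implementation:

{code:python}
import os
import tempfile

import cloudpickle

def train(lr):
    return {"lr": lr}

tempdir = tempfile.mkdtemp()

# Step 1: pickle the user function and its args (driver side).
with open(f"{tempdir}/train_input.pkl", "wb") as f:
    cloudpickle.dump((train, (0.01,)), f)

# Steps 2-4 condensed: what the generated train.py does on each task.
with open(f"{tempdir}/train_input.pkl", "rb") as f:
    fn, args = cloudpickle.load(f)
output = fn(*args)
if output and os.environ.get("RANK", "0") == "0":  # only rank 0 / partitionId == 0 writes
    with open(f"{tempdir}/train_output.pkl", "wb") as f:
        cloudpickle.dump(output, f)
{code}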






[jira] [Commented] (SPARK-43775) DataSource V2: Allow representing updates as deletes and inserts

2023-05-26 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726767#comment-17726767
 ] 

Snoot.io commented on SPARK-43775:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/41300

> DataSource V2: Allow representing updates as deletes and inserts
> 
>
> Key: SPARK-43775
> URL: https://issues.apache.org/jira/browse/SPARK-43775
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> It may be beneficial for data sources to represent updates as deletes and 
> inserts for delta-based implementations. Specifically, it may be helpful to 
> properly distribute and order records on write. Remember that delete records 
> have only row ID and metadata attributes set. Update records have data, row 
> ID, and metadata attributes set. Insert records have only data attributes set.
> For instance, a data source may rely on a metadata column _row_id (synthetic, 
> internally generated) to identify the row and be partitioned by 
> bucket(product_id). Splitting updates into inserts and deletes would allow 
> data sources to cluster all update and insert records for the same partition 
> into a single task. Otherwise, the clustering key for updates and inserts 
> will be different (updates have _row_id set). This is critical to reduce the 
> number of generated files.
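
To make the split concrete, here is a conceptual PySpark sketch (this is not the 
DataSource V2 writer API; `_row_id` and `product_id` follow the example above, 
everything else is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical update records: data attributes plus the synthetic _row_id metadata column.
updates = spark.createDataFrame(
    [(42, 1001, 9.99), (43, 1002, 19.99)],
    ["_row_id", "product_id", "price"],
)

# Written as updates, these rows carry _row_id and so cannot share a clustering key
# with plain inserts. Split each update into a delete (row ID only) and an insert
# (data only); both halves can then be clustered by bucket(product_id) together with
# the other deletes/inserts for the same partition, producing fewer files.
deletes = updates.select("_row_id")
inserts = updates.select("product_id", "price")
{code}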






[jira] [Updated] (SPARK-43818) Spark Glue job introduces duplicates while writing a dataframe as file to S3

2023-05-26 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-43818:

Language:   (was: English)

> Spark Glue job introduces duplicates while writing a dataframe as file to S3
> 
>
> Key: SPARK-43818
> URL: https://issues.apache.org/jira/browse/SPARK-43818
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.0
> Environment: Production
>Reporter: Umesh Kant
>Priority: Major
>
> We have an AWS Glue (Spark) based ETL framework which processes the data through 
> multiple hops and finally writes the dataframe to an S3 bucket as parquet files 
> with snappy compression. We have used this framework to process and write 
> data to S3 for 1000+ tables/files and it works fine. But for two of the 
> tables, the in-memory data frame contains correct records, yet when the data 
> frame gets persisted to S3 as a file, it introduces duplicate entries, and since 
> the total count remains the same, the duplicates cause missing records as well.
> {*}Data Points{*}:
>  # This happens only for large tables (wide tables with millions of rows)
>  # When this happens we notice stage failures; the retry succeeds but 
> causes duplicates/missing records
> {*}Code Steps{*}:
> |Steps Information|Dataframe|Query / Operation /Action|
> |Query Raw DB & get no of partition ( to  loop one by one)| |select distinct 
> partition_0 FROM  .|
> |Raw DF Query|raw|select SCHDWKID_REF, TASK_REF, LIFECYCLE_REF, TASK_DESC, 
> WHOSE_ENT_NAME, WHOSE_INST_REF, WHOSE_INST_CDE, STENDDAT_STRTDT, 
> STENDDAT_ENDDAT, AGENT_ENT_NAME, AGENT_INST_REF, AGENT_INST_CDE, AGENT_CODE, 
> LOCATION_ENT_NAME, LOCATION_INST_REF, LOCATION_INST_CDE, CASEID_NUMBER, 
> FACE_AMT, TAAR_AMT, AUTH_AMT, TRANSFER_YORN_ENCODE, TRANSFER_YORN_DECODE, 
> TRANSFER_YORN_ELMREF, CASE_YORN_ENCODE, CASE_YORN_DECODE, CASE_YORN_ELMREF, 
> CHANGEID_REF, CNTRCTID_REF, CNTRCTID_NUMBER, KTKDSCID_REF, KWNOFFID_REF, 
> KWNOFFID_CODE, USERID_REF, USERID_CODE, WQUEUEID_REF, WQUEUEID_CODE, 
> STATUS_REF, STATUS_CODE, STATUS_ASAT, LASTUPD_USER, LASTUPD_TERMNO, 
> LASTUPD_PROG, LASTUPD_INFTIM, KWQPRIID_REF, KWQPRIID_CODE, INSURED_NAME, 
> AGENT_NAME, EDM_INGESTED_AT, EDM_INGEST_TIME, PARTITION_0, DELTA_IND, 
> TRANSACT_SEQ from RAW_ORACLE_ORAP12_NYLDPROD60CL.SCHEDULED_WORK where 
> partition_0= '20230428'|
> |Structured  DF Query|structured|SELECT * FROM 
> RL_LAKE_ORACLE_ORAP12_NYLDPROD60CL.SCHEDULED_WORK WHERE part_num > 0 |
> | | | |
> |Merge DF Generated By joining raw & structured on nks|df_merge|df_merge = 
> structured.join(raw,keys,how='fullouter')|
> |action column will be added to
>  Merge Df|df_merge|df_merge = df_merge.withColumn("action", 
> fn.when((((df_merge['structured.EDH_RECORD_STATUS_IN'] == 'A') \| 
> (df_merge['structured.EDH_RECORD_STATUS_IN'] == 'D')) & ( 
> df_merge['raw.chksum'].isNull()) & (~ 
> df_merge['structured.CHKSUM'].isNull())) , "NOACTION")
>     
> .when((df_merge['structured.CHKSUM'].isNull()) & (df_merge['raw.delta_ind']!= 
> 'D'), "INSERT")
>     
> .when((df_merge['structured.CHKSUM'] != df_merge['raw.chksum']) & (~ 
> df_merge['structured.CHKSUM'].isNull()) & 
> (df_merge['structured.EDH_RECORD_STATUS_IN'] == 'A') & 
> ((df_merge['raw.delta_ind'] == 'U') \| (df_merge['raw.delta_ind'] == 'I')), 
> "UPDATE")
>     
> .when(((df_merge['raw.delta_ind']== 'D') & 
> (df_merge['structured.EDH_RECORD_STATUS_IN'] == 'A')) , "DELETE")
>     
> .when(((df_merge['raw.delta_ind']== 'D') & 
> (df_merge['structured.EDH_RECORD_STATUS_IN'] == 'D') ) , "DELETECOPY")
>     
> .when(((df_merge['raw.delta_ind']== 'I') & 
> (df_merge['structured.EDH_RECORD_STATUS_IN'] == 'D') & (~ 
> df_merge['raw.chksum'].isNull()) & (~ 
> df_merge['structured.CHKSUM'].isNull())) , "DELETEREINSERT")
>     
> .when(((df_merge['raw.delta_ind']== 'D') & 
> (df_merge['structured.CHKSUM'].isNull())) , "DELETEABSENT")
>     
> .when((df_merge['structured.CHKSUM'] == df_merge['raw.chksum']), "NOCHANGE"))|
> | | | |
> |No Action df will be derived from merge df|df_noaction|df_noaction = 
> df_merge.select(keys + ['structured.' + x.upper() for x in 
> structured_cols_list if x.upper() not in keys]).where((df_merge.action == 
> 'NOACTION') \| (df_merge.action == 'NOCHANGE'))|
> |Delete Copy DF will be derived|df_dcopy|df_dcopy = df_merge.select(keys + 
> ['structured.' + x.upper() for x in structured_cols_list if x.upper() not in 
> keys]).where(df_merge.action == 'DELETECOPY')|
> |Delete Absent df will be derive

[jira] [Commented] (SPARK-43802) unbase64 and unhex codegen are invalid with failOnError

2023-05-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726764#comment-17726764
 ] 

Dongjoon Hyun commented on SPARK-43802:
---

This is backported to branch-3.4 via https://github.com/apache/spark/pull/41334

> unbase64 and unhex codegen are invalid with failOnError
> ---
>
> Key: SPARK-43802
> URL: https://issues.apache.org/jira/browse/SPARK-43802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> to_binary with hex and base64 generate invalid codegen:
> {{spark.range(5).selectExpr('to_binary(base64(cast(id as binary)), 
> "BASE64")').show()}}
> results in
> {{Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 47, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 47, Column 1: Unknown variable or type "BASE64"}}
> because this is the generated code:
> /* 107 */         if 
> (!org.apache.spark.sql.catalyst.expressions.UnBase64.isValidBase64(project_value_1))
>  {
> /* 108 */           throw QueryExecutionErrors.invalidInputInConversionError(
> /* 109 */             ((org.apache.spark.sql.types.BinaryType$) references[1] 
> /* to */),
> /* 110 */             project_value_1,
> /* 111 */             BASE64,
> /* 112 */             "try_to_binary");
> /* 113 */         }






[jira] [Updated] (SPARK-43802) unbase64 and unhex codegen are invalid with failOnError

2023-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43802:
--
Issue Type: Bug  (was: New Feature)

> unbase64 and unhex codegen are invalid with failOnError
> ---
>
> Key: SPARK-43802
> URL: https://issues.apache.org/jira/browse/SPARK-43802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.5.0
>
>
> to_binary with hex and base64 generate invalid codegen:
> {{spark.range(5).selectExpr('to_binary(base64(cast(id as binary)), 
> "BASE64")').show()}}
> results in
> {{Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 47, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 47, Column 1: Unknown variable or type "BASE64"}}
> because this is the generated code:
> /* 107 */         if 
> (!org.apache.spark.sql.catalyst.expressions.UnBase64.isValidBase64(project_value_1))
>  {
> /* 108 */           throw QueryExecutionErrors.invalidInputInConversionError(
> /* 109 */             ((org.apache.spark.sql.types.BinaryType$) references[1] 
> /* to */),
> /* 110 */             project_value_1,
> /* 111 */             BASE64,
> /* 112 */             "try_to_binary");
> /* 113 */         }






[jira] [Updated] (SPARK-43802) unbase64 and unhex codegen are invalid with failOnError

2023-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43802:
--
Fix Version/s: 3.4.1

> unbase64 and unhex codegen are invalid with failOnError
> ---
>
> Key: SPARK-43802
> URL: https://issues.apache.org/jira/browse/SPARK-43802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> to_binary with hex and base64 generate invalid codegen:
> {{spark.range(5).selectExpr('to_binary(base64(cast(id as binary)), 
> "BASE64")').show()}}
> results in
> {{Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 47, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 47, Column 1: Unknown variable or type "BASE64"}}
> because this is the generated code:
> /* 107 */         if 
> (!org.apache.spark.sql.catalyst.expressions.UnBase64.isValidBase64(project_value_1))
>  {
> /* 108 */           throw QueryExecutionErrors.invalidInputInConversionError(
> /* 109 */             ((org.apache.spark.sql.types.BinaryType$) references[1] 
> /* to */),
> /* 110 */             project_value_1,
> /* 111 */             BASE64,
> /* 112 */             "try_to_binary");
> /* 113 */         }






[jira] [Created] (SPARK-43827) Assign a name to the error class _LEGACY_ERROR_TEMP_2417

2023-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43827:
--

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_2417
 Key: SPARK-43827
 URL: https://issues.apache.org/jira/browse/SPARK-43827
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng









[jira] [Created] (SPARK-43826) Assign a name to the error class _LEGACY_ERROR_TEMP_2416

2023-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43826:
--

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_2416
 Key: SPARK-43826
 URL: https://issues.apache.org/jira/browse/SPARK-43826
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng









[jira] [Created] (SPARK-43824) Assign a name to the error class _LEGACY_ERROR_TEMP_1281

2023-05-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43824:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_1281
 Key: SPARK-43824
 URL: https://issues.apache.org/jira/browse/SPARK-43824
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-43825) Assign a name to the error class _LEGACY_ERROR_TEMP_1282

2023-05-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43825:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_1282
 Key: SPARK-43825
 URL: https://issues.apache.org/jira/browse/SPARK-43825
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-43823) Assign a name to the error class _LEGACY_ERROR_TEMP_2414

2023-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43823:
--

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_2414
 Key: SPARK-43823
 URL: https://issues.apache.org/jira/browse/SPARK-43823
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng









[jira] [Created] (SPARK-43822) Assign a name to the error class _LEGACY_ERROR_TEMP_2413

2023-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43822:
--

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_2413
 Key: SPARK-43822
 URL: https://issues.apache.org/jira/browse/SPARK-43822
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng









[jira] [Updated] (SPARK-43821) Make the prompt for `findJar` method in IntegrationTestUtils clearer

2023-05-26 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43821:

Description: 
When tests in ClientE2ETestSuite fail, I often cannot locate the cause from the 
error prompt and can only search the code for the specific reason.
 * Before applying this patch, the error prompt is as follows:
Exception encountered when invoking run on a nested suite - Failed to find the 
jar inside folder: .../spark-community/connector/connect/server/target

 
 * After applying this patch, the error prompt is as follows:
Exception encountered when invoking run on a nested suite - Failed to find the 
jar: {color:#ff}spark-connect-assembly(.{*}).jar or 
spark-connect(.{*})3.5.0-SNAPSHOT.jar {color}inside folder: 
.../spark-community/connector/connect/server/target. {color:#ff}This file 
can be generated by a command similar to: build/sbt 
package|assembly{color}

  was:
When tests in ClientE2ETestSuite fail, I often cannot locate the cause from the 
error prompt and can only search the code for the specific reason.
 * Before applying this patch, the error prompt is as follows:
Exception encountered when invoking run on a nested suite - Failed to find the 
jar inside folder: .../spark-community/connector/connect/server/target

 
 * After applying this patch, the error prompt is as follows:
Exception encountered when invoking run on a nested suite - Failed to find the 
jar: {color:#FF}spark-connect-assembly(.*).jar or 
spark-connect(.*)3.5.0-SNAPSHOT.jar {color}inside folder: 
.../spark-community/connector/connect/server/target. {color:#FF}This file 
can be generated by a command similar to: build/sbt package | 
assembly{color}


> Make the prompt for `findJar` method in IntegrationTestUtils clearer
> 
>
> Key: SPARK-43821
> URL: https://issues.apache.org/jira/browse/SPARK-43821
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> When tests in ClientE2ETestSuite fail, I often cannot locate the cause from 
> the error prompt and can only search the code for the specific reason.
>  * Before applying this patch, the error prompt is as follows:
> Exception encountered when invoking run on a nested suite - Failed to find 
> the jar inside folder: .../spark-community/connector/connect/server/target
>  
>  * After applying this patch, the error prompt is as follows:
> Exception encountered when invoking run on a nested suite - Failed to find 
> the jar: {color:#ff}spark-connect-assembly(.{*}).jar or 
> spark-connect(.{*})3.5.0-SNAPSHOT.jar {color}inside folder: 
> .../spark-community/connector/connect/server/target. {color:#ff}This file 
> can be generated by a command similar to: build/sbt 
> package|assembly{color}






[jira] [Updated] (SPARK-43821) Make the prompt for `findJar` method in IntegrationTestUtils clearer

2023-05-26 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-43821:

Description: 
When tests in ClientE2ETestSuite fail, I often cannot locate the cause from the 
error prompt and can only search the code for the specific reason.
 * Before applying this patch, the error prompt is as follows:
Exception encountered when invoking run on a nested suite - Failed to find the 
jar inside folder: .../spark-community/connector/connect/server/target

 
 * After applying this patch, the error prompt is as follows:
Exception encountered when invoking run on a nested suite - Failed to find the 
jar: {color:#FF}spark-connect-assembly(.*).jar or 
spark-connect(.*)3.5.0-SNAPSHOT.jar {color}inside folder: 
.../spark-community/connector/connect/server/target. {color:#FF}This file 
can be generated by a command similar to: build/sbt package | 
assembly{color}

> Make the prompt for `findJar` method in IntegrationTestUtils clearer
> 
>
> Key: SPARK-43821
> URL: https://issues.apache.org/jira/browse/SPARK-43821
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>
> When tests in ClientE2ETestSuite fail, I often cannot locate the cause from 
> the error prompt and can only search the code for the specific reason.
>  * Before applying this patch, the error prompt is as follows:
> Exception encountered when invoking run on a nested suite - Failed to find 
> the jar inside folder: .../spark-community/connector/connect/server/target
>  
>  * After applying this patch, the error prompt is as follows:
> Exception encountered when invoking run on a nested suite - Failed to find 
> the jar: {color:#FF}spark-connect-assembly(.*).jar or 
> spark-connect(.*)3.5.0-SNAPSHOT.jar {color}inside folder: 
> .../spark-community/connector/connect/server/target. {color:#FF}This file 
> can be generated by a command similar to: build/sbt package | 
> assembly{color}






[jira] [Created] (SPARK-43821) Make the prompt for `findJar` method in IntegrationTestUtils clearer

2023-05-26 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-43821:
---

 Summary: Make the prompt for `findJar` method in 
IntegrationTestUtils clearer
 Key: SPARK-43821
 URL: https://issues.apache.org/jira/browse/SPARK-43821
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Tests
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-43820) Assign a name to the error class _LEGACY_ERROR_TEMP_2411

2023-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-43820:
--

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_2411
 Key: SPARK-43820
 URL: https://issues.apache.org/jira/browse/SPARK-43820
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng









[jira] [Assigned] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier

2023-05-26 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-43205:
--

Assignee: Serge Rielau

> Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
> ---
>
> Key: SPARK-43205
> URL: https://issues.apache.org/jira/browse/SPARK-43205
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>
> There is a requirement for SQL templates, where the table and/or column names 
> are provided through substitution. This can be done today using variable 
> substitution:
> SET hivevar:tabname = mytab;
> SELECT * FROM ${ hivevar:tabname };
> A straight variable substitution is dangerous since it does allow for SQL 
> injection:
> SET hivevar:tabname = mytab, someothertab;
> SELECT * FROM ${ hivevar:tabname };
> A way to get around this problem is to wrap the variable substitution with a 
> clause that limits the scope to produce an identifier.
> This approach is taken by Snowflake:
>  
> [https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql]
> SET hivevar:tabname = 'tabname';
> SELECT * FROM IDENTIFIER(${ hivevar:tabname })
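
As a rough sketch of how this could look in Spark SQL once the clause is available 
(hedged: the final surface may differ; this assumes PySpark's parameterized 
spark.sql support and a hypothetical table mytab):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(3).createOrReplaceTempView("mytab")

# The table name is passed as a value, and IDENTIFIER() constrains it to a single
# identifier, so a value like "mytab, someothertab" is rejected instead of being
# spliced into the FROM clause.
spark.sql("SELECT * FROM IDENTIFIER(:tab)", args={"tab": "mytab"}).show()

# A constant string works the same way:
spark.sql("SELECT * FROM IDENTIFIER('mytab')").show()
{code}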






[jira] [Resolved] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier

2023-05-26 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-43205.

Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41007
[https://github.com/apache/spark/pull/41007]

> Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
> ---
>
> Key: SPARK-43205
> URL: https://issues.apache.org/jira/browse/SPARK-43205
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
> Fix For: 3.5.0
>
>
> There is a requirement for SQL templates, where the table and/or column names 
> are provided through substitution. This can be done today using variable 
> substitution:
> SET hivevar:tabname = mytab;
> SELECT * FROM ${ hivevar:tabname };
> A straight variable substitution is dangerous since it does allow for SQL 
> injection:
> SET hivevar:tabname = mytab, someothertab;
> SELECT * FROM ${ hivevar:tabname };
> A way to get around this problem is to wrap the variable substitution with a 
> clause that limits the scope to produce an identifier.
> This approach is taken by Snowflake:
>  
> [https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql]
> SET hivevar:tabname = 'tabname';
> SELECT * FROM IDENTIFIER(${ hivevar:tabname })






[jira] [Created] (SPARK-43819) Barrier Executor Stage Not Retried

2023-05-26 Thread Matthew Tieman (Jira)
Matthew Tieman created SPARK-43819:
--

 Summary: Barrier Executor Stage Not Retried
 Key: SPARK-43819
 URL: https://issues.apache.org/jira/browse/SPARK-43819
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 3.3.2
Reporter: Matthew Tieman


When running a stage using barrier execution mode, the expectation is that a 
failure in a task will result in the stage being retried. However, if an exception 
is thrown from a task, the stage is not retried and the job fails.

Running the pyspark code below will cause a single task to fail, failing the 
stage without retrying.
{code:java}
def test_func(index: int) -> list:
    if index == 0:
        raise RuntimeError("Thrown from test func")
    return []

start_rdd = sc.parallelize([i for i in range(10)], 10)
result = start_rdd.barrier().mapPartitionsWithIndex(lambda i, c: test_func(i))

result.collect(){code}
 

This failure is seen running locally via the pyspark shell and on a K8s cluster.

 

Stack trace from local execution:
{noformat}
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/rdd.py", 
line 1197, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File 
"/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/sql/utils.py", 
line 190, in deco
    return f(*a, **kw)
  File 
"/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py",
 line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not 
recover from a failed barrier ResultStage. Most recent failure reason: Stage 
failed because barrier task ResultTask(0, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
 line 686, in main
    process()
  File 
"/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
 line 676, in process
    out_iter = func(split_index, iterator)
  File "", line 1, in 
  File "", line 3, in test_func
RuntimeError: Thrown from test func


at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
at 
org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
at 
org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at 
org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at 
org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at 
org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at 
java.

[jira] [Updated] (SPARK-43819) Barrier Executor Stage Not Retried on Task Failure

2023-05-26 Thread Matthew Tieman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tieman updated SPARK-43819:
---
Summary: Barrier Executor Stage Not Retried on Task Failure  (was: Barrier 
Executor Stage Not Retried)

> Barrier Executor Stage Not Retried on Task Failure
> --
>
> Key: SPARK-43819
> URL: https://issues.apache.org/jira/browse/SPARK-43819
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2
>Reporter: Matthew Tieman
>Priority: Major
>
> When running a stage using barrier execution mode, the expectation is that a 
> failure in a task will result in the stage being retried. However, if an 
> exception is thrown from a task, the stage is not retried and the job fails.
> Running the pyspark code below will cause a single task to fail, failing the 
> stage without retrying.
> {code:java}
> def test_func(index: int) -> list:
>     if index == 0:
>         raise RuntimeError("Thrown from test func")
>     return []
> start_rdd = sc.parallelize([i for i in range(10)], 10)
> result = start_rdd.barrier().mapPartitionsWithIndex(lambda i, c: test_func(i))
> result.collect(){code}
>  
> This failure is seen running locally via the pyspark shell and on a K8s 
> cluster.
>  
> Stack trace from local execution:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/rdd.py", 
> line 1197, in collect
>     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, in __call__
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/sql/utils.py", 
> line 190, in deco
>     return f(*a, **kw)
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py",
>  line 326, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 686, in main
>     process()
>   File 
> "/opt/homebrew/anaconda3/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py",
>  line 676, in process
>     out_iter = func(split_index, iterator)
>   File "", line 1, in 
>   File "", line 3, in test_func
> RuntimeError: Thrown from test func
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>   at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
>   at 
> org.apache.spark.SparkContext.$anonfun

[jira] [Updated] (SPARK-43815) Add to_varchar alias for to_char SQL function

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Description: 
We want to add the alias to_varchar for the function to_char. 

For users migrating to Spark SQL from an engine that supported to_varchar instead 
of to_char, this change minimizes the number of changes needed to keep their 
applications compatible with Spark SQL syntax.

  was:
We want to add the alias to_varchar for the function to_char. 

For users who are migrating to Spark SQL such that the SQL engine they formerly 
used supported to_varchar instead of to_char, this change would minimize the 
number of chars to their application to ensure it is compatible with Spark SQL 
syntax and support.


> Add to_varchar alias for to_char SQL function
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> We want to add the alias to_varchar for the function to_char. 
> For users migrating to Spark SQL from an engine that supported to_varchar 
> instead of to_char, this change minimizes the number of changes needed to keep 
> their applications compatible with Spark SQL syntax.
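
As a quick illustration of what the alias means in practice (hedged: 
to_char(numberExpr, fmt) already exists in recent Spark releases; to_varchar is 
the addition proposed here and would return the same result):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT to_char(454, '999')").show()      # existing function
spark.sql("SELECT to_varchar(454, '999')").show()   # proposed alias
{code}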






[jira] [Updated] (SPARK-43815) Add SQL functions to_varchar alias for to_char

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Description: 
We want to add the alias to_varchar for the function to_char. 

For users who are migrating to Spark SQL such that the SQL engine they formerly 
used supported to_varchar instead of to_char, this change would minimize the 
number of chars to their application to ensure it is compatible with Spark SQL 
syntax and support.

  was:
We want to add support for the follow functions:
 * to_varchar() as an alias for to_char()

 * Expand to_char() to take date, timestamp and binary expression as the first 
argument. For date and timestamp expression, the function will be equivalent to 
date_format(expr, fmt) . For binary expression, the function will be equivalent 
to base64() , hex(), and decode(, 'UTF-8') for fmt base64, hex, and UTF-8 
respectively.

 * timediff() as an alias for timestampdiff()


> Add SQL functions to_varchar alias for to_char
> --
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> We want to add the alias to_varchar for the function to_char. 
> For users who are migrating to Spark SQL such that the SQL engine they 
> formerly used supported to_varchar instead of to_char, this change would 
> minimize the number of chars to their application to ensure it is compatible 
> with Spark SQL syntax and support.






[jira] [Updated] (SPARK-43815) Add to_varchar alias for to_char SQL function

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Summary: Add to_varchar alias for to_char SQL function  (was: Add SQL 
functions to_varchar alias for to_char)

> Add to_varchar alias for to_char SQL function
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> We want to add the alias to_varchar for the function to_char. 
> For users who are migrating to Spark SQL such that the SQL engine they 
> formerly used supported to_varchar instead of to_char, this change would 
> minimize the number of chars to their application to ensure it is compatible 
> with Spark SQL syntax and support.






[jira] [Updated] (SPARK-43815) Add SQL functions to_varchar alias for to_char

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Summary: Add SQL functions to_varchar alias for to_char  (was: Add SQL 
functions to_varchar and extend to_char functionality)

> Add SQL functions to_varchar alias for to_char
> --
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> We want to add support for the follow functions:
>  * to_varchar() as an alias for to_char()
>  * Expand to_char() to take date, timestamp and binary expression as the 
> first argument. For date and timestamp expression, the function will be 
> equivalent to date_format(expr, fmt) . For binary expression, the function 
> will be equivalent to base64() , hex(), and decode(, 'UTF-8') for fmt 
> base64, hex, and UTF-8 respectively.
>  * timediff() as an alias for timestampdiff()






[jira] [Commented] (SPARK-41660) only propagate metadata columns if they are used

2023-05-26 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726740#comment-17726740
 ] 

Thomas Graves commented on SPARK-41660:
---

It looks like this was backported to 3.3 with 
https://github.com/apache/spark/pull/40889

> only propagate metadata columns if they are used
> 
>
> Key: SPARK-41660
> URL: https://issues.apache.org/jira/browse/SPARK-41660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.3, 3.4.0
>
>







[jira] [Updated] (SPARK-41660) only propagate metadata columns if they are used

2023-05-26 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-41660:
--
Fix Version/s: 3.3.3

> only propagate metadata columns if they are used
> 
>
> Key: SPARK-41660
> URL: https://issues.apache.org/jira/browse/SPARK-41660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.3, 3.4.0
>
>







[jira] [Comment Edited] (SPARK-43366) Spark Driver Bind Address is off-by-one

2023-05-26 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724303#comment-17724303
 ] 

Sean R. Owen edited comment on SPARK-43366 at 5/26/23 8:04 PM:
---

-Was the original port in use? it'll try the next one then- EDIT: this makes no 
sense


was (Author: srowen):
Was the original port in use? it'll try the next one then

> Spark Driver Bind Address is off-by-one
> ---
>
> Key: SPARK-43366
> URL: https://issues.apache.org/jira/browse/SPARK-43366
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.3
>Reporter: Derek Brown
>Priority: Major
>
> I have the following environment variable set in my driver pod configuration:
> {code:java}
> SPARK_DRIVER_BIND_ADDRESS=10.244.0.53{code}
> However, I see an off-by-one IP address being referred to in the Spark logs:
> {code:java}
> 23/05/04 02:37:03 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.54:53140) 
> with ID 1,  ResourceProfileId 0
> 23/05/04 02:37:03 INFO BlockManagerMasterEndpoint: Registering block manager 
> 10.244.0.54:32805 with 413.9 MiB RAM, BlockManagerId(1, 10.244.0.54, 32805, 
> None){code}
> I am not sure why this might be the case.






[jira] [Commented] (SPARK-43366) Spark Driver Bind Address is off-by-one

2023-05-26 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726731#comment-17726731
 ] 

Sean R. Owen commented on SPARK-43366:
--

Ack yeah, nevermind. Reading too fast without coffee. That I don't know, except 
to say that's not going to be controlled by the _driver_ IP. Block manager 
would be tied to _executors_. Is that IP an executor?

> Spark Driver Bind Address is off-by-one
> ---
>
> Key: SPARK-43366
> URL: https://issues.apache.org/jira/browse/SPARK-43366
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.3
>Reporter: Derek Brown
>Priority: Major
>
> I have the following environment variable set in my driver pod configuration:
> {code:java}
> SPARK_DRIVER_BIND_ADDRESS=10.244.0.53{code}
> However, I see an off-by-one IP address being referred to in the Spark logs:
> {code:java}
> 23/05/04 02:37:03 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.54:53140) 
> with ID 1,  ResourceProfileId 0
> 23/05/04 02:37:03 INFO BlockManagerMasterEndpoint: Registering block manager 
> 10.244.0.54:32805 with 413.9 MiB RAM, BlockManagerId(1, 10.244.0.54, 32805, 
> None){code}
> I am not sure why this might be the case.






[jira] [Commented] (SPARK-43366) Spark Driver Bind Address is off-by-one

2023-05-26 Thread Derek Brown (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726727#comment-17726727
 ] 

Derek Brown commented on SPARK-43366:
-

[~srowen] the issue isn't with the port; the issue is with the IP address. The 
ports are both 32805.

> Spark Driver Bind Address is off-by-one
> ---
>
> Key: SPARK-43366
> URL: https://issues.apache.org/jira/browse/SPARK-43366
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.3.3
>Reporter: Derek Brown
>Priority: Major
>
> I have the following environment variable set in my driver pod configuration:
> {code:java}
> SPARK_DRIVER_BIND_ADDRESS=10.244.0.53{code}
> However, I see an off-by-one IP address being referred to in the Spark logs:
> {code:java}
> 23/05/04 02:37:03 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.54:53140) 
> with ID 1,  ResourceProfileId 0
> 23/05/04 02:37:03 INFO BlockManagerMasterEndpoint: Registering block manager 
> 10.244.0.54:32805 with 413.9 MiB RAM, BlockManagerId(1, 10.244.0.54, 32805, 
> None){code}
> I am not sure why this might be the case.






[jira] [Assigned] (SPARK-43766) Assign a name to the error class _LEGACY_ERROR_TEMP_2410

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43766:


Assignee: jiaan.geng

> Assign a name to the error class _LEGACY_ERROR_TEMP_2410
> 
>
> Key: SPARK-43766
> URL: https://issues.apache.org/jira/browse/SPARK-43766
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>







[jira] [Assigned] (SPARK-43765) Assign a name to the error class _LEGACY_ERROR_TEMP_2409

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43765:


Assignee: jiaan.geng

> Assign a name to the error class _LEGACY_ERROR_TEMP_2409
> 
>
> Key: SPARK-43765
> URL: https://issues.apache.org/jira/browse/SPARK-43765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>







[jira] [Resolved] (SPARK-43764) Assign a name to the error class _LEGACY_ERROR_TEMP_2408

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43764.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41290
[https://github.com/apache/spark/pull/41290]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2408
> 
>
> Key: SPARK-43764
> URL: https://issues.apache.org/jira/browse/SPARK-43764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43762) Assign a name to the error class _LEGACY_ERROR_TEMP_2406

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43762:


Assignee: jiaan.geng

> Assign a name to the error class _LEGACY_ERROR_TEMP_2406
> 
>
> Key: SPARK-43762
> URL: https://issues.apache.org/jira/browse/SPARK-43762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>







[jira] [Resolved] (SPARK-43762) Assign a name to the error class _LEGACY_ERROR_TEMP_2406

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43762.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41290
[https://github.com/apache/spark/pull/41290]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2406
> 
>
> Key: SPARK-43762
> URL: https://issues.apache.org/jira/browse/SPARK-43762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-43766) Assign a name to the error class _LEGACY_ERROR_TEMP_2410

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43766.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41290
[https://github.com/apache/spark/pull/41290]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2410
> 
>
> Key: SPARK-43766
> URL: https://issues.apache.org/jira/browse/SPARK-43766
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-43765) Assign a name to the error class _LEGACY_ERROR_TEMP_2409

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43765.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41290
[https://github.com/apache/spark/pull/41290]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2409
> 
>
> Key: SPARK-43765
> URL: https://issues.apache.org/jira/browse/SPARK-43765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43764) Assign a name to the error class _LEGACY_ERROR_TEMP_2408

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43764:


Assignee: jiaan.geng

> Assign a name to the error class _LEGACY_ERROR_TEMP_2408
> 
>
> Key: SPARK-43764
> URL: https://issues.apache.org/jira/browse/SPARK-43764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43763) Assign a name to the error class _LEGACY_ERROR_TEMP_2407

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43763.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41290
[https://github.com/apache/spark/pull/41290]

> Assign a name to the error class _LEGACY_ERROR_TEMP_2407
> 
>
> Key: SPARK-43763
> URL: https://issues.apache.org/jira/browse/SPARK-43763
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43763) Assign a name to the error class _LEGACY_ERROR_TEMP_2407

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43763:


Assignee: jiaan.geng

> Assign a name to the error class _LEGACY_ERROR_TEMP_2407
> 
>
> Key: SPARK-43763
> URL: https://issues.apache.org/jira/browse/SPARK-43763
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43818) Spark Glue job introduces duplicates while writing a dataframe as file to S3

2023-05-26 Thread Umesh Kant (Jira)
Umesh Kant created SPARK-43818:
--

 Summary: Spark Glue job introduces duplicates while writing a 
dataframe as file to S3
 Key: SPARK-43818
 URL: https://issues.apache.org/jira/browse/SPARK-43818
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
 Environment: Production
Reporter: Umesh Kant


We have an AWS Glue (Spark) based ETL framework which processes the data through 
multiple hops and finally writes the dataframe to an S3 bucket as Parquet files 
with Snappy compression. We have used this framework to process and write data 
to S3 for 1000+ tables/files and it works fine. But for two of the tables, the 
in-memory data frame contains the correct records, yet when the data frame gets 
persisted to S3 as files it introduces duplicate entries; and since the total 
count stays the same, the duplicates also cause missing records.

{*}Data Points{*}:
 # This happens only for large tables (wide tables with millions of rows)
 # When this happens we notice stage failures; the retries succeed but cause 
duplicate/missing records

{*}Code Steps{*}:
|Steps Information|Dataframe|Query / Operation /Action|
|Query Raw DB & get no of partition ( to  loop one by one)| |select distinct 
partition_0 FROM  .|
|Raw DF Query|raw|select SCHDWKID_REF, TASK_REF, LIFECYCLE_REF, TASK_DESC, 
WHOSE_ENT_NAME, WHOSE_INST_REF, WHOSE_INST_CDE, STENDDAT_STRTDT, 
STENDDAT_ENDDAT, AGENT_ENT_NAME, AGENT_INST_REF, AGENT_INST_CDE, AGENT_CODE, 
LOCATION_ENT_NAME, LOCATION_INST_REF, LOCATION_INST_CDE, CASEID_NUMBER, 
FACE_AMT, TAAR_AMT, AUTH_AMT, TRANSFER_YORN_ENCODE, TRANSFER_YORN_DECODE, 
TRANSFER_YORN_ELMREF, CASE_YORN_ENCODE, CASE_YORN_DECODE, CASE_YORN_ELMREF, 
CHANGEID_REF, CNTRCTID_REF, CNTRCTID_NUMBER, KTKDSCID_REF, KWNOFFID_REF, 
KWNOFFID_CODE, USERID_REF, USERID_CODE, WQUEUEID_REF, WQUEUEID_CODE, 
STATUS_REF, STATUS_CODE, STATUS_ASAT, LASTUPD_USER, LASTUPD_TERMNO, 
LASTUPD_PROG, LASTUPD_INFTIM, KWQPRIID_REF, KWQPRIID_CODE, INSURED_NAME, 
AGENT_NAME, EDM_INGESTED_AT, EDM_INGEST_TIME, PARTITION_0, DELTA_IND, 
TRANSACT_SEQ from RAW_ORACLE_ORAP12_NYLDPROD60CL.SCHEDULED_WORK where 
partition_0= '20230428'|
|Structured  DF Query|structured|SELECT * FROM 
RL_LAKE_ORACLE_ORAP12_NYLDPROD60CL.SCHEDULED_WORK WHERE part_num > 0 |
| | | |
|Merge DF Generated By joining raw & structured on nks|df_merge|df_merge = 
structured.join(raw,keys,how='fullouter')|
|action column will be added to
 Merge Df|df_merge|df_merge = df_merge.withColumn("action", 
fn.when((((df_merge['structured.EDH_RECORD_STATUS_IN'] == 'A') \| 
(df_merge['structured.EDH_RECORD_STATUS_IN'] == 'D')) & ( 
df_merge['raw.chksum'].isNull()) & (~ df_merge['structured.CHKSUM'].isNull())) 
, "NOACTION")
    
.when((df_merge['structured.CHKSUM'].isNull()) & (df_merge['raw.delta_ind']!= 
'D'), "INSERT")
    
.when((df_merge['structured.CHKSUM'] != df_merge['raw.chksum']) & (~ 
df_merge['structured.CHKSUM'].isNull()) & 
(df_merge['structured.EDH_RECORD_STATUS_IN'] == 'A') & 
((df_merge['raw.delta_ind'] == 'U') \| (df_merge['raw.delta_ind'] == 'I')), 
"UPDATE")
    
.when(((df_merge['raw.delta_ind']== 'D') & 
(df_merge['structured.EDH_RECORD_STATUS_IN'] == 'A')) , "DELETE")
    
.when(((df_merge['raw.delta_ind']== 'D') & 
(df_merge['structured.EDH_RECORD_STATUS_IN'] == 'D') ) , "DELETECOPY")
    
.when(((df_merge['raw.delta_ind']== 'I') & 
(df_merge['structured.EDH_RECORD_STATUS_IN'] == 'D') & (~ 
df_merge['raw.chksum'].isNull()) & (~ df_merge['structured.CHKSUM'].isNull())) 
, "DELETEREINSERT")
    
.when(((df_merge['raw.delta_ind']== 'D') & 
(df_merge['structured.CHKSUM'].isNull())) , "DELETEABSENT")
    
.when((df_merge['structured.CHKSUM'] == df_merge['raw.chksum']), "NOCHANGE"))|
| | | |
|No Action df will be derived from merge df|df_noaction|df_noaction = 
df_merge.select(keys + ['structured.' + x.upper() for x in structured_cols_list 
if x.upper() not in keys]).where((df_merge.action == 'NOACTION') \| 
(df_merge.action == 'NOCHANGE'))|
|Delete Copy DF will be derived|df_dcopy|df_dcopy = df_merge.select(keys + 
['structured.' + x.upper() for x in structured_cols_list if x.upper() not in 
keys]).where(df_merge.action == 'DELETECOPY')|
|Delete Absent df will be derived|df_dabs|df_dabs = df_merge.select(keys + 
['raw.' + x.upper() for x in raw_cols_list if x.upper() not in 
keys]).where(df_merge.action == 'DELETEABSENT')|
|insert df will be derived|df_insert|df_insert = df_merge.select(keys + ['raw.' 
+ x.upper() for x in raw_cols_list if x.upper() not in 
keys]).where(df_merge.action == 'INSERT')|
|Outdated Df will be derived , records from structured where we 

[jira] [Commented] (SPARK-43815) Add SQL functions to_varchar and extend to_char functionality

2023-05-26 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726704#comment-17726704
 ] 

Max Gekk commented on SPARK-43815:
--

[~ryu796] Let's focus on the first item, and create separate JIRAs for other 
items.

> Add SQL functions to_varchar and extend to_char functionality
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> We want to add support for the following functions:
>  * to_varchar() as an alias for to_char()
>  * Expand to_char() to take date, timestamp and binary expression as the 
> first argument. For date and timestamp expression, the function will be 
> equivalent to date_format(expr, fmt) . For binary expression, the function 
> will be equivalent to base64() , hex(), and decode(, 'UTF-8') for fmt 
> base64, hex, and UTF-8 respectively.
>  * timediff() as an alias for timestampdiff()
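
As a rough sketch of the proposed semantics (to_varchar and the extended to_char 
arguments are not in Spark yet), the stated equivalences can already be exercised 
with today's built-ins:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Proposed to_char(ts, fmt) on a timestamp would behave like date_format(ts, fmt).
spark.sql("SELECT date_format(timestamp'2023-05-26 10:00:00', 'yyyy-MM-dd')").show()

# Proposed to_char(bin, 'base64') / to_char(bin, 'hex') would behave like base64(bin) / hex(bin).
spark.sql("SELECT base64(cast('abc' AS binary)), hex(cast('abc' AS binary))").show()

# Proposed to_char(bin, 'UTF-8') would behave like decode(bin, 'UTF-8').
spark.sql("SELECT decode(cast('abc' AS binary), 'UTF-8')").show()
{code}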



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43808) Use `checkError()` to check `Exception` in `SQLViewTestSuite`

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43808.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41331
[https://github.com/apache/spark/pull/41331]

> Use `checkError()` to check `Exception` in `SQLViewTestSuite`
> -
>
> Key: SPARK-43808
> URL: https://issues.apache.org/jira/browse/SPARK-43808
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43808) Use `checkError()` to check `Exception` in `SQLViewTestSuite`

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43808:


Assignee: BingKun Pan

> Use `checkError()` to check `Exception` in `SQLViewTestSuite`
> -
>
> Key: SPARK-43808
> URL: https://issues.apache.org/jira/browse/SPARK-43808
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43817) Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas

2023-05-26 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-43817:
-

 Summary: Support UserDefinedType in createDataFrame from pandas 
DataFrame and toPandas
 Key: SPARK-43817
 URL: https://issues.apache.org/jira/browse/SPARK-43817
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43816) Spark Corrupts Data In-Transit for High Volume (> 20 TB/hr) of Data

2023-05-26 Thread Sai Allu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Allu updated SPARK-43816:
-
Description: 
h1. Bug Context

Hello! I would like to report a bug that my team noticed while we were using 
Spark (please see the Environment section to see our exact setup).

The application we built is meant to convert a large number of JSON files (JSON 
Lines format) and write them to a Delta table. The JSON files are located in an 
Azure Data Lake Gen 2 +without+ hierarchical namespacing. The Delta table is in 
an Azure Data Lake Gen 2 +with+ hierarchical namespacing.

We have a PySpark notebook in our Synapse Analytics workspace which reads the 
JSON files into a DataFrame and then writes them to the Delta table. It uses 
batch processing.

The JSON files have {+}no corrupt records{+}, we checked them thoroughly. And 
there are no code flaws in our PySpark notebook, we also checked that.

Our code reads 15 TB of JSON files (each file is about 400 MB in size) into our 
PySpark DataFrame in the following way.
{code:java}
originalDF = (
    spark.read
    .schema(originDataSchema)
    .option("pathGlobFilter", DESIRED_FILE_PATTERN)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "DiscoveredCorruptRecords")
    .option("badRecordsPath", BAD_RECORDS_PATH)
    .json(ORIGIN_FILES_PATH)
) {code}
To read this data and then write it to a Delta table takes about 37 minutes.

The problem that we noticed is that as the data is read into the PySpark 
DataFrame, a small percent of it becomes corrupted. Only about 1 in 10 million 
records become corrupted. This is just a made-up example to illustrate the 
point:
{code:java}
// The original JSON record looks like this
{ "Name": "Robert", "Email": "b...@gmail.com", "Nickname": "Bob" }

// When we look in the PySpark DataFrame we see this (for a small percent of 
records)
{ "Name": "Robertbob@", "Email": "gmail.com", "Nickname": "Bob" }{code}
 

Essentially, spark.read() has some deserialization problem that only 
emerges at high data throughput (> 20 TB/hr).

When we tried using a smaller dataset (1/4 the size), it didn't show any signs 
of corruption.

When we use the same exact code and then parse just one JSON file which 
contains the record mentioned above, everything works perfectly fine.

The spark.read() corruption is also not deterministic. If we re-run the 20 
TB/hr test, we still see corruption but in different records.

 
h1. Our Temporary Solution

What we noticed is that the "spark.sql.files.maxPartitionBytes" was by default 
set to 128 MB. This meant that for the average JSON files we were reading - 
which was 400 MB - Spark was making four calls to the Azure Data Lake and 
fetching a [byte 
range|https://learn.microsoft.com/en-us/rest/api/storageservices/get-file#:~:text=Range-,Optional.%20Returns%20file%20data%20only%20from%20the%20specified%20byte%20range.,-x%2Dms%2Drange]
 (i.e. the 1st call got bytes 0-128MB, the 2nd call got bytes 128MB-256MB, 
etc.).

We increased "spark.sql.files.maxPartitionBytes" to a large number (1 GB) and 
that made the data corruption problem go away.
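
A minimal sketch of that workaround as it could be expressed in a PySpark session 
(the 1 GB value mirrors the report; the right size for other workloads is an 
assumption to tune per file size):
{code:python}
from pyspark.sql import SparkSession

# Raise the split size so each ~400 MB JSON file is read as a single partition
# instead of four separate byte-range requests. "1g" mirrors the value used above.
spark = (
    SparkSession.builder
    .config("spark.sql.files.maxPartitionBytes", "1g")
    .getOrCreate()
)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
{code}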

 
h1. How We Think You Can Fix This

From my understanding, when Spark makes a call for a byte range, it will often 
"cut off" the data in the middle of a JSON record. Our JSON files are in the 
JSON Lines format and they contain thousands of lines, each with a JSON 
record. So calling a byte range from 0 - 128MB will most likely mean that the 
cutoff point is right in the middle of a JSON record.

Spark seems to have some code logic which handles this by only processing the 
"full lines" that are received. But this logic seems to be failing a small 
percent of the time. Specifically, we have about 50,000 JSON files, which means 
~200,000 byte range calls are being made, and spark.read() is creating about 
150 corrupt records.

So we think you should look at the Spark code which is doing this "cut off" 
handling for byte ranges and see if there's something missing there. Or 
something in the deserialization logic of spark.read().

Again, this bug only emerges for high volumes of data transfer (> 20 TB/hr). 
This could be a "race condition" or some kind of performance-related bug.

  was:
h1. Bug Context

Hello! I would like to report a bug that my team noticed while we were using 
Spark (please see the Environment section to see our exact setup).

The application we built is meant to convert a large number of JSON files (JSON 
Lines format) and write them to a Delta table. The JSON files are located in an 
Azure Data Lake Gen 2 +without+ hierarchical namespacing. The Delta table is in 
an Azure Data Lake Gen 2 +with+ hierarchical namespacing.

We have a PySpark notebook in our Synapse Analytics workspace which reads the 
JSON files into a DataFrame and then writes them to the Delta table. It uses 
batch processing.

The JSON files have {+}no corrupt re

[jira] [Created] (SPARK-43816) Spark Corrupts Data In-Transit for High Volume (> 20 TB/hr) of Data

2023-05-26 Thread Sai Allu (Jira)
Sai Allu created SPARK-43816:


 Summary: Spark Corrupts Data In-Transit for High Volume (> 20 
TB/hr) of Data
 Key: SPARK-43816
 URL: https://issues.apache.org/jira/browse/SPARK-43816
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.1
 Environment: We are using Azure Synapse Analytics. Within that, we 
have provisioned a Spark Pool with 101 nodes. 100 nodes are used for the 
executors and 1 node is for the driver. Each node is what Synapse Analytics 
calls a "Memory Optimized Medium size node". This means each node has 8 vCores 
and 64 GB memory. The Spark Pool does not do dynamic allocation of executors 
(101 nodes are created at the start and present throughout the Spark job). 
Synapse has something called "Intelligent Cache," but we disabled it (set to 
0%). The nodes all use Spark 3.3.1.5.2-90111858. If you need details on any 
specific Spark settings, I can get that for you. Mostly we are just using the 
defaults.
Reporter: Sai Allu


h1. Bug Context

Hello! I would like to report a bug that my team noticed while we were using 
Spark (please see the Environment section to see our exact setup).

The application we built is meant to convert a large number of JSON files (JSON 
Lines format) and write them to a Delta table. The JSON files are located in an 
Azure Data Lake Gen 2 +without+ hierarchical namespacing. The Delta table is in 
an Azure Data Lake Gen 2 +with+ hierarchical namespacing.

We have a PySpark notebook in our Synapse Analytics workspace which reads the 
JSON files into a DataFrame and then writes them to the Delta table. It uses 
batch processing.

The JSON files have {+}no corrupt records{+}, we checked them thoroughly. And 
there are no code flaws in our PySpark notebook, we also checked that.

Our code reads 15 TB of JSON files (each file is about 400 MB in size) into our 
PySpark DataFrame in the following way.
{code:java}
originalDF = (
    spark.read
    .schema(originDataSchema)
    .option("pathGlobFilter", DESIRED_FILE_PATTERN)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "DiscoveredCorruptRecords")
    .option("badRecordsPath", BAD_RECORDS_PATH)
    .json(ORIGIN_FILES_PATH)
) {code}
To read this data and then write it to a Delta table takes about 37 minutes.

The problem that we noticed is that as the data is read into the PySpark 
DataFrame, a small percent of it becomes corrupted. Only about 1 in 10 million 
records become corrupted. This is just a made-up example to illustrate the 
point:
{code:java}
// The original JSON record looks like this
{ "Name": "Robert", "Email": "b...@gmail.com", "Nickname": "Bob" }

// When we look in the PySpark DataFrame we see this (for a small percent of 
records)
{ "Name": "Robertbob@", "Email": "gmail.com", "Nickname": "Bob" }{code}
 

Essentially, spark.read() has some deserialization problem that only 
emerges at high data throughput (> 20 TB/hr).

When we tried using a smaller dataset (1/4 the size), it didn't show any signs 
of corruption.

When we use the same exact code and then parse just one JSON file which 
contains the record mentioned above, everything works perfectly fine.

The spark.read() corruption is also not deterministic. If we re-run the 20 
TB/hr test, we still see corruption but in different records.

 
h1. Our Temporary Solution

What we noticed is that the "spark.sql.files.maxPartitionBytes" was by default 
set to 128 MB. This meant that for the average JSON files we were reading - 
which was 400 MB - Spark was making four calls to the Azure Data Lake and 
fetching a [byte 
range|https://learn.microsoft.com/en-us/rest/api/storageservices/get-file#:~:text=Range-,Optional.%20Returns%20file%20data%20only%20from%20the%20specified%20byte%20range.,-x%2Dms%2Drange]
 (i.e. the 1st call got bytes 0-128MB, the 2nd call got bytes 128MB-256MB, 
etc.).

We increased "spark.sql.files.maxPartitionBytes" to a large number (1 GB) and 
that made the data corruption problem go away.

 
h1. How We Think You Can Fix This

From my understanding, when Spark makes a call for a byte range, it will often 
"cut off" the data in the middle of a JSON record. Our JSON files are in the 
JSON Lines format and they contain thousands of lines, each with a JSON 
record. So calling a byte range from 0 - 128MB will most likely mean that the 
cutoff point is right in the middle of a JSON record.

Spark seems to have some code logic which handles this by only processing the 
"full lines" that are received. But this logic seems to be failing a small 
percent of the time. Specifically, we have about 50,000 JSON files, which means 
~200,000 byte range calls are being made, and spark.read() is creating about 
150 corrupt records.

So we think you should look at the Spark code which is doing this "cut off" 
handling for byte ranges and see if there's something missing t

[jira] [Resolved] (SPARK-43802) unbase64 and unhex codegen are invalid with failOnError

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43802.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41317
[https://github.com/apache/spark/pull/41317]

> unbase64 and unhex codegen are invalid with failOnError
> ---
>
> Key: SPARK-43802
> URL: https://issues.apache.org/jira/browse/SPARK-43802
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.5.0
>
>
> to_binary with hex and base64 generate invalid codegen:
> {{spark.range(5).selectExpr('to_binary(base64(cast(id as binary)), 
> "BASE64")').show()}}
> results in
> {{Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 47, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 47, Column 1: Unknown variable or type "BASE64"}}
> because this is the generated code:
> /* 107 */         if 
> (!org.apache.spark.sql.catalyst.expressions.UnBase64.isValidBase64(project_value_1))
>  {
> /* 108 */           throw QueryExecutionErrors.invalidInputInConversionError(
> /* 109 */             ((org.apache.spark.sql.types.BinaryType$) references[1] 
> /* to */),
> /* 110 */             project_value_1,
> /* 111 */             BASE64,
> /* 112 */             "try_to_binary");
> /* 113 */         }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43802) unbase64 and unhex codegen are invalid with failOnError

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43802:


Assignee: Adam Binford

> unbase64 and unhex codegen are invalid with failOnError
> ---
>
> Key: SPARK-43802
> URL: https://issues.apache.org/jira/browse/SPARK-43802
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
>
> to_binary with hex and base64 generate invalid codegen:
> {{spark.range(5).selectExpr('to_binary(base64(cast(id as binary)), 
> "BASE64")').show()}}
> results in
> {{Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 47, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 47, Column 1: Unknown variable or type "BASE64"}}
> because this is the generated code:
> /* 107 */         if 
> (!org.apache.spark.sql.catalyst.expressions.UnBase64.isValidBase64(project_value_1))
>  {
> /* 108 */           throw QueryExecutionErrors.invalidInputInConversionError(
> /* 109 */             ((org.apache.spark.sql.types.BinaryType$) references[1] 
> /* to */),
> /* 110 */             project_value_1,
> /* 111 */             BASE64,
> /* 112 */             "try_to_binary");
> /* 113 */         }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43801) Support unwrap date type to string type in UnwrapCastInBinaryComparison

2023-05-26 Thread Pucheng Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726671#comment-17726671
 ] 

Pucheng Yang commented on SPARK-43801:
--

created PR https://github.com/apache/spark/pull/41332

> Support unwrap date type to string type in UnwrapCastInBinaryComparison
> ---
>
> Key: SPARK-43801
> URL: https://issues.apache.org/jira/browse/SPARK-43801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Pucheng Yang
>Priority: Major
>
> Similar to https://issues.apache.org/jira/browse/SPARK-42597 and others, add 
> support to 
> UnwrapCastInBinaryComparison such that it can unwrap date type to string type.
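
For illustration only (not taken from the ticket), this is the shape of predicate 
such a rule would target; today the cast is applied to every row of the string 
column before the comparison:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A string column compared against a date literal through a cast; the proposed
# unwrap would push the comparison onto the string column itself.
df = spark.createDataFrame([("2023-01-01",), ("2023-06-01",)], ["dt"])
df.filter("cast(dt AS date) >= date'2023-05-01'").explain()
{code}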



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2

2023-05-26 Thread Nicolas PHUNG (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas PHUNG resolved SPARK-43188.
---
Resolution: Workaround

> Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
> --
>
> Key: SPARK-43188
> URL: https://issues.apache.org/jira/browse/SPARK-43188
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Nicolas PHUNG
>Priority: Major
>
> Hello,
> I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake 
> Storage Gen2 (abfs/abfss scheme). I've got the following errors:
> {code:java}
> warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR 
> FileFormatWriter: Aborting job 
> 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent 
> failure: Lost task 1.0 in stage 0.0 (TID 1) (myhost executor driver): 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for datablock-0001-    at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462)
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165)
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>     at 
> org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980)
>     at 
> org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960)
>     at 
> org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262)
>     at 
> org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173)
>     at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580)
>     at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)    at 
> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)    at 
> org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490)
>     at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
>     at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389)
>     at 
> org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)    
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)    
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)    at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:328)    at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)    at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)    
> at org.apache.spark.scheduler.Task.run(Task.scala:139)    at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)    
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)    
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:    at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
>     at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
>     at 
> or

[jira] [Commented] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2

2023-05-26 Thread Nicolas PHUNG (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726667#comment-17726667
 ] 

Nicolas PHUNG commented on SPARK-43188:
---

Hello [~srowen], I don't think so, but I managed to get it working thanks to 
HADOOP-18707. It was a new default configuration in hadoop-azure that no longer 
worked for me on a local Windows setup.
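
The comment does not name the property, so the sketch below is an assumption to 
verify against HADOOP-18707: the stack trace above points at the ABFS output 
stream buffering its blocks on local disk, and switching that buffering mode is 
one plausible workaround on a local Windows setup.
{code:python}
from pyspark.sql import SparkSession

# Assumed knob (verify against HADOOP-18707 and the hadoop-azure docs): buffer ABFS
# output blocks in memory instead of on local disk, which avoids the
# "Could not find any valid local directory for datablock" failure seen above.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.azure.data.blocks.buffer", "bytebuffer")
    .getOrCreate()
)
{code}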

> Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
> --
>
> Key: SPARK-43188
> URL: https://issues.apache.org/jira/browse/SPARK-43188
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Nicolas PHUNG
>Priority: Major
>
> Hello,
> I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake 
> Storage Gen2 (abfs/abfss scheme). I've got the following errors:
> {code:java}
> warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR 
> FileFormatWriter: Aborting job 
> 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent 
> failure: Lost task 1.0 in stage 0.0 (TID 1) (myhost executor driver): 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for datablock-0001-    at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462)
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165)
>     at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>     at 
> org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980)
>     at 
> org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960)
>     at 
> org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262)
>     at 
> org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173)
>     at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580)
>     at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)    at 
> org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)    at 
> org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347)
>     at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490)
>     at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
>     at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389)
>     at 
> org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)    
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)    
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)    at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:328)    at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)    at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)    
> at org.apache.spark.scheduler.Task.run(Task.scala:139)    at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)    
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)    
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Driver

[jira] [Updated] (SPARK-43815) Add SQL functions to_varchar and extend to_char functionality

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Description: 
We want to add support for the following functions:
 * to_varchar() as an alias for to_char()

 * Expand to_char() to take date, timestamp and binary expression as the first 
argument. For date and timestamp expression, the function will be equivalent to 
date_format(expr, fmt) . For binary expression, the function will be equivalent 
to base64() , hex(), and decode(, 'UTF-8') for fmt base64, hex, and UTF-8 
respectively.

 * timediff() as an alias for timestampdiff()

  was:
Today, users who have SQL engines which support  ```to_varchar```  need to 
change such function invocations to ```to_char``` when migrating to Apache 
Spark. To help minimize the amount of changes which users need to make, we 
introduce a ```to_varchar``` function alias for ```to_char```. Additionally, we 
extend ```to_char()``` such that when the first argument of the function is:
 * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
format)```
 * base64: equivalent to ```base64()```
 * hex: equivalent to ```hex()```
 * UTF-8: equivalent to ```decode(, 'UTF-8')```

Additionally, we add support for the ```timediff``` alias for 
```timestampdiff```.


> Add SQL functions to_varchar and extend to_char functionality
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> We want to add support for the following functions:
>  * to_varchar() as an alias for to_char()
>  * Expand to_char() to take date, timestamp and binary expression as the 
> first argument. For date and timestamp expression, the function will be 
> equivalent to date_format(expr, fmt) . For binary expression, the function 
> will be equivalent to base64() , hex(), and decode(, 'UTF-8') for fmt 
> base64, hex, and UTF-8 respectively.
>  * timediff() as an alias for timestampdiff()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43815) Add SQL functions to_varchar and extend to_char functionality

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Description: 
Today, users who have SQL engines which support \{code:java} to_varchar \{code} 
 need to change such function invocations to ```to_char``` when migrating to 
Apache Spark. To help minimize the amount of changes which users need to make, 
we introduce a ```to_varchar``` function alias for ```to_char```. Additionally, 
we extend ```to_char()``` such that when the first argument of the function is:
 * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
format)```
 * base64: equivalent to ```base64()```
 * hex: equivalent to ```hex()```
 * UTF-8: equivalent to ```decode(, 'UTF-8')```

Additionally, we add support for the ```timediff``` alias for 
```timestampdiff```.

  was:
Today, users who have SQL engines which support ```to_varchar```  need to 
change such function invocations to ```to_char``` when migrating to Apache 
Spark. To help minimize the amount of changes which users need to make, we 
introduce a ```to_varchar``` function alias for ```to_char```. Additionally, we 
extend ```to_char()``` such that when the first argument of the function is:
 * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
format)```
 * base64: equivalent to ```base64()```
 * hex: equivalent to ```hex()```
 * UTF-8: equivalent to ```decode(, 'UTF-8')```

Additionally, we add support for the ```timediff``` alias for 
```timestampdiff```.


> Add SQL functions to_varchar and extend to_char functionality
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> Today, users who have SQL engines which support \{code:java} to_varchar 
> \{code}  need to change such function invocations to ```to_char``` when 
> migrating to Apache Spark. To help minimize the amount of changes which users 
> need to make, we introduce a ```to_varchar``` function alias for 
> ```to_char```. Additionally, we extend ```to_char()``` such that when the 
> first argument of the function is:
>  * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
> format)```
>  * base64: equivalent to ```base64()```
>  * hex: equivalent to ```hex()```
>  * UTF-8: equivalent to ```decode(, 'UTF-8')```
> Additionally, we add support for the ```timediff``` alias for 
> ```timestampdiff```.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43815) Add SQL functions to_varchar and extend to_char functionality

2023-05-26 Thread Richard Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Yu updated SPARK-43815:
---
Description: 
Today, users who have SQL engines which support  ```to_varchar```  need to 
change such function invocations to ```to_char``` when migrating to Apache 
Spark. To help minimize the amount of changes which users need to make, we 
introduce a ```to_varchar``` function alias for ```to_char```. Additionally, we 
extend ```to_char()``` such that when the first argument of the function is:
 * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
format)```
 * base64: equivalent to ```base64()```
 * hex: equivalent to ```hex()```
 * UTF-8: equivalent to ```decode(, 'UTF-8')```

Additionally, we add support for the ```timediff``` alias for 
```timestampdiff```.

  was:
Today, users who have SQL engines which support \{code:java} to_varchar \{code} 
 need to change such function invocations to ```to_char``` when migrating to 
Apache Spark. To help minimize the amount of changes which users need to make, 
we introduce a ```to_varchar``` function alias for ```to_char```. Additionally, 
we extend ```to_char()``` such that when the first argument of the function is:
 * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
format)```
 * base64: equivalent to ```base64()```
 * hex: equivalent to ```hex()```
 * UTF-8: equivalent to ```decode(, 'UTF-8')```

Additionally, we add support for the ```timediff``` alias for 
```timestampdiff```.


> Add SQL functions to_varchar and extend to_char functionality
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Priority: Major
>
> Today, users who have SQL engines which support  ```to_varchar```  need to 
> change such function invocations to ```to_char``` when migrating to Apache 
> Spark. To help minimize the amount of changes which users need to make, we 
> introduce a ```to_varchar``` function alias for ```to_char```. Additionally, 
> we extend ```to_char()``` such that when the first argument of the function 
> is:
>  * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
> format)```
>  * base64: equivalent to ```base64()```
>  * hex: equivalent to ```hex()```
>  * UTF-8: equivalent to ```decode(, 'UTF-8')```
> Additionally, we add support for the ```timediff``` alias for 
> ```timestampdiff```.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43815) Add SQL functions to_varchar and extend to_char functionality

2023-05-26 Thread Richard Yu (Jira)
Richard Yu created SPARK-43815:
--

 Summary: Add SQL functions to_varchar and extend to_char 
functionality
 Key: SPARK-43815
 URL: https://issues.apache.org/jira/browse/SPARK-43815
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.1
Reporter: Richard Yu


Today, users who have SQL engines which support ```to_varchar```  need to 
change such function invocations to ```to_char``` when migrating to Apache 
Spark. To help minimize the amount of changes which users need to make, we 
introduce a ```to_varchar``` function alias for ```to_char```. Additionally, we 
extend ```to_char()``` such that when the first argument of the function is:
 * date or timestamp: ```to_char``` is equivalent to ```date_format(expr, 
format)```
 * base64: equivalent to ```base64()```
 * hex: equivalent to ```hex()```
 * UTF-8: equivalent to ```decode(, 'UTF-8')```

Additionally, we add support for the ```timediff``` alias for 
```timestampdiff```.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43794) Assign a name to the error class _LEGACY_ERROR_TEMP_1335

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43794:


Assignee: BingKun Pan

> Assign a name to the error class _LEGACY_ERROR_TEMP_1335
> 
>
> Key: SPARK-43794
> URL: https://issues.apache.org/jira/browse/SPARK-43794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43794) Assign a name to the error class _LEGACY_ERROR_TEMP_1335

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43794.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41314
[https://github.com/apache/spark/pull/41314]

> Assign a name to the error class _LEGACY_ERROR_TEMP_1335
> 
>
> Key: SPARK-43794
> URL: https://issues.apache.org/jira/browse/SPARK-43794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43805) Support SELECT * EXCEPT AND SELECT * REPLACE

2023-05-26 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726627#comment-17726627
 ] 

Jia Fan commented on SPARK-43805:
-

To tell the truth, I don't know whether Spark will accept this statement; it 
doesn't look like standard SQL. cc [~cloud_fan] [~dongjoon] 

> Support SELECT * EXCEPT AND  SELECT * REPLACE
> -
>
> Key: SPARK-43805
> URL: https://issues.apache.org/jira/browse/SPARK-43805
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_except]
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace
> [~fanjia] 
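
For context, the requested BigQuery-style clauses roughly map onto operations the 
DataFrame API already offers; a sketch of the intended meaning (not the proposed 
SQL syntax itself):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a", 10.0)], ["id", "name", "price"])

# SELECT * EXCEPT (price) FROM tbl  ~  star expansion minus one column
df.drop("price").show()

# SELECT * REPLACE (price * 1.1 AS price) FROM tbl  ~  star expansion with one column overwritten
df.withColumn("price", F.col("price") * 1.1).show()
{code}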



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43814) Spark cannot use the df.collect() result to construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-26 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43814:
---
Description: 
 

When using the df.collect() result to construct the DecimalType in 
CatalystTypeConverters.convertToCatalyst(), the following error is thrown:

 
{code:java}
Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).
        at 
org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
        at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:48)
        at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
        at 
org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
 

 

This issue can be reproduced by the following case:

 
{code:java}
  val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
  val schema = StructType(
    StructField("a", IntegerType, nullable = true) :: Nil)
  val empData = Seq(Row(1))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
schema)
  val resultDF = df.select(Column(expression))
  val result = resultDF.collect().head.get(0)
  CatalystTypeConverters.convertToCatalyst(result)

{code}
 

It seems that the reason for the failure is that the value of precision is not 
set when the Decimal.toJavaBigDecimal() method is called. However, Java 
BigDecimal only provides an interface for modifying scale and does not provide 
an interface for modifying precision.

 

  was:
When using the df.collect() result to construct the DecimalType in 

Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).


> Spark cannot use the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst() API
> ---
>
> Key: SPARK-43814
> URL: https://issues.apache.org/jira/browse/SPARK-43814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
>  
> When using the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst(), the following error is thrown:
>  
> {code:java}
> Decimal scale (18) cannot be greater than precision (1).
> org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
> than precision (1).
>         at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
>         at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:48)
>         at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
>         at 
> org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
>  
>  
> This issue can be reproduced by the following case:
>  
> {code:java}
>   val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
>   val schema = StructType(
>     StructField("a", IntegerType, nullable = true) :: Nil)
>   val empData = Seq(Row(1))
>   val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
> schema)
>   val resultDF = df.select(Column(expression))
>   val result = resultDF.collect().head.get(0)
>   CatalystTypeConverters.convertToCatalyst(result)
> {code}
>  
> It seems that the reason for the failure is that the value of precision is 
> not set when the Decimal.toJavaBigDecimal() method is called. However, Java 
> BigDecimal only provides an interface for modifying scale and does not 
> provide an interface for modifying precision.
>  
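
A pure-Python analogue of that mismatch (an illustration, not Spark code): the 
collected value behind the error above is a zero with 18 decimal places, which 
carries only one significant digit, so rebuilding DecimalType from its precision 
and scale yields precision (1) < scale (18).
{code:python}
from decimal import Decimal

d = Decimal("0E-18")                  # analogue of the collected java.math.BigDecimal
precision = len(d.as_tuple().digits)  # 1 significant digit
scale = -d.as_tuple().exponent        # 18 decimal places
print(precision, scale)               # 1 18 -> DecimalType(1, 18) is rejected
{code}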



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43765) Assign a name to the error class _LEGACY_ERROR_TEMP_2409

2023-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726549#comment-17726549
 ] 

ASF GitHub Bot commented on SPARK-43765:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41290

> Assign a name to the error class _LEGACY_ERROR_TEMP_2409
> 
>
> Key: SPARK-43765
> URL: https://issues.apache.org/jira/browse/SPARK-43765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43764) Assign a name to the error class _LEGACY_ERROR_TEMP_2408

2023-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726548#comment-17726548
 ] 

ASF GitHub Bot commented on SPARK-43764:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41290

> Assign a name to the error class _LEGACY_ERROR_TEMP_2408
> 
>
> Key: SPARK-43764
> URL: https://issues.apache.org/jira/browse/SPARK-43764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43763) Assign a name to the error class _LEGACY_ERROR_TEMP_2407

2023-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726546#comment-17726546
 ] 

ASF GitHub Bot commented on SPARK-43763:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41290

> Assign a name to the error class _LEGACY_ERROR_TEMP_2407
> 
>
> Key: SPARK-43763
> URL: https://issues.apache.org/jira/browse/SPARK-43763
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43763) Assign a name to the error class _LEGACY_ERROR_TEMP_2407

2023-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726544#comment-17726544
 ] 

ASF GitHub Bot commented on SPARK-43763:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41290

> Assign a name to the error class _LEGACY_ERROR_TEMP_2407
> 
>
> Key: SPARK-43763
> URL: https://issues.apache.org/jira/browse/SPARK-43763
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43762) Assign a name to the error class _LEGACY_ERROR_TEMP_2406

2023-05-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726543#comment-17726543
 ] 

ASF GitHub Bot commented on SPARK-43762:


User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41290

> Assign a name to the error class _LEGACY_ERROR_TEMP_2406
> 
>
> Key: SPARK-43762
> URL: https://issues.apache.org/jira/browse/SPARK-43762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43814) Spark cannot use the df.collect() result to construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-26 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43814:
---
Description: 
When using the df.collect() result to construct the DecimalType in 

Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).

> Spark cannot use the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst() API
> ---
>
> Key: SPARK-43814
> URL: https://issues.apache.org/jira/browse/SPARK-43814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
> When using the df.collect() result to construct the DecimalType in the 
> CatalystTypeConverters.convertToCatalyst() API, the following error is thrown:
> org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
> than precision (1).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43814) Spark cannot use the df.collect() result to construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-26 Thread Ke Jia (Jira)
Ke Jia created SPARK-43814:
--

 Summary: Spark cannot use the df.collect() result to construct the 
DecimalType in CatalystTypeConverters.convertToCatalyst() API
 Key: SPARK-43814
 URL: https://issues.apache.org/jira/browse/SPARK-43814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.2.2
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43813) Enable CategoricalTests.test_groupby_apply_without_shortcut for pandas 2.0.0.

2023-05-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43813:
---

 Summary: Enable 
CategoricalTests.test_groupby_apply_without_shortcut for pandas 2.0.0.
 Key: SPARK-43813
 URL: https://issues.apache.org/jira/browse/SPARK-43813
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43812) Enable DataFrameTests.test_all for pandas 2.0.0.

2023-05-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43812:
---

 Summary: Enable DataFrameTests.test_all for pandas 2.0.0.
 Key: SPARK-43812
 URL: https://issues.apache.org/jira/browse/SPARK-43812
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43811) Enable DataFrameTests.test_reindex for pandas 2.0.0.

2023-05-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43811:
---

 Summary: Enable DataFrameTests.test_reindex for pandas 2.0.0.
 Key: SPARK-43811
 URL: https://issues.apache.org/jira/browse/SPARK-43811
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43809) Enable DataFrameSlowTests.test_cov for pandas 2.0.0.

2023-05-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43809:
---

 Summary: Enable DataFrameSlowTests.test_cov for pandas 2.0.0.
 Key: SPARK-43809
 URL: https://issues.apache.org/jira/browse/SPARK-43809
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43810) Enable DataFrameSlowTests.test_quantile for pandas 2.0.0.

2023-05-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43810:
---

 Summary: Enable DataFrameSlowTests.test_quantile for pandas 2.0.0.
 Key: SPARK-43810
 URL: https://issues.apache.org/jira/browse/SPARK-43810
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43807) Migrate _LEGACY_ERROR_TEMP_1269 to PARTITION_SCHEMA_IS_EMPTY

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43807.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41325
[https://github.com/apache/spark/pull/41325]

> Migrate _LEGACY_ERROR_TEMP_1269 to PARTITION_SCHEMA_IS_EMPTY
> 
>
> Key: SPARK-43807
> URL: https://issues.apache.org/jira/browse/SPARK-43807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43807) Migrate _LEGACY_ERROR_TEMP_1269 to PARTITION_SCHEMA_IS_EMPTY

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43807:


Assignee: jiaan.geng

> Migrate _LEGACY_ERROR_TEMP_1269 to PARTITION_SCHEMA_IS_EMPTY
> 
>
> Key: SPARK-43807
> URL: https://issues.apache.org/jira/browse/SPARK-43807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43576) Remove unused declarations from Core module

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43576.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41218
[https://github.com/apache/spark/pull/41218]

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>
> There are some unused declarations in the `core` module; we should remove 
> them to keep the code clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43576) Remove unused declarations from Core module

2023-05-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43576:


Assignee: BingKun Pan

> Remove unused declarations from Core module
> ---
>
> Key: SPARK-43576
> URL: https://issues.apache.org/jira/browse/SPARK-43576
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> There are some unused declarations in the `core` module; we should remove 
> them to keep the code clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org