[jira] [Created] (SPARK-48470) Rename listState functions for single and list type appends
Anish Shrigondekar created SPARK-48470:
--

Summary: Rename listState functions for single and list type appends
Key: SPARK-48470
URL: https://issues.apache.org/jira/browse/SPARK-48470
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar

Rename listState functions for single and list type appends

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45959) SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-45959:
-
Priority: Major (was: Minor)

> SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> -
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Asif
> Priority: Major
> Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single shot, in reality it is difficult to expect customers to modify their code. In Spark 2 the analyzer rules did not do deep tree traversal, and in Spark 3 the plans are additionally cloned before being handed to the analyzer, optimizer, etc., which was not the case in Spark 2.
> Together these changes have increased query time from 5 min to 2-3 hrs.
> Many times the columns are added to the plan via some for-loop logic that just keeps adding new computation based on some rule.
> So, my suggestion is to collapse the Projects early, once analysis of the logical plan is done, but before the plan gets assigned to the field variable in QueryExecution.
> The PR for the above is ready for review.
> The major change is in the way the lookup is performed in CacheManager.
> I have described the logic in the PR and have added multiple tests.
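The growth pattern described above can be sketched in a few lines of pure Python. This is a toy model, not Spark's actual `LogicalPlan`/`Project` classes: each `withColumn`-style call wraps the previous plan in one more Project node, so a loop of N calls builds an N-deep tree that every subsequent rule traversal must walk, while collapsing adjacent Projects keeps the tree shallow.

```python
# Toy plan tree (hypothetical simplification, not Spark internals).
class Project:
    def __init__(self, columns, child=None):
        self.columns = list(columns)
        self.child = child  # None stands in for the base relation

def with_column(plan, name):
    # One new Project per added column, as in a withColumn loop.
    base = plan.columns if isinstance(plan, Project) else []
    return Project(base + [name], plan)

def depth(plan):
    d = 0
    while isinstance(plan, Project):
        d, plan = d + 1, plan.child
    return d

def collapse_projects(plan):
    # Merge a Project sitting directly on another Project into one node.
    while isinstance(plan, Project) and isinstance(plan.child, Project):
        plan = Project(plan.columns, plan.child.child)
    return plan

plan = None
for i in range(100):
    plan = with_column(plan, f"c{i}")
assert depth(plan) == 100                    # deep tree: every rule pays for it
assert depth(collapse_projects(plan)) == 1   # collapsed early, stays flat
```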
[jira] [Resolved] (SPARK-48464) Refactor SQLConfSuite and StatisticsSuite
[ https://issues.apache.org/jira/browse/SPARK-48464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48464.
--
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46796
[https://github.com/apache/spark/pull/46796]

> Refactor SQLConfSuite and StatisticsSuite
> -
>
> Key: SPARK-48464
> URL: https://issues.apache.org/jira/browse/SPARK-48464
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-48468) Add LogicalQueryStage interface in catalyst
[ https://issues.apache.org/jira/browse/SPARK-48468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48468:
---
Labels: pull-request-available (was: )

> Add LogicalQueryStage interface in catalyst
> ---
>
> Key: SPARK-48468
> URL: https://issues.apache.org/jira/browse/SPARK-48468
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Ziqi Liu
> Priority: Major
> Labels: pull-request-available
>
> Add `LogicalQueryStage` interface in catalyst so that it's visible in logical rules
[jira] [Created] (SPARK-48468) Add LogicalQueryStage interface in catalyst
Ziqi Liu created SPARK-48468:
--

Summary: Add LogicalQueryStage interface in catalyst
Key: SPARK-48468
URL: https://issues.apache.org/jira/browse/SPARK-48468
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Ziqi Liu

Add `LogicalQueryStage` interface in catalyst so that it's visible in logical rules
[jira] [Resolved] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48454.
--
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46785
[https://github.com/apache/spark/pull/46785]

> Directly use the parent dataframe class
> ---
>
> Key: SPARK-48454
> URL: https://issues.apache.org/jira/browse/SPARK-48454
> Project: Spark
> Issue Type: Improvement
> Components: PS
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-48454:
--
Assignee: Ruifeng Zheng

> Directly use the parent dataframe class
> ---
>
> Key: SPARK-48454
> URL: https://issues.apache.org/jira/browse/SPARK-48454
> Project: Spark
> Issue Type: Improvement
> Components: PS
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-48467) Upgrade Maven to 3.9.7
[ https://issues.apache.org/jira/browse/SPARK-48467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48467:
---
Labels: pull-request-available (was: )

> Upgrade Maven to 3.9.7
> --
>
> Key: SPARK-48467
> URL: https://issues.apache.org/jira/browse/SPARK-48467
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-48442) Add parenthesis to awaitTermination call
[ https://issues.apache.org/jira/browse/SPARK-48442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48442.
--
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46779
[https://github.com/apache/spark/pull/46779]

> Add parenthesis to awaitTermination call
> 
>
> Key: SPARK-48442
> URL: https://issues.apache.org/jira/browse/SPARK-48442
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.3
> Reporter: Riya Verma
> Assignee: Riya Verma
> Priority: Trivial
> Labels: correctness, pull-request-available, starter
> Fix For: 4.0.0
>
> In {{test_stream_reader}} and {{test_stream_writer}} of {*}test_python_streaming_datasource.py{*}, the expression {{q.awaitTermination}} does not invoke a function call as intended, but instead returns a Python function object. The fix is to change this to {{q.awaitTermination()}}.
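The bug class behind this fix is easy to demonstrate in plain Python, with a stand-in class rather than a real streaming query: referencing a method without parentheses yields the bound method object, so the wait never happens and the test silently passes.

```python
# Minimal stand-in for a streaming query handle (not the PySpark class).
class Query:
    def __init__(self):
        self.terminated = False

    def awaitTermination(self, timeout=None):
        self.terminated = True
        return True

q = Query()

result = q.awaitTermination      # missing (): no call is made
assert callable(result)          # we got the bound method back...
assert q.terminated is False     # ...and the query was never awaited

result = q.awaitTermination()    # the intended call
assert result is True
assert q.terminated is True
```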
[jira] [Created] (SPARK-48467) Upgrade Maven to 3.9.7
BingKun Pan created SPARK-48467:
---

Summary: Upgrade Maven to 3.9.7
Key: SPARK-48467
URL: https://issues.apache.org/jira/browse/SPARK-48467
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan
[jira] [Created] (SPARK-48465) Avoid no-op empty relation propagation in AQE
Ziqi Liu created SPARK-48465:
--

Summary: Avoid no-op empty relation propagation in AQE
Key: SPARK-48465
URL: https://issues.apache.org/jira/browse/SPARK-48465
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Ziqi Liu

We should avoid no-op empty relation propagation in AQE: if we convert an empty QueryStageExec to an empty relation, it will be further wrapped into a new query stage and executed, producing an empty result and triggering empty relation propagation again. This issue is currently not exposed because AQE will try to reuse the shuffle.
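The invariant the ticket asks for can be sketched abstractly: the empty-relation rewrite should be idempotent, so applying it to an already-converted node is a no-op rather than the start of another execute/propagate cycle. This is a pure-Python toy with made-up node tuples, not Spark's AQE code.

```python
# Hypothetical plan nodes: ("QueryStage", row_count) and ("EmptyRelation",).
def propagate_empty(plan):
    # Rewrite a stage known to produce zero rows into an EmptyRelation,
    # but leave EmptyRelation itself untouched (the no-op guard).
    if plan == ("QueryStage", 0):
        return ("EmptyRelation",)
    return plan

plan = ("QueryStage", 0)
plan = propagate_empty(plan)
assert plan == ("EmptyRelation",)
# Applying the rule again changes nothing, so no new stage is created
# and the propagate -> execute -> propagate loop cannot occur.
assert propagate_empty(plan) == plan
```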
[jira] [Resolved] (SPARK-48431) Do not forward predicates on collated columns to file readers
[ https://issues.apache.org/jira/browse/SPARK-48431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-48431.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46760
[https://github.com/apache/spark/pull/46760]

> Do not forward predicates on collated columns to file readers
> -
>
> Key: SPARK-48431
> URL: https://issues.apache.org/jira/browse/SPARK-48431
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Jan-Ole Sasse
> Assignee: Jan-Ole Sasse
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> SPARK-47657 allows pushing filters on collated columns to file sources that support it. If such filters are pushed to file sources, those file sources must not forward them to the actual file readers (i.e. Parquet or CSV readers), because there is no guarantee that those readers support collations.
> With this task, we widen filters on collated columns to AlwaysTrue when we translate filters for file sources.
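The "widen to AlwaysTrue" step can be sketched as a tiny filter-translation function. The names and predicate tuples here are illustrative, not Spark's internal API; the key property is that replacing a predicate with AlwaysTrue is always safe, because a filter handed to the reader may only skip rows the engine would have filtered out anyway.

```python
# Predicates are hypothetical (op, column, value) tuples.
ALWAYS_TRUE = ("AlwaysTrue",)

def translate_for_reader(predicate, collated_columns):
    """Translate a pushed-down filter for the underlying file reader,
    widening any predicate on a collated column to AlwaysTrue."""
    op, column, value = predicate
    if column in collated_columns:
        # The reader can't be trusted to compare collated strings correctly.
        return ALWAYS_TRUE
    return predicate

collated = {"name_utf8_lcase"}
assert translate_for_reader(("=", "name_utf8_lcase", "abc"), collated) == ALWAYS_TRUE
assert translate_for_reader(("=", "id", 7), collated) == ("=", "id", 7)
```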
[jira] [Updated] (SPARK-48446) Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax
[ https://issues.apache.org/jira/browse/SPARK-48446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48446:
---
Labels: easyfix pull-request-available (was: easyfix)

> Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax
> --
>
> Key: SPARK-48446
> URL: https://issues.apache.org/jira/browse/SPARK-48446
> Project: Spark
> Issue Type: Documentation
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Yuchen Liu
> Priority: Minor
> Labels: easyfix, pull-request-available
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> For dropDuplicates, the example on [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=)%20%5C%0A%20%20.-,dropDuplicates,-(%22guid%22] is out of date compared with [https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html]. The argument should be a list.
> The same discrepancy also applies to dropDuplicatesWithinWatermark.
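The doc fix is about the call shape, which can be shown without a Spark session. The function below is an illustrative stand-in for the real PySpark `DataFrame.dropDuplicates`, kept only to contrast the documented list form (`["guid"]`) with the outdated bare-string form.

```python
# Stand-in, not the PySpark API: subset must be a list of column names.
def drop_duplicates(rows, subset=None):
    if subset is not None and not isinstance(subset, list):
        raise TypeError("subset should be a list of column names")
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in subset) if subset else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"guid": 1, "v": "a"}, {"guid": 1, "v": "b"}]
assert len(drop_duplicates(rows, ["guid"])) == 1   # documented, correct form
try:
    drop_duplicates(rows, "guid")                  # the outdated example's form
    raise AssertionError("should have rejected a bare string")
except TypeError:
    pass
```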
[jira] [Commented] (SPARK-48380) AutoBatchedPickler caused Unsafe allocate to fail due to 2GB limit
[ https://issues.apache.org/jira/browse/SPARK-48380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850486#comment-17850486 ]

Zheng Shao commented on SPARK-48380:
-

A second look at the stacktrace shows that the current issue is not in AutoBatchedPickler, since it is actually calling the underlying iterator.next():

  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:93)

That doesn't mean AutoBatchedPickler has no problem; the batching logic still needs to be fixed. But there is an underlying problem in the iterator.next() call (which processes one row) that also needs to be fixed.

> AutoBatchedPickler caused Unsafe allocate to fail due to 2GB limit
> --
>
> Key: SPARK-48380
> URL: https://issues.apache.org/jira/browse/SPARK-48380
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Reporter: Zheng Shao
> Priority: Major
> Labels: pull-request-available
>
> AutoBatchedPickler assumes that the row sizes are more or less uniform. That is of course not true all the time.
> It needs a more conservative algorithm. Or, more simply, we should introduce an option to disable auto-batching and pickle one row at a time.
>
> The stacktrace:
> {code}
> Py4JJavaError: An error occurred while calling o562.saveAsTable.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1811 in stage 8.0 failed 4 times, most recent failure: Lost task 1811.3 in stage 8.0 (TID 2782) (10.251.129.187 executor 70): java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 578595584 because the size after growing exceeds size limitation 2147483632
>   at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>   at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:66)
>   at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:201)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.writeArray(InterpretedUnsafeProjection.scala:322)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$16(InterpretedUnsafeProjection.scala:200)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$16$adapted(InterpretedUnsafeProjection.scala:198)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$22(InterpretedUnsafeProjection.scala:288)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$22$adapted(InterpretedUnsafeProjection.scala:286)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateStructWriter$2(InterpretedUnsafeProjection.scala:123)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateStructWriter$2$adapted(InterpretedUnsafeProjection.scala:120)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.$anonfun$writer$3(InterpretedUnsafeProjection.scala:67)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.$anonfun$writer$3$adapted(InterpretedUnsafeProjection.scala:65)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.apply(InterpretedUnsafeProjection.scala:90)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.apply(InterpretedUnsafeProjection.scala:36)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:93)
>   at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:83)
>   at org.apache.spark.api.python.PythonRDD$.writeNextElementToStream(PythonRDD.scala:474)
>   at org.apache.spark.api.python.PythonRunner$$anon$2.writeNextInputToStream(PythonRunner.scala:885)
>   at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.writeAdditionalInputToPythonWorker(PythonRunner.scala:813)
>   at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:733)
>   at java.base/java.io.BufferedInputStream.fill(BufferedInputS
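The "more conservative algorithm" the report asks for can be sketched in pure Python: instead of assuming uniform row sizes, measure each row's serialized size and close the batch before the running total crosses a cap. This is a hypothetical simplification for illustration, not Spark's AutoBatchedPickler.

```python
import pickle

LIMIT = 1 << 20  # stand-in for Spark's ~2 GB BufferHolder limit

def batches(rows, target=64 * 1024):
    """Yield batches whose measured pickled size stays under min(target, LIMIT)."""
    batch, size = [], 0
    for row in rows:
        n = len(pickle.dumps(row))
        if batch and size + n > min(target, LIMIT):
            yield batch          # close the batch before it gets too large
            batch, size = [], 0
        batch.append(row)
        size += n
    if batch:
        yield batch

# Rows far larger than any uniform-size guess: each lands in its own batch,
# so no single serialized buffer blows past the cap.
rows = ["x" * 300_000 for _ in range(10)]
out = list(batches(rows))
assert len(out) == 10 and all(len(b) == 1 for b in out)
```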
[jira] [Updated] (SPARK-48464) Refactor SQLConfSuite and StatisticsSuite
[ https://issues.apache.org/jira/browse/SPARK-48464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48464:
---
Labels: pull-request-available (was: )

> Refactor SQLConfSuite and StatisticsSuite
> -
>
> Key: SPARK-48464
> URL: https://issues.apache.org/jira/browse/SPARK-48464
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-48464) Refactor SQLConfSuite and StatisticsSuite
Rui Wang created SPARK-48464:
--

Summary: Refactor SQLConfSuite and StatisticsSuite
Key: SPARK-48464
URL: https://issues.apache.org/jira/browse/SPARK-48464
Project: Spark
Issue Type: Sub-task
Components: Tests
Affects Versions: 4.0.0
Reporter: Rui Wang
Assignee: Rui Wang
[jira] [Resolved] (SPARK-48462) Refactor HiveQuerySuite.scala and HiveTableScanSuite
[ https://issues.apache.org/jira/browse/SPARK-48462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-48462.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46792
[https://github.com/apache/spark/pull/46792]

> Refactor HiveQuerySuite.scala and HiveTableScanSuite
> 
>
> Key: SPARK-48462
> URL: https://issues.apache.org/jira/browse/SPARK-48462
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Commented] (SPARK-44970) Spark History File Uploads Can Fail on S3
[ https://issues.apache.org/jira/browse/SPARK-44970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850475#comment-17850475 ]

Steve Loughran commented on SPARK-44970:
-

Correct. The file is only saved on close(). The incomplete multipart uploads are still there, and you do get billed for them, which is why it's critical to have a lifecycle rule to clean this stuff up. In theory you may be able to rebuild the file after a failure; in practice you'd have a hard time working out the order, and you'll still be short the last 32-64 MB of data.

> Spark History File Uploads Can Fail on S3
> -
>
> Key: SPARK-44970
> URL: https://issues.apache.org/jira/browse/SPARK-44970
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Holden Karau
> Assignee: Holden Karau
> Priority: Major
>
> Sometimes if the driver OOMs, the history log upload will not finish.
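The lifecycle rule mentioned above can be expressed as a standard S3 bucket lifecycle configuration. This is a minimal sketch; the rule ID, the `spark-history/` prefix, and the 7-day window are illustrative choices, not values from this thread.

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": {"Prefix": "spark-history/"},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }
  ]
}
```

Applied with `aws s3api put-bucket-lifecycle-configuration`, this aborts any multipart upload still incomplete after seven days, so parts left behind by a crashed driver stop accruing storage charges.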
[jira] [Updated] (SPARK-48423) Unable to write MLPipeline to blob storage using .option attribute
[ https://issues.apache.org/jira/browse/SPARK-48423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chhavi Bansal updated SPARK-48423:
--

Description:
I am trying to write an MLlib pipeline, with a series of stages set in it, to Azure blob storage, giving the relevant write parameters, but it still complains of `fs.azure.account.key` not being found in the configuration. Sharing the code.
{code:java}
val spark = SparkSession.builder().appName("main").master("local[4]").getOrCreate()
import spark.implicits._
val df = spark.createDataFrame(Seq(
  (0L, "a b c d e spark"),
  (1L, "b d")
)).toDF("id", "text")
val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text")
val pipelinee = new Pipeline().setStages(Array(si))
val pipelineModel = pipelinee.fit(df)
val path = BLOB_STORAGE_PATH
pipelineModel.write
  .option("spark.hadoop.fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.id..dfs.core.windows.net", "__")
  .option("fs.azure.account.auth.type..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth.provider.type..dfs.core.windows.net", "__")
  .save(path){code}
The error that I get is
{code:java}
Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
  at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
  at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449){code}
This shows that even though the key/value of
{code:java}
spark.hadoop.fs.azure.account.key..dfs.core.windows.net {code}
is being sent via the option param, it is not being set internally. This works only if I explicitly set the values via
{code:java}
spark.conf.set(key, value) {code}
which might be problematic for a multi-tenant solution that can be using the same Spark context.
One other observation is that
{code:java}
df.write.option(key1, value1).option(key2, value2).save(path) {code}
fails with the same key error, while
{code:java}
map = Map(key1 -> value1, key2 -> value2)
df.write.options(map).save(path) {code}
works.
Help required on: similar to how the dataframe
{code:java}
df.write.options(Map) {code}
attribute helps to set the configuration, the .option(key1, value1) should also work to write to Azure blob storage.
[jira] [Updated] (SPARK-48423) Unable to write MLPipeline to blob storage using .option attribute
[ https://issues.apache.org/jira/browse/SPARK-48423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chhavi Bansal updated SPARK-48423:
--

Description:
I am trying to write an MLlib pipeline, with a series of stages set in it, to Azure blob storage, giving the relevant write parameters, but it still complains of `fs.azure.account.key` not being found in the configuration. Sharing the code.
{code:java}
val spark = SparkSession.builder().appName("main").master("local[4]").getOrCreate()
import spark.implicits._
val df = spark.createDataFrame(Seq(
  (0L, "a b c d e spark"),
  (1L, "b d")
)).toDF("id", "text")
val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text")
val pipelinee = new Pipeline().setStages(Array(si))
val pipelineModel = pipelinee.fit(df)
val path = BLOB_STORAGE_PATH
pipelineModel.write
  .option("spark.hadoop.fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.id..dfs.core.windows.net", "__")
  .option("fs.azure.account.auth.type..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth.provider.type..dfs.core.windows.net", "__")
  .save(path){code}
The error that I get is
{code:java}
Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
  at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
  at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449){code}
This shows that even though the key/value of
{code:java}
spark.hadoop.fs.azure.account.key..dfs.core.windows.net {code}
is being sent via the option param, it is not being set internally. This works only if I explicitly set the values via
{code:java}
spark.conf.set(key, value) {code}
which might be problematic for a multi-tenant solution that can be using the same Spark context.
One other observation is that
{code:java}
df.write.option(key1, value1).option(key2, value2).save(path) {code}
fails with the same key error, while
{code:java}
map = Map(key1 -> value1, key2 -> value2)
df.write.options(map).save(path) {code}
works.
Help required on: similar to how the dataframe `options`
{code:java}
df.write.options(Map) {code}
helps to set the configuration, the *.option(key1, value1)* should also work to write to Azure blob storage.
[jira] [Created] (SPARK-48463) MLLib function unable to handle tested data
Chhavi Bansal created SPARK-48463: - Summary: MLLib function unable to handle tested data Key: SPARK-48463 URL: https://issues.apache.org/jira/browse/SPARK-48463 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 3.5.1 Reporter: Chhavi Bansal I am trying to use feature transformer on nested data after flattening, but it fails. {code:java} val structureData = Seq( Row(Row(10, 12), 1000), Row(Row(12, 14), 4300), Row( Row(37, 891), 1400), Row(Row(8902, 12), 4000), Row(Row(12, 89), 1000) ) val structureSchema = new StructType() .add("location", new StructType() .add("longitude", IntegerType) .add("latitude", IntegerType)) .add("salary", IntegerType) val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), structureSchema) def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: String = null): Array[Column] = { schema.fields.flatMap(f => { val colName = if (prefix == null) f.name else (prefix + "." + f.name) val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + f.name) f.dataType match { case st: StructType => flattenSchema(st, colName, colnameSelect) case _ => Array(col(colName).as(colnameSelect)) } }) } val flattenColumns = flattenSchema(df.schema) val flattenedDf = df.select(flattenColumns: _*){code} Now using the string indexer on the DOT notation. {code:java} val si = new StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") val pipeline = new Pipeline().setStages(Array(si)) pipeline.fit(flattenedDf).transform(flattenedDf).show() {code} The above code fails {code:java} xception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "location.longitude" among (location.longitude, location.latitude, salary); did you mean to quote the `location.longitude` column? 
at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258) at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250) . {code} This points to the same failure as when we try to select dot-notation columns in a Spark DataFrame, which is solved using BACKTICKS *`column.name`.* [https://stackoverflow.com/a/51430335/11688337] *So next* I use backticks while defining the StringIndexer {code:java} val si = new StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee") {code} In this case *it again fails* (with a different reason) in the StringIndexer code itself {code:java} Exception in thread "main" org.apache.spark.SparkException: Input column `location.longitude` does not exist. at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) {code} This blocks me from using feature transformation functions on nested columns. Any help in solving this problem will be highly appreciated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
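[Editor's note, not part of the ticket: a common workaround is to avoid dots in the flattened aliases entirely, e.g. joining nested field names with an underscore, so downstream transformers such as StringIndexer can resolve the columns without backticks. A minimal, Spark-free sketch of that renaming logic follows; the `flatten_names` helper and its list-of-pairs schema encoding are illustrative stand-ins for `flattenSchema`/`StructType`, not Spark APIs.]

```python
# Sketch: build flattened column aliases that avoid dots, mirroring the
# reporter's flattenSchema helper but joining nested names with "_".
# A schema is modeled as a list of (name, subschema-or-None) pairs.

def flatten_names(schema, prefix=None, sep="_"):
    """Return a dot-free alias for every leaf field in the schema."""
    out = []
    for name, sub in schema:
        alias = name if prefix is None else prefix + sep + name
        if sub is None:                 # leaf field
            out.append(alias)
        else:                           # nested struct: recurse
            out.extend(flatten_names(sub, alias, sep))
    return out

schema = [
    ("location", [("longitude", None), ("latitude", None)]),
    ("salary", None),
]
print(flatten_names(schema))
# ['location_longitude', 'location_latitude', 'salary']
```

In the Scala code from the report this would correspond to aliasing with `col(colName).as(colnameSelect.replace(".", "_"))`, after which `setInputCol("location_longitude")` should resolve without quoting.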
[jira] [Updated] (SPARK-48463) MLLib function unable to handle nested data
[ https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chhavi Bansal updated SPARK-48463: -- Summary: MLLib function unable to handle nested data (was: MLLib function unable to handle tested data) > MLLib function unable to handle nested data > --- > > Key: SPARK-48463 > URL: https://issues.apache.org/jira/browse/SPARK-48463 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 3.5.1 >Reporter: Chhavi Bansal >Priority: Major > Labels: ML, MLPipelines, mllib, nested > > I am trying to use feature transformer on nested data after flattening, but > it fails. > > {code:java} > val structureData = Seq( > Row(Row(10, 12), 1000), > Row(Row(12, 14), 4300), > Row( Row(37, 891), 1400), > Row(Row(8902, 12), 4000), > Row(Row(12, 89), 1000) > ) > val structureSchema = new StructType() > .add("location", new StructType() > .add("longitude", IntegerType) > .add("latitude", IntegerType)) > .add("salary", IntegerType) > val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), > structureSchema) > def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: > String = null): > Array[Column] = { > schema.fields.flatMap(f => { > val colName = if (prefix == null) f.name else (prefix + "." + f.name) > val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + > f.name) > f.dataType match { > case st: StructType => flattenSchema(st, colName, colnameSelect) > case _ => > Array(col(colName).as(colnameSelect)) > } > }) > } > val flattenColumns = flattenSchema(df.schema) > val flattenedDf = df.select(flattenColumns: _*){code} > Now using the string indexer on the DOT notation. 
> > {code:java} > val si = new > StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") > val pipeline = new Pipeline().setStages(Array(si)) > pipeline.fit(flattenedDf).transform(flattenedDf).show() {code} > The above code fails > {code:java} > xception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > resolve column name "location.longitude" among (location.longitude, > location.latitude, salary); did you mean to quote the `location.longitude` > column? > at > org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258) > at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250) > . {code} > This points to the same failure as when we try to select dot notation columns > in a spark dataframe, which is solved using BACKTICKS *`column.name`.* > [https://stackoverflow.com/a/51430335/11688337] > > *so next* > I use the back ticks while defining stringIndexer > {code:java} > val si = new > StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee") > {code} > In this case *it again fails* (with a diff reason) in the stringIndexer code > itself > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Input column > `location.longitude` does not exist. > at > org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > {code} > > This blocks me to use feature transformation functions on nested columns. > Any help in solving this problem will be highly appreciated. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48461) Replace NullPointerExceptions with proper error classes in AssertNotNull expression
[ https://issues.apache.org/jira/browse/SPARK-48461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48461: --- Labels: pull-request-available (was: ) > Replace NullPointerExceptions with proper error classes in AssertNotNull > expression > --- > > Key: SPARK-48461 > URL: https://issues.apache.org/jira/browse/SPARK-48461 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Priority: Major > Labels: pull-request-available > > [Code location > here|https://github.com/apache/spark/blob/f5d9b809881552c0e1b5af72b2a32caa25018eb3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L1929] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48281) Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE)
[ https://issues.apache.org/jira/browse/SPARK-48281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48281: --- Assignee: Uroš Bojanić > Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE) > - > > Key: SPARK-48281 > URL: https://issues.apache.org/jira/browse/SPARK-48281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48281) Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE)
[ https://issues.apache.org/jira/browse/SPARK-48281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48281. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46589 [https://github.com/apache/spark/pull/46589] > Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE) > - > > Key: SPARK-48281 > URL: https://issues.apache.org/jira/browse/SPARK-48281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48462) Refactor HiveQuerySuite.scala and HiveTableScanSuite
[ https://issues.apache.org/jira/browse/SPARK-48462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48462: --- Labels: pull-request-available (was: ) > Refactor HiveQuerySuite.scala and HiveTableScanSuite > > > Key: SPARK-48462 > URL: https://issues.apache.org/jira/browse/SPARK-48462 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48462) Refactor HiveQuerySuite.scala and HiveTableScanSuite
Rui Wang created SPARK-48462: Summary: Refactor HiveQuerySuite.scala and HiveTableScanSuite Key: SPARK-48462 URL: https://issues.apache.org/jira/browse/SPARK-48462 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48444) Refactor SQLQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-48444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48444. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46778 [https://github.com/apache/spark/pull/46778] > Refactor SQLQuerySuite > -- > > Key: SPARK-48444 > URL: https://issues.apache.org/jira/browse/SPARK-48444 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48447: --- Labels: pull-request-available (was: ) > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yuchen Liu >Priority: Major > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > We should restrict that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed to prevent customer from > instantiating arbitrary class of objects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48461) Replace NullPointerExceptions with proper error classes in AssertNotNull expression
Daniel created SPARK-48461: -- Summary: Replace NullPointerExceptions with proper error classes in AssertNotNull expression Key: SPARK-48461 URL: https://issues.apache.org/jira/browse/SPARK-48461 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Daniel [Code location here|https://github.com/apache/spark/blob/f5d9b809881552c0e1b5af72b2a32caa25018eb3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L1929] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48460) Spark ORC writer generates incorrect meta information(min, max)
[ https://issues.apache.org/jira/browse/SPARK-48460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr T updated SPARK-48460: Description: We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 and higher versions which contain long strings. Steps to reproduce the issue: 1) Create DF with a string longer than 1024 {code:java} val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;"){code} {code:java} val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;"){code} {code:java} valid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1024| +---++++{code} {code:java} invalid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1025| +---++++{code} 2. Write in ORC format to S3 {code:java} valid.write.format("orc") .option("path", "s3://bucket/test/test_orc/") .option("compression", "zlib") .mode("overwrite") .save(){code} 3. 
Check the ORC metadata with the *hive --orcfiledump* command {code:java} [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code} We can see incorrect statistics for the string column {code:java} Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025{code} {code:java} Processing data file s3://bucket-dev/tets/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 Stripes: Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 2 section ROW_INDEX start: 38 length 19 Stream: column 3 section ROW_INDEX start: 57 length 54 Stream: column 1 section DATA start: 111 length 6 Stream: column 2 section PRESENT start: 117 length 5 Stream: column 2 section DATA start: 122 length 0 Stream: column 2 section LENGTH start: 122 length 0 Stream: column 2 section DICTIONARY_DATA start: 122 length 0 Stream: column 3 section DATA start: 122 length 16 Stream: column 3 section LENGTH start: 138 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DICTIONARY_V2[0] Encoding column 3: DIRECT_V2 File length: 488 bytesPadding length: 0 bytesPadding ratio: 0% User Metadata: 
org.apache.spark.version=3.4.1{code} For DF with a value smaller than 1024, we can see valid statistics {code:java} hive --orcfiledump s3://bucket/test/test_orcProcessing data file s3://bucket/test/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min:
[jira] [Created] (SPARK-48460) Spark ORC writer generates incorrect meta information(mix, max)
Volodymyr T created SPARK-48460: --- Summary: Spark ORC writer generates incorrect meta information(mix, max) Key: SPARK-48460 URL: https://issues.apache.org/jira/browse/SPARK-48460 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.4.3, 3.3.4, 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.3.2, 3.4.2, 3.3.3, 3.2.4, 3.2.3, 3.3.1, 3.2.2, 3.3.0, 3.2.1 Reporter: Volodymyr T We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 and higher versions which contain long strings. Steps to reproduce the issue: 1) Create DF with a string longer than 1024 {code:java} val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;"){code} {code:java} val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;"){code} {code:java} valid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1024| +---++++{code} {code:java} invalid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1025| +---++++{code} 2. Write in ORC format to S3 {code:java} valid.write.format("orc") .option("path", "s3://bucket/test/test_orc/") .option("compression", "zlib") .mode("overwrite") .save(){code} 3. 
Check the ORC metadata with the *hive --orcfiledump* command {code:java} [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code} We can see incorrect statistics for the string column {code:java} Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025{code} {code:java} Processing data file s3://bucket-dev/tets/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 Stripes: Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 2 section ROW_INDEX start: 38 length 19 Stream: column 3 section ROW_INDEX start: 57 length 54 Stream: column 1 section DATA start: 111 length 6 Stream: column 2 section PRESENT start: 117 length 5 Stream: column 2 section DATA start: 122 length 0 Stream: column 2 section LENGTH start: 122 length 0 Stream: column 2 section DICTIONARY_DATA start: 122 length 0 Stream: column 3 section DATA start: 122 length 16 Stream: column 3 section LENGTH start: 138 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DICTIONARY_V2[0] Encoding column 3: DIRECT_V2 File length: 488 bytesPadding length: 0 bytesPadding ratio: 0% User Metadata: 
org.apache.spark.version=3.4.1{code} For DF with a value smaller than 1024, we can see valid statistics {code:java} hive --orcfiledump s3://bucket/test/test_orcProcessing data file s3://bucket/test/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: A
[jira] [Updated] (SPARK-48460) Spark ORC writer generates incorrect meta information(min, max)
[ https://issues.apache.org/jira/browse/SPARK-48460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr T updated SPARK-48460: Summary: Spark ORC writer generates incorrect meta information(min, max) (was: Spark ORC writer generates incorrect meta information(mix, max)) > Spark ORC writer generates incorrect meta information(min, max) > --- > > Key: SPARK-48460 > URL: https://issues.apache.org/jira/browse/SPARK-48460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, 3.4.2, > 3.3.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: Volodymyr T >Priority: Major > > We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 > and higher versions which contain long strings. > Steps to reproduce the issue: > 1) Create DF with a string longer than 1024 > > {code:java} > val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, > lpad('A', 1024, 'A') as string;"){code} > {code:java} > val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, > lpad('A', 1025, 'A') as string;"){code} > {code:java} > valid.withColumn("len", length($"string")).show() > +---++++ | id|null| string| len| > +---++++ | 1|null|A...|1024| > +---++++{code} > > {code:java} > invalid.withColumn("len", length($"string")).show() > +---++++ | id|null| string| len| > +---++++ | 1|null|A...|1025| > +---++++{code} > 2. Write in ORC format to S3 > {code:java} > valid.write.format("orc") > .option("path", "s3://bucket/test/test_orc/") > .option("compression", "zlib") > .mode("overwrite") > .save(){code} > 3. 
Check ORC meta by *hive –orcfiledump* command > {code:java} > [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code} > We can see incorrect statistics for column string > {code:java} > Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: > 1025{code} > {code:java} > Processing data file > s3://bucket-dev/tets/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc > [length: 488]Structure for > s3://timmedia-dev/volodymyr/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orcFile > Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: > 262144Type: struct > Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column > 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: > count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false > bytesOnDisk: 23 min: null max: null sum: 1025 > File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 > hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 > hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: > 23 min: null max: null sum: 1025 > Stripes: Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108 Stream: > column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section > ROW_INDEX start: 14 length 24 Stream: column 2 section ROW_INDEX start: 38 > length 19 Stream: column 3 section ROW_INDEX start: 57 length 54 > Stream: column 1 section DATA start: 111 length 6 Stream: column 2 section > PRESENT start: 117 length 5 Stream: column 2 section DATA start: 122 > length 0 Stream: column 2 section LENGTH start: 122 length 0 Stream: > column 2 section DICTIONARY_DATA start: 122 length 0 Stream: column 3 > section DATA start: 122 length 16 Stream: column 3 section LENGTH start: > 138 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 > Encoding column 2: DICTIONARY_V2[0] Encoding column 3: DIRECT_V2 > File length: 488 bytesPadding length: 0 
bytesPadding ratio: 0% > User Metadata: org.apache.spark.version=3.4.1{code} > For DF with a value smaller than 1024, we can see valid statistics > {code:java} > hive --orcfiledump s3://bucket/test/test_orcProcessing data file > s3://bucket/test/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc > [length: 485]Structure for > s3://timmedia-dev/volodymyr/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orcFile > Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: > 262144Type: struct > Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column > 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: > count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false > bytesOnDisk: 23 min: > AA
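[Editor's note, not part of the ticket: the two dumps suggest that the writer records string min/max only up to a length limit (apparently 1024 bytes here) and emits no min/max for longer values, while `sum` (total bytes) is still tracked; Hive's concatenation then stumbles over the missing min/max. The following is only an illustrative Python model of that symptom, not the actual ORC writer code, and the 1024-byte threshold is an assumption inferred from the report.]

```python
# Illustrative model of the observed behavior: min/max statistics are
# dropped for string columns containing a value longer than STATS_LIMIT,
# while count and sum (total byte length) are still recorded.
STATS_LIMIT = 1024  # assumed threshold, inferred from the report

def column_stats(values, limit=STATS_LIMIT):
    stats = {"count": len(values), "sum": sum(len(v) for v in values)}
    if all(len(v) <= limit for v in values):
        stats["min"], stats["max"] = min(values), max(values)
    else:
        stats["min"] = stats["max"] = None  # dropped for long strings
    return stats

print(column_stats(["A" * 1024]))  # min/max present, sum: 1024
print(column_stats(["A" * 1025]))  # min/max are None, sum: 1025
```

This mirrors the dumps above: the 1024-character file shows real min/max, the 1025-character file shows `min: null max: null sum: 1025`.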
[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850431#comment-17850431 ] Ted Chester Jenks commented on SPARK-48361: --- Yeah, it all makes sense, but a little less than ideal. For my purposes, this workaround is good enough, so if there is no interest in pursuing this further I am happy to close out. However, I do still believe this could cause issues for others. Perhaps the pruning should be disabled if there is a corrupt record column in use? > Correctness: CSV corrupt record filter with aggregate ignored > - > > Key: SPARK-48361 > URL: https://issues.apache.org/jira/browse/SPARK-48361 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 > Environment: Using spark shell 3.5.1 on M1 Mac >Reporter: Ted Chester Jenks >Priority: Major > > Using corrupt record in CSV parsing for some data cleaning logic, I came > across a correctness bug. > > The following repro can be run with spark-shell 3.5.1. 
> *Create test.csv with the following content:* > {code:java} > test,1,2,three > four,5,6,seven > 8,9 > ten,11,12,thirteen {code} > > > *In spark-shell:* > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > > // define a STRING, DOUBLE, DOUBLE, STRING schema for the data > val schema = StructType(List(StructField("column1", StringType, true), > StructField("column2", DoubleType, true), StructField("column3", DoubleType, > true), StructField("column4", StringType, true))) > > // add a column for corrupt records to the schema > val schemaWithCorrupt = StructType(schema.fields :+ > StructField("_corrupt_record", StringType, true)) > > // read the CSV with the schema, headers, permissive parsing, and the corrupt > record column > val df = spark.read.option("header", "true").option("mode", > "PERMISSIVE").option("columnNameOfCorruptRecord", > "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") > > // define a UDF to count the commas in the corrupt record column > val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else > -1) > > // add a true/false column for whether the number of commas is 3 > val dfWithJagged = df.withColumn("__is_jagged", > when(col("_corrupt_record").isNull, > false).otherwise(countCommas(col("_corrupt_record")) =!= 3)) > dfWithJagged.show(){code} > *Returns:* > {code:java} > +---+---+---++---+---+ > |column1|column2|column3| column4|_corrupt_record|__is_jagged| > +---+---+---++---+---+ > | four| 5.0| 6.0| seven| NULL| false| > | 8| 9.0| NULL| NULL| 8,9| true| > | ten| 11.0| 12.0|thirteen| NULL| false| > +---+---+---++---+---+ {code} > So far so good... 
> > *BUT* > > *If we add an aggregate before we show:* > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val schema = StructType(List(StructField("column1", StringType, true), > StructField("column2", DoubleType, true), StructField("column3", DoubleType, > true), StructField("column4", StringType, true))) > val schemaWithCorrupt = StructType(schema.fields :+ > StructField("_corrupt_record", StringType, true)) > val df = spark.read.option("header", "true").option("mode", > "PERMISSIVE").option("columnNameOfCorruptRecord", > "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") > val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else > -1) > val dfWithJagged = df.withColumn("__is_jagged", > when(col("_corrupt_record").isNull, > false).otherwise(countCommas(col("_corrupt_record")) =!= 3)) > val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true) > val groupedSum = > dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2")) > groupedSum.show(){code} > *We get:* > {code:java} > +---+---+ > |column1|sum_column2| > +---+---+ > | 8| 9.0| > | four| 5.0| > | ten| 11.0| > +---+---+ {code} > > *Which is not correct* > > With the addition of the aggregate, the filter down to rows with 3 commas in > the corrupt record column is ignored. This does not happen with any other > operators I have tried - just aggregates so far. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
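[Editor's note, not part of the ticket: the cleaning predicate itself is easy to verify outside Spark. The Spark-free sketch below replays the repro's logic on plain tuples: a record is "jagged" when its corrupt-record text is present and does not contain exactly 3 commas (i.e. it did not split into 4 fields). The expected, correct result keeps only `four` and `ten`; the reported bug is that after `groupBy`/`agg` the `8` row reappears.]

```python
# Spark-free sketch of the report's cleaning predicate: flag rows whose
# corrupt-record text exists and does not contain exactly 3 commas.

def count_commas(s):
    # mirrors the countCommas UDF: -1 for rows that parsed cleanly
    return s.count(",") if s is not None else -1

def is_jagged(corrupt_record):
    return False if corrupt_record is None else count_commas(corrupt_record) != 3

# (column1, _corrupt_record) pairs matching the repro's parsed rows
rows = [("four", None), ("8", "8,9"), ("ten", None)]
kept = [c1 for c1, corrupt in rows if not is_jagged(corrupt)]
print(kept)  # ['four', 'ten'] -- the '8' row must be filtered out
```

Any aggregate computed over `kept` should therefore never see the `8` row, which is what makes the `sum_column2` output above incorrect.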
[jira] [Updated] (SPARK-48423) Unable to write MLPipeline to blob storage using .option attribute
[ https://issues.apache.org/jira/browse/SPARK-48423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chhavi Bansal updated SPARK-48423: -- Priority: Blocker (was: Major) > Unable to write MLPipeline to blob storage using .option attribute > -- > > Key: SPARK-48423 > URL: https://issues.apache.org/jira/browse/SPARK-48423 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, Spark Core >Affects Versions: 3.4.3 >Reporter: Chhavi Bansal >Priority: Blocker > > I am trying to write mllib pipeline with a series of stages set in it to > azure blob storage giving relevant write parameters, but it still complains > of `fs.azure.account.key` not being found in the configuration. > Sharing the code. > {code:java} > val spark = > SparkSession.builder().appName("main").master("local[4]").getOrCreate() > import spark.implicits._ > val df = spark.createDataFrame(Seq( > (0L, "a b c d e spark"), > (1L, "b d") > )).toDF("id", "text") > val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text") > val pipelinee = new Pipeline().setStages(Array(si)) > val pipelineModel = pipelinee.fit(df) > val path = BLOB_STORAGE_PATH > pipelineModel.write > .option("spark.hadoop.fs.azure.account.key..dfs.core.windows.net", > "__").option("fs.azure.account.key..dfs.core.windows.net", > "__").option("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", > > "__").option("fs.azure.account.oauth2.client.id..dfs.core.windows.net", > > "__").option("fs.azure.account.auth.type..dfs.core.windows.net","__").option("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", > > "__").option("fs.azure.account.oauth.provider.type..dfs.core.windows.net", > "__") > .save(path){code} > > The error that i get is > {code:java} > Failure to initialize configuration > Caused by: InvalidConfigurationValueException: Invalid configuration value > detected for fs.azure.account.key > at > 
org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51) > at > org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548) > at > org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449){code} > This shows that even though the key/value of > {code:java} > spark.hadoop.fs.azure.account.key..dfs.core.windows.net {code} > is sent via the option parameter, it is not being set internally. > > This works only if I explicitly set the values with > {code:java} > spark.conf.set(key,value) {code} > which may be problematic for a multi-tenant solution that uses > the same Spark context. > One other observation: > {code:java} > df.write.option(key1,value1).option(key2,value2).save(path) {code} > fails with the same key error, while > {code:java} > map = Map(key1->value1, key2->value2) > df.write.options(map).save(path) {code} > works. > > My ask is that, just as the DataFrame > {code:java} > write.options(Map) {code} > API helps set this configuration, .option(key1, value1) should also work for > writing to Azure Blob Storage. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
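[Editor's note, not part of the ticket: since the reporter finds that passing all settings as one map works while chained `.option(k, v)` calls do not, a practical interim pattern is to collect every per-tenant filesystem setting into a single mapping and hand it over in one `options(...)` call. The plain-Python sketch below only shows the map-building step; `tenant_options`, the account name `myaccount`, and the placeholder values are illustrative, not a Spark or Hadoop API.]

```python
# Sketch: collect per-tenant ABFS settings into one mapping, to be passed
# in a single options(...) call rather than chained option(k, v) calls.
# The helper and key/value names are hypothetical placeholders.

def tenant_options(account, key, extra=None):
    """Build the option map for one tenant's storage account."""
    opts = {f"fs.azure.account.key.{account}.dfs.core.windows.net": key}
    opts.update(extra or {})        # merge any additional auth settings
    return opts

opts = tenant_options(
    "myaccount", "SECRET",
    {"fs.azure.account.auth.type.myaccount.dfs.core.windows.net": "SharedKey"},
)
print(sorted(opts))
```

In Scala this would correspond to the working path from the report, `df.write.options(map).save(path)`; whether `MLWriter` honors per-call options the same way is exactly what the ticket is asking for.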
[jira] [Reopened] (SPARK-48322) Drop internal metadata in `DataFrame.schema`
[ https://issues.apache.org/jira/browse/SPARK-48322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reopened SPARK-48322: --- > Drop internal metadata in `DataFrame.schema` > > > Key: SPARK-48322 > URL: https://issues.apache.org/jira/browse/SPARK-48322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48322) Drop internal metadata in `DataFrame.schema`
[ https://issues.apache.org/jira/browse/SPARK-48322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-48322: - Assignee: (was: Ruifeng Zheng) > Drop internal metadata in `DataFrame.schema` > > > Key: SPARK-48322 > URL: https://issues.apache.org/jira/browse/SPARK-48322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-42965) metadata mismatch for StructField when running some tests.
[ https://issues.apache.org/jira/browse/SPARK-42965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reopened SPARK-42965: --- Assignee: (was: Ruifeng Zheng) > metadata mismatch for StructField when running some tests. > -- > > Key: SPARK-42965 > URL: https://issues.apache.org/jira/browse/SPARK-42965 > Project: Spark > Issue Type: Improvement > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > For some reason, the metadata of `StructField` differs in a few tests > when using Spark Connect, even though the function works properly. > For example, running `python/run-tests --testnames > 'pyspark.pandas.tests.connect.data_type_ops.test_parity_binary_ops > BinaryOpsParityTests.test_add'` complains `AssertionError: > ([InternalField(dtype=int64, struct_field=StructField('bool', LongType(), > False))], [StructField('bool', LongType(), False)])` because the metadata differs > by something like `\{'__autoGeneratedAlias': 'true'}`; the fields have the same > name, type, and nullability, so the function itself works fine. > Therefore, we have temporarily added a branch for Spark Connect in the code > so that we can create InternalFrame properly and provide more pandas APIs in > Spark Connect. If a clear cause is found, we may need to revert it back to > its original state. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48402) Virtualenv section of "Python Package Management" does not work on Windows
[ https://issues.apache.org/jira/browse/SPARK-48402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850347#comment-17850347 ] Wolff Bock von Wuelfingen commented on SPARK-48402: --- Additional notes: the `#environment` suffix can be safely omitted and ignored; it does not work. From what I've gathered from third-party websites, it's supposed to extract the archive contents into a folder of that name. That is not the case. Also, the guide is missing a very crucial part: you have to activate the venv yourself, from your Python script, before you import other libraries. Example: {code:java} # Activate the virtual environment containing our dependencies activate_env = os.path.expanduser("/usr/local/app/.venv/bin/activate_this.py") exec(open(activate_env).read(), dict(__file__=activate_env)) {code} This third-party guide explains it nicely: https://saturncloud.io/blog/shipping-virtual-environments-with-pyspark-a-comprehensive-guide/ > Virtualenv section of "Python Package Management" does not work on Windows > -- > > Key: SPARK-48402 > URL: https://issues.apache.org/jira/browse/SPARK-48402 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.3 >Reporter: Wolff Bock von Wuelfingen >Priority: Minor > > [https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html] > has a section about using virtualenv, including venv-pack. venv-pack does not > currently work on Windows: [https://github.com/jcrist/venv-pack/issues/6] > Either the docs need adjusting to reflect that, or a different tool needs to > be used. I'm far from a Python expert, so I have no idea if the latter is > even possible. > It would be nice to add a hint, to save people some time (maybe suggesting > WSL instead?). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48459) Implement DataFrameQueryContext in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48459: --- Labels: pull-request-available (was: ) > Implement DataFrameQueryContext in Spark Connect > > > Key: SPARK-48459 > URL: https://issues.apache.org/jira/browse/SPARK-48459 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > Implements the same https://github.com/apache/spark/pull/45377 in Spark > Connect -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48459) Implement DataFrameQueryContext in Spark Connect
Hyukjin Kwon created SPARK-48459: Summary: Implement DataFrameQueryContext in Spark Connect Key: SPARK-48459 URL: https://issues.apache.org/jira/browse/SPARK-48459 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Implements the same https://github.com/apache/spark/pull/45377 in Spark Connect -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48458) Dynamic partition override mode might be ignored in certain scenarios causing data loss
Artem Kupchinskiy created SPARK-48458: - Summary: Dynamic partition override mode might be ignored in certain scenarios causing data loss Key: SPARK-48458 URL: https://issues.apache.org/jira/browse/SPARK-48458 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 2.4.8, 4.0.0 Reporter: Artem Kupchinskiy If an active Spark session is stopped in the middle of an insert into a file system, the session config responsible for partition-overwrite behavior might not be respected. The failure scenario is basically the following: # The Spark context is stopped just before [getting the partition override mode setting|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L69] # Due to the [fallback config usage in case of a stopped Spark context,|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L121] this mode evaluates to static (the default mode in the default SQLConf used as a fallback) # The data is cleared entirely [here|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L131], which is literally data loss from the perspective of a user who intended to overwrite the data only partially. This [gist|https://gist.github.com/akupchinskiy/b5f31781d59e5c0e9b172e7de40132cd] reproduces the behavior. On my local machine, it takes 1-3 iterations to see the pre-created data cleared completely. A mitigation for this bug is to use the explicit write parameter `partitionOverwriteMode` instead of relying on the session configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
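The mitigation mentioned in the ticket can be sketched as a per-write option (a PySpark sketch with placeholder data and output path, not the reproduction from the gist). Pinning `partitionOverwriteMode` on the writer itself keeps the behavior independent of the session-scoped configuration that a stopped context falls back on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, "2024-01-01"), (2, "2024-01-02")], ["value", "dt"])

(df.write
   .mode("overwrite")
   .option("partitionOverwriteMode", "dynamic")  # per-write, not session-level conf
   .partitionBy("dt")
   .parquet("/tmp/target_table"))
```

With the per-write option, only the partitions present in `df` are replaced, regardless of what `spark.sql.sources.partitionOverwriteMode` resolves to at execution time.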
[jira] [Commented] (SPARK-47415) Levenshtein (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850313#comment-17850313 ] Uroš Bojanić commented on SPARK-47415: -- [~nikolamand-db] [~nebojsa-db] For more context on this task: we have concluded that this is a hard problem and the current algorithm doesn't extend well to other collations. For this reason, we will now switch over to a pass-through approach: [https://github.com/apache/spark/pull/46788] Since there is little hope of adding actual collation support in the future, this means that the Levenshtein expression will work on string arguments with any given collation, but *will not* take that collation into consideration when calculating the edit distance between two strings. > Levenshtein (all collations) > > > Key: SPARK-47415 > URL: https://issues.apache.org/jira/browse/SPARK-47415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Levenshtein* built-in string function in > Spark. First confirm the expected behaviour for this function when given > collated strings, then move on to implementation and testing. > Implement the corresponding unit tests and E2E SQL tests to reflect how this > function should be used with collation in Spark SQL, and feel free to use your > chosen Spark SQL editor to experiment with the existing functions to learn > more about how they work. In addition, look into the possible use cases and > implementations of similar functions within other open-source DBMSs, such > as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal of this Jira ticket is to implement the *Levenshtein* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). > > Read more about the ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
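For reference, the edit distance in question is the classic dynamic-programming Levenshtein distance (a minimal sketch below, not Spark's implementation). Under the pass-through approach described in the comment, character comparison stays binary, so for example 'ABC' vs 'abc' has distance 3 even under a case-insensitive collation:

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```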
[jira] [Created] (SPARK-48457) Testing and operational readiness
David Milicevic created SPARK-48457: --- Summary: Testing and operational readiness Key: SPARK-48457 URL: https://issues.apache.org/jira/browse/SPARK-48457 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic We are basically doing this as we are developing the feature. This work item should serve as a checkpoint by the end of M0 to figure out if we have covered everything. Testing is very clearly defined by itself. For the operational readiness part, we are still to figure out what exactly we can do in the case of SQL scripting. It's a really straightforward feature and public documentation should serve well enough for most of the issues we might encounter. But, we should probably think about: * Some KPI indicators. * Telemetry. * Something else? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48456) Performance benchmark
David Milicevic created SPARK-48456: --- Summary: Performance benchmark Key: SPARK-48456 URL: https://issues.apache.org/jira/browse/SPARK-48456 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic Performance parity is officially an M2 requirement, but by the end of M0 I think we should start doing some perf benchmarks to figure out where we stand at the beginning and whether we need to change something right from the start, before we get to work on more complex stuff. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48455) Public documentation
David Milicevic created SPARK-48455: --- Summary: Public documentation Key: SPARK-48455 URL: https://issues.apache.org/jira/browse/SPARK-48455 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic Public documentation is officially Milestone 1 requirement, but I think we should start working on this even during Milestone 0. I guess this shouldn't be anything revolutionary, just a basic doc with SQL Scripting grammar and functions explained properly. We might want to sync with Serge about this to figure out if he has any thoughts before we start working on it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48454: -- Assignee: (was: Apache Spark) > Directly use the parent dataframe class > --- > > Key: SPARK-48454 > URL: https://issues.apache.org/jira/browse/SPARK-48454 > Project: Spark > Issue Type: Improvement > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48454: -- Assignee: Apache Spark > Directly use the parent dataframe class > --- > > Key: SPARK-48454 > URL: https://issues.apache.org/jira/browse/SPARK-48454 > Project: Spark > Issue Type: Improvement > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48416) Support related nested WITH expression
[ https://issues.apache.org/jira/browse/SPARK-48416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48416: -- Assignee: (was: Apache Spark) > Support related nested WITH expression > -- > > Key: SPARK-48416 > URL: https://issues.apache.org/jira/browse/SPARK-48416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Mingliang Zhu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48416) Support related nested WITH expression
[ https://issues.apache.org/jira/browse/SPARK-48416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48416: -- Assignee: Apache Spark > Support related nested WITH expression > -- > > Key: SPARK-48416 > URL: https://issues.apache.org/jira/browse/SPARK-48416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Mingliang Zhu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48454: --- Labels: pull-request-available (was: ) > Directly use the parent dataframe class > --- > > Key: SPARK-48454 > URL: https://issues.apache.org/jira/browse/SPARK-48454 > Project: Spark > Issue Type: Improvement > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48377) Multiple results API - sqlScript()
[ https://issues.apache.org/jira/browse/SPARK-48377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850298#comment-17850298 ] David Milicevic commented on SPARK-48377: - I took it on myself to write an API proposal (with atomic's help). > Multiple results API - sqlScript() > -- > > Key: SPARK-48377 > URL: https://issues.apache.org/jira/browse/SPARK-48377 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > For now: > * Write an API proposal > ** The API itself should be fine, but we need to figure out what the result > set should look like, i.e. in what format we return multiple DataFrames. > ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as > well. > * Figure out how the API will propagate down the Spark Connect stack > (depends on SPARK-48452 investigation) > > Probably to be separated into multiple subtasks in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48453) Support for PRINT/TRACE statement
David Milicevic created SPARK-48453: --- Summary: Support for PRINT/TRACE statement Key: SPARK-48453 URL: https://issues.apache.org/jira/browse/SPARK-48453 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic This is not defined in the [Ref Spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit#heading=h.4cz970y1mk93], but during the POC we figured out that it might be useful. We still need to figure out the details when we get to it; the propagation to the client and to the UI on the client side might not be trivial, so this needs further investigation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48377) Multiple results API - sqlScript()
[ https://issues.apache.org/jira/browse/SPARK-48377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Milicevic updated SPARK-48377: Description: For now: * Write an API proposal ** The API itself should be fine, but we need to figure out what the result set should look like, i.e. in what format we return multiple DataFrames. ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as well. * Figure out how the API will propagate down the Spark Connect stack (depends on SPARK-48452 investigation) Probably to be separated into multiple subtasks in the future. was: TBD. Probably to be separated into multiple subtasks. > Multiple results API - sqlScript() > -- > > Key: SPARK-48377 > URL: https://issues.apache.org/jira/browse/SPARK-48377 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > For now: > * Write an API proposal > ** The API itself should be fine, but we need to figure out what the result > set should look like, i.e. in what format we return multiple DataFrames. > ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as > well. > * Figure out how the API will propagate down the Spark Connect stack > (depends on SPARK-48452 investigation) > > Probably to be separated into multiple subtasks in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48452) Spark Connect investigation
[ https://issues.apache.org/jira/browse/SPARK-48452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850296#comment-17850296 ] David Milicevic commented on SPARK-48452: - [~milan.dankovic] is working on this. > Spark Connect investigation > --- > > Key: SPARK-48452 > URL: https://issues.apache.org/jira/browse/SPARK-48452 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > Some notebook modes, VS code extension, etc. execute SQL commands through > Spark Connect. > We need to: > - Figure out exceptions that we are getting in Spark Connect stack for SQL > scripts. > - Understand the Spark Connect stack better, so we can more easily propose > design for new API(s) we are going to introduce in the future. > > For more details, design doc can be found in parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48452) Spark Connect investigation
David Milicevic created SPARK-48452: --- Summary: Spark Connect investigation Key: SPARK-48452 URL: https://issues.apache.org/jira/browse/SPARK-48452 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic Some notebook modes, VS code extension, etc. execute SQL commands through Spark Connect. We need to: - Figure out exceptions that we are getting in Spark Connect stack for SQL scripts. - Understand the Spark Connect stack better, so we can more easily propose design for new API(s) we are going to introduce in the future. For more details, design doc can be found in parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org