[jira] [Created] (SPARK-48470) Rename listState functions for single and list type appends
Anish Shrigondekar created SPARK-48470:
--

Summary: Rename listState functions for single and list type appends
Key: SPARK-48470
URL: https://issues.apache.org/jira/browse/SPARK-48470
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar

Rename listState functions for single and list type appends

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45959) SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif updated SPARK-45959:
-
Priority: Major (was: Minor)

> SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> -
>
> Key: SPARK-45959
> URL: https://issues.apache.org/jira/browse/SPARK-45959
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Asif
> Priority: Major
> Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single shot, in reality it is difficult to expect customers to modify their code. In Spark 2 the analyzer rules did not do deep tree traversal, and in Spark 3 the plans are additionally cloned before being handed to the analyzer, optimizer, etc., which was not the case in Spark 2.
> Together these changes have increased query time from 5 min to 2-3 hrs.
> Many times the columns are added to the plan via some for-loop logic that just keeps adding new computation based on some rule.
> So, my suggestion is to collapse the Projects early, once analysis of the logical plan is done, but before the plan gets assigned to the field variable in QueryExecution.
> The PR for the above is ready for review.
> The major change is in the way the lookup is performed in CacheManager.
> I have described the logic in the PR and have added multiple tests.
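The growth pattern described above can be sketched in a few lines of pure Python. This is a toy model, not Spark's actual `LogicalPlan`/`Project` classes: each `withColumn`-style call wraps the previous plan in one more Project node, so a loop of N calls builds an N-deep tree that every subsequent rule traversal must walk, while collapsing adjacent Projects keeps the tree shallow.

```python
# Toy plan tree (hypothetical simplification, not Spark internals).
class Project:
    def __init__(self, columns, child=None):
        self.columns = list(columns)
        self.child = child  # None stands in for the base relation

def with_column(plan, name):
    # One new Project per added column, as in a withColumn loop.
    base = plan.columns if isinstance(plan, Project) else []
    return Project(base + [name], plan)

def depth(plan):
    d = 0
    while isinstance(plan, Project):
        d, plan = d + 1, plan.child
    return d

def collapse_projects(plan):
    # Merge a Project sitting directly on another Project into one node.
    while isinstance(plan, Project) and isinstance(plan.child, Project):
        plan = Project(plan.columns, plan.child.child)
    return plan

plan = None
for i in range(100):
    plan = with_column(plan, f"c{i}")
assert depth(plan) == 100                    # deep tree: every rule pays for it
assert depth(collapse_projects(plan)) == 1   # collapsed early, stays flat
```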
[jira] [Resolved] (SPARK-48464) Refactor SQLConfSuite and StatisticsSuite
[ https://issues.apache.org/jira/browse/SPARK-48464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48464.
--
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46796
[https://github.com/apache/spark/pull/46796]

> Refactor SQLConfSuite and StatisticsSuite
> -
>
> Key: SPARK-48464
> URL: https://issues.apache.org/jira/browse/SPARK-48464
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-48468) Add LogicalQueryStage interface in catalyst
[ https://issues.apache.org/jira/browse/SPARK-48468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48468:
---
Labels: pull-request-available (was: )

> Add LogicalQueryStage interface in catalyst
> ---
>
> Key: SPARK-48468
> URL: https://issues.apache.org/jira/browse/SPARK-48468
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Ziqi Liu
> Priority: Major
> Labels: pull-request-available
>
> Add `LogicalQueryStage` interface in catalyst so that it's visible in logical rules
[jira] [Created] (SPARK-48468) Add LogicalQueryStage interface in catalyst
Ziqi Liu created SPARK-48468:
--

Summary: Add LogicalQueryStage interface in catalyst
Key: SPARK-48468
URL: https://issues.apache.org/jira/browse/SPARK-48468
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Ziqi Liu

Add `LogicalQueryStage` interface in catalyst so that it's visible in logical rules
[jira] [Resolved] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48454.
--
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46785
[https://github.com/apache/spark/pull/46785]

> Directly use the parent dataframe class
> ---
>
> Key: SPARK-48454
> URL: https://issues.apache.org/jira/browse/SPARK-48454
> Project: Spark
> Issue Type: Improvement
> Components: PS
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-48454:
--
Assignee: Ruifeng Zheng

> Directly use the parent dataframe class
> ---
>
> Key: SPARK-48454
> URL: https://issues.apache.org/jira/browse/SPARK-48454
> Project: Spark
> Issue Type: Improvement
> Components: PS
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-48467) Upgrade Maven to 3.9.7
[ https://issues.apache.org/jira/browse/SPARK-48467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48467:
---
Labels: pull-request-available (was: )

> Upgrade Maven to 3.9.7
> --
>
> Key: SPARK-48467
> URL: https://issues.apache.org/jira/browse/SPARK-48467
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-48442) Add parenthesis to awaitTermination call
[ https://issues.apache.org/jira/browse/SPARK-48442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48442.
--
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46779
[https://github.com/apache/spark/pull/46779]

> Add parenthesis to awaitTermination call
> 
>
> Key: SPARK-48442
> URL: https://issues.apache.org/jira/browse/SPARK-48442
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.3
> Reporter: Riya Verma
> Assignee: Riya Verma
> Priority: Trivial
> Labels: correctness, pull-request-available, starter
> Fix For: 4.0.0
>
> In {{test_stream_reader}} and {{test_stream_writer}} of {*}test_python_streaming_datasource.py{*}, the expression {{q.awaitTermination}} does not invoke a function call as intended, but instead returns a Python function object. The fix is to change this to {{q.awaitTermination()}}.
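The bug class behind this fix is easy to demonstrate in plain Python, with a stand-in class rather than a real streaming query: referencing a method without parentheses yields the bound method object, so the wait never happens and the test silently passes.

```python
# Minimal stand-in for a streaming query handle (not the PySpark class).
class Query:
    def __init__(self):
        self.terminated = False

    def awaitTermination(self, timeout=None):
        self.terminated = True
        return True

q = Query()

result = q.awaitTermination      # missing (): no call is made
assert callable(result)          # we got the bound method back...
assert q.terminated is False     # ...and the query was never awaited

result = q.awaitTermination()    # the intended call
assert result is True
assert q.terminated is True
```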
[jira] [Created] (SPARK-48467) Upgrade Maven to 3.9.7
BingKun Pan created SPARK-48467:
---

Summary: Upgrade Maven to 3.9.7
Key: SPARK-48467
URL: https://issues.apache.org/jira/browse/SPARK-48467
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan
[jira] [Created] (SPARK-48465) Avoid no-op empty relation propagation in AQE
Ziqi Liu created SPARK-48465:
--

Summary: Avoid no-op empty relation propagation in AQE
Key: SPARK-48465
URL: https://issues.apache.org/jira/browse/SPARK-48465
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Ziqi Liu

We should avoid no-op empty relation propagation in AQE: if we convert an empty QueryStageExec to an empty relation, it will be further wrapped into a new query stage and executed, producing an empty result and triggering empty relation propagation again. This issue is currently not exposed because AQE will try to reuse the shuffle.
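The invariant the ticket asks for can be sketched abstractly: the empty-relation rewrite should be idempotent, so applying it to an already-converted node is a no-op rather than the start of another execute/propagate cycle. This is a pure-Python toy with made-up node tuples, not Spark's AQE code.

```python
# Hypothetical plan nodes: ("QueryStage", row_count) and ("EmptyRelation",).
def propagate_empty(plan):
    # Rewrite a stage known to produce zero rows into an EmptyRelation,
    # but leave EmptyRelation itself untouched (the no-op guard).
    if plan == ("QueryStage", 0):
        return ("EmptyRelation",)
    return plan

plan = ("QueryStage", 0)
plan = propagate_empty(plan)
assert plan == ("EmptyRelation",)
# Applying the rule again changes nothing, so no new stage is created
# and the propagate -> execute -> propagate loop cannot occur.
assert propagate_empty(plan) == plan
```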
[jira] [Resolved] (SPARK-48431) Do not forward predicates on collated columns to file readers
[ https://issues.apache.org/jira/browse/SPARK-48431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-48431.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46760
[https://github.com/apache/spark/pull/46760]

> Do not forward predicates on collated columns to file readers
> -
>
> Key: SPARK-48431
> URL: https://issues.apache.org/jira/browse/SPARK-48431
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Jan-Ole Sasse
> Assignee: Jan-Ole Sasse
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> SPARK-47657 allows pushing filters on collated columns to file sources that support it. If such filters are pushed to file sources, those file sources must not forward them to the actual file readers (i.e. Parquet or CSV readers), because there is no guarantee that those readers support collations.
> With this task, we widen filters on collated columns to AlwaysTrue when we translate filters for file sources.
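The "widen to AlwaysTrue" step can be sketched as a tiny filter-translation function. The names and predicate tuples here are illustrative, not Spark's internal API; the key property is that replacing a predicate with AlwaysTrue is always safe, because a filter handed to the reader may only skip rows the engine would have filtered out anyway.

```python
# Predicates are hypothetical (op, column, value) tuples.
ALWAYS_TRUE = ("AlwaysTrue",)

def translate_for_reader(predicate, collated_columns):
    """Translate a pushed-down filter for the underlying file reader,
    widening any predicate on a collated column to AlwaysTrue."""
    op, column, value = predicate
    if column in collated_columns:
        # The reader can't be trusted to compare collated strings correctly.
        return ALWAYS_TRUE
    return predicate

collated = {"name_utf8_lcase"}
assert translate_for_reader(("=", "name_utf8_lcase", "abc"), collated) == ALWAYS_TRUE
assert translate_for_reader(("=", "id", 7), collated) == ("=", "id", 7)
```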
[jira] [Updated] (SPARK-48446) Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax
[ https://issues.apache.org/jira/browse/SPARK-48446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48446:
---
Labels: easyfix pull-request-available (was: easyfix)

> Update SS Doc of dropDuplicatesWithinWatermark to use the right syntax
> --
>
> Key: SPARK-48446
> URL: https://issues.apache.org/jira/browse/SPARK-48446
> Project: Spark
> Issue Type: Documentation
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Yuchen Liu
> Priority: Minor
> Labels: easyfix, pull-request-available
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> For dropDuplicates, the example on [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=)%20%5C%0A%20%20.-,dropDuplicates,-(%22guid%22] is out of date compared with [https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html]. The argument should be a list.
> The same discrepancy also applies to dropDuplicatesWithinWatermark.
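The doc fix is about the call shape, which can be shown without a Spark session. The function below is an illustrative stand-in for the real PySpark `DataFrame.dropDuplicates`, kept only to contrast the documented list form (`["guid"]`) with the outdated bare-string form.

```python
# Stand-in, not the PySpark API: subset must be a list of column names.
def drop_duplicates(rows, subset=None):
    if subset is not None and not isinstance(subset, list):
        raise TypeError("subset should be a list of column names")
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in subset) if subset else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"guid": 1, "v": "a"}, {"guid": 1, "v": "b"}]
assert len(drop_duplicates(rows, ["guid"])) == 1   # documented, correct form
try:
    drop_duplicates(rows, "guid")                  # the outdated example's form
    raise AssertionError("should have rejected a bare string")
except TypeError:
    pass
```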
[jira] [Commented] (SPARK-48380) AutoBatchedPickler caused Unsafe allocate to fail due to 2GB limit
[ https://issues.apache.org/jira/browse/SPARK-48380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850486#comment-17850486 ]

Zheng Shao commented on SPARK-48380:
-

A second look at the stacktrace shows that the current issue is not in AutoBatchedPickler, since it is actually calling the underlying iterator.next():

  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
  at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:93)

That doesn't mean AutoBatchedPickler has no problem; the batching logic still needs to be fixed. But there is an underlying problem in the iterator.next() call (which processes one row) that also needs to be fixed.

> AutoBatchedPickler caused Unsafe allocate to fail due to 2GB limit
> --
>
> Key: SPARK-48380
> URL: https://issues.apache.org/jira/browse/SPARK-48380
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Reporter: Zheng Shao
> Priority: Major
> Labels: pull-request-available
>
> AutoBatchedPickler assumes that the row sizes are more or less uniform. That is of course not true all the time.
> It needs a more conservative algorithm. Or, more simply, we should introduce an option to disable auto-batching and pickle one row at a time.
>
> The stacktrace:
> {code}
> Py4JJavaError: An error occurred while calling o562.saveAsTable.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1811 in stage 8.0 failed 4 times, most recent failure: Lost task 1811.3 in stage 8.0 (TID 2782) (10.251.129.187 executor 70): java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 578595584 because the size after growing exceeds size limitation 2147483632
>   at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>   at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:66)
>   at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:201)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.writeArray(InterpretedUnsafeProjection.scala:322)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$16(InterpretedUnsafeProjection.scala:200)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$16$adapted(InterpretedUnsafeProjection.scala:198)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$22(InterpretedUnsafeProjection.scala:288)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateFieldWriter$22$adapted(InterpretedUnsafeProjection.scala:286)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateStructWriter$2(InterpretedUnsafeProjection.scala:123)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection$.$anonfun$generateStructWriter$2$adapted(InterpretedUnsafeProjection.scala:120)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.$anonfun$writer$3(InterpretedUnsafeProjection.scala:67)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.$anonfun$writer$3$adapted(InterpretedUnsafeProjection.scala:65)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.apply(InterpretedUnsafeProjection.scala:90)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.apply(InterpretedUnsafeProjection.scala:36)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:93)
>   at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:83)
>   at org.apache.spark.api.python.PythonRDD$.writeNextElementToStream(PythonRDD.scala:474)
>   at org.apache.spark.api.python.PythonRunner$$anon$2.writeNextInputToStream(PythonRunner.scala:885)
>   at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.writeAdditionalInputToPythonWorker(PythonRunner.scala:813)
>   at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.read(PythonRunner.scala:733)
>   at java.base/java.io.BufferedInputStream.fill(BufferedInputS
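The "more conservative algorithm" the report asks for can be sketched in pure Python: instead of assuming uniform row sizes, measure each row's serialized size and close the batch before the running total crosses a cap. This is a hypothetical simplification for illustration, not Spark's AutoBatchedPickler.

```python
import pickle

LIMIT = 1 << 20  # stand-in for Spark's ~2 GB BufferHolder limit

def batches(rows, target=64 * 1024):
    """Yield batches whose measured pickled size stays under min(target, LIMIT)."""
    batch, size = [], 0
    for row in rows:
        n = len(pickle.dumps(row))
        if batch and size + n > min(target, LIMIT):
            yield batch          # close the batch before it gets too large
            batch, size = [], 0
        batch.append(row)
        size += n
    if batch:
        yield batch

# Rows far larger than any uniform-size guess: each lands in its own batch,
# so no single serialized buffer blows past the cap.
rows = ["x" * 300_000 for _ in range(10)]
out = list(batches(rows))
assert len(out) == 10 and all(len(b) == 1 for b in out)
```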
[jira] [Updated] (SPARK-48464) Refactor SQLConfSuite and StatisticsSuite
[ https://issues.apache.org/jira/browse/SPARK-48464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48464:
---
Labels: pull-request-available (was: )

> Refactor SQLConfSuite and StatisticsSuite
> -
>
> Key: SPARK-48464
> URL: https://issues.apache.org/jira/browse/SPARK-48464
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-48464) Refactor SQLConfSuite and StatisticsSuite
Rui Wang created SPARK-48464:
--

Summary: Refactor SQLConfSuite and StatisticsSuite
Key: SPARK-48464
URL: https://issues.apache.org/jira/browse/SPARK-48464
Project: Spark
Issue Type: Sub-task
Components: Tests
Affects Versions: 4.0.0
Reporter: Rui Wang
Assignee: Rui Wang
[jira] [Resolved] (SPARK-48462) Refactor HiveQuerySuite.scala and HiveTableScanSuite
[ https://issues.apache.org/jira/browse/SPARK-48462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-48462.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46792
[https://github.com/apache/spark/pull/46792]

> Refactor HiveQuerySuite.scala and HiveTableScanSuite
> 
>
> Key: SPARK-48462
> URL: https://issues.apache.org/jira/browse/SPARK-48462
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Commented] (SPARK-44970) Spark History File Uploads Can Fail on S3
[ https://issues.apache.org/jira/browse/SPARK-44970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850475#comment-17850475 ]

Steve Loughran commented on SPARK-44970:
-

Correct. The file is only saved on close(). The incomplete multipart uploads are still there, and you do get billed for them, which is why it's critical to have a lifecycle rule to clean this stuff up. In theory you may be able to rebuild the file after a failure; in practice you'd have a hard time working out the order, and you'll still be short the last 32-64 MB of data.

> Spark History File Uploads Can Fail on S3
> -
>
> Key: SPARK-44970
> URL: https://issues.apache.org/jira/browse/SPARK-44970
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Holden Karau
> Assignee: Holden Karau
> Priority: Major
>
> Sometimes if the driver OOMs, the history log upload will not finish.
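The lifecycle rule mentioned above can be expressed as a standard S3 bucket lifecycle configuration. This is a minimal sketch; the rule ID, the `spark-history/` prefix, and the 7-day window are illustrative choices, not values from this thread.

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": {"Prefix": "spark-history/"},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }
  ]
}
```

Applied with `aws s3api put-bucket-lifecycle-configuration`, this aborts any multipart upload still incomplete after seven days, so parts left behind by a crashed driver stop accruing storage charges.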
[jira] [Updated] (SPARK-48423) Unable to write MLPipeline to blob storage using .option attribute
[ https://issues.apache.org/jira/browse/SPARK-48423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chhavi Bansal updated SPARK-48423:
--

Description:
I am trying to write an MLlib pipeline, with a series of stages set in it, to Azure blob storage, giving the relevant write parameters, but it still complains of `fs.azure.account.key` not being found in the configuration. Sharing the code.
{code:java}
val spark = SparkSession.builder().appName("main").master("local[4]").getOrCreate()
import spark.implicits._
val df = spark.createDataFrame(Seq(
  (0L, "a b c d e spark"),
  (1L, "b d")
)).toDF("id", "text")
val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text")
val pipelinee = new Pipeline().setStages(Array(si))
val pipelineModel = pipelinee.fit(df)
val path = BLOB_STORAGE_PATH
pipelineModel.write
  .option("spark.hadoop.fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.id..dfs.core.windows.net", "__")
  .option("fs.azure.account.auth.type..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth.provider.type..dfs.core.windows.net", "__")
  .save(path){code}
The error that I get is
{code:java}
Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
  at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
  at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449){code}
This shows that even though the key/value of
{code:java}
spark.hadoop.fs.azure.account.key..dfs.core.windows.net {code}
is being sent via the option param, it is not being set internally. This works only if I explicitly set the values via
{code:java}
spark.conf.set(key, value) {code}
which might be problematic for a multi-tenant solution that can be using the same Spark context.
One other observation is that
{code:java}
df.write.option(key1, value1).option(key2, value2).save(path) {code}
fails with the same key error, while
{code:java}
map = Map(key1 -> value1, key2 -> value2)
df.write.options(map).save(path) {code}
works.
Help required on: similar to how the dataframe
{code:java}
df.write.options(Map) {code}
attribute helps to set the configuration, the .option(key1, value1) should also work to write to Azure blob storage.
[jira] [Updated] (SPARK-48423) Unable to write MLPipeline to blob storage using .option attribute
[ https://issues.apache.org/jira/browse/SPARK-48423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chhavi Bansal updated SPARK-48423:
--

Description:
I am trying to write an MLlib pipeline, with a series of stages set in it, to Azure blob storage, giving the relevant write parameters, but it still complains of `fs.azure.account.key` not being found in the configuration. Sharing the code.
{code:java}
val spark = SparkSession.builder().appName("main").master("local[4]").getOrCreate()
import spark.implicits._
val df = spark.createDataFrame(Seq(
  (0L, "a b c d e spark"),
  (1L, "b d")
)).toDF("id", "text")
val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text")
val pipelinee = new Pipeline().setStages(Array(si))
val pipelineModel = pipelinee.fit(df)
val path = BLOB_STORAGE_PATH
pipelineModel.write
  .option("spark.hadoop.fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.key..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.id..dfs.core.windows.net", "__")
  .option("fs.azure.account.auth.type..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", "__")
  .option("fs.azure.account.oauth.provider.type..dfs.core.windows.net", "__")
  .save(path){code}
The error that I get is
{code:java}
Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
  at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
  at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449){code}
This shows that even though the key/value of
{code:java}
spark.hadoop.fs.azure.account.key..dfs.core.windows.net {code}
is being sent via the option param, it is not being set internally. This works only if I explicitly set the values via
{code:java}
spark.conf.set(key, value) {code}
which might be problematic for a multi-tenant solution that can be using the same Spark context.
One other observation is that
{code:java}
df.write.option(key1, value1).option(key2, value2).save(path) {code}
fails with the same key error, while
{code:java}
map = Map(key1 -> value1, key2 -> value2)
df.write.options(map).save(path) {code}
works.
Help required on: similar to how the dataframe `options`
{code:java}
df.write.options(Map) {code}
helps to set the configuration, the *.option(key1, value1)* should also work to write to Azure blob storage.
[jira] [Created] (SPARK-48463) MLLib function unable to handle tested data
Chhavi Bansal created SPARK-48463: - Summary: MLLib function unable to handle tested data Key: SPARK-48463 URL: https://issues.apache.org/jira/browse/SPARK-48463 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 3.5.1 Reporter: Chhavi Bansal I am trying to use feature transformer on nested data after flattening, but it fails. {code:java} val structureData = Seq( Row(Row(10, 12), 1000), Row(Row(12, 14), 4300), Row( Row(37, 891), 1400), Row(Row(8902, 12), 4000), Row(Row(12, 89), 1000) ) val structureSchema = new StructType() .add("location", new StructType() .add("longitude", IntegerType) .add("latitude", IntegerType)) .add("salary", IntegerType) val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), structureSchema) def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: String = null): Array[Column] = { schema.fields.flatMap(f => { val colName = if (prefix == null) f.name else (prefix + "." + f.name) val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + f.name) f.dataType match { case st: StructType => flattenSchema(st, colName, colnameSelect) case _ => Array(col(colName).as(colnameSelect)) } }) } val flattenColumns = flattenSchema(df.schema) val flattenedDf = df.select(flattenColumns: _*){code} Now using the string indexer on the DOT notation. {code:java} val si = new StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") val pipeline = new Pipeline().setStages(Array(si)) pipeline.fit(flattenedDf).transform(flattenedDf).show() {code} The above code fails {code:java} xception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "location.longitude" among (location.longitude, location.latitude, salary); did you mean to quote the `location.longitude` column? 
at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258) at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250) . {code} This points to the same failure as when we try to select dot-notation columns in a Spark DataFrame, which is solved using BACKTICKS *`column.name`.* [https://stackoverflow.com/a/51430335/11688337] *So next* I use backticks while defining the StringIndexer {code:java} val si = new StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee") {code} In this case *it again fails* (with a different reason) in the StringIndexer code itself {code:java} Exception in thread "main" org.apache.spark.SparkException: Input column `location.longitude` does not exist. at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) {code} This blocks me from using feature transformation functions on nested columns. Any help in solving this problem will be highly appreciated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
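[Editor's note, not part of the ticket: a common workaround is to avoid dots in the flattened aliases entirely, e.g. joining nested field names with an underscore, so downstream transformers such as StringIndexer can resolve the columns without backticks. A minimal, Spark-free sketch of that renaming logic follows; the `flatten_names` helper and its list-of-pairs schema encoding are illustrative stand-ins for `flattenSchema`/`StructType`, not Spark APIs.]

```python
# Sketch: build flattened column aliases that avoid dots, mirroring the
# reporter's flattenSchema helper but joining nested names with "_".
# A schema is modeled as a list of (name, subschema-or-None) pairs.

def flatten_names(schema, prefix=None, sep="_"):
    """Return a dot-free alias for every leaf field in the schema."""
    out = []
    for name, sub in schema:
        alias = name if prefix is None else prefix + sep + name
        if sub is None:                 # leaf field
            out.append(alias)
        else:                           # nested struct: recurse
            out.extend(flatten_names(sub, alias, sep))
    return out

schema = [
    ("location", [("longitude", None), ("latitude", None)]),
    ("salary", None),
]
print(flatten_names(schema))
# ['location_longitude', 'location_latitude', 'salary']
```

In the Scala code from the report this would correspond to aliasing with `col(colName).as(colnameSelect.replace(".", "_"))`, after which `setInputCol("location_longitude")` should resolve without quoting.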
[jira] [Updated] (SPARK-48463) MLLib function unable to handle nested data
[ https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chhavi Bansal updated SPARK-48463: -- Summary: MLLib function unable to handle nested data (was: MLLib function unable to handle tested data) > MLLib function unable to handle nested data > --- > > Key: SPARK-48463 > URL: https://issues.apache.org/jira/browse/SPARK-48463 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 3.5.1 >Reporter: Chhavi Bansal >Priority: Major > Labels: ML, MLPipelines, mllib, nested > > I am trying to use feature transformer on nested data after flattening, but > it fails. > > {code:java} > val structureData = Seq( > Row(Row(10, 12), 1000), > Row(Row(12, 14), 4300), > Row( Row(37, 891), 1400), > Row(Row(8902, 12), 4000), > Row(Row(12, 89), 1000) > ) > val structureSchema = new StructType() > .add("location", new StructType() > .add("longitude", IntegerType) > .add("latitude", IntegerType)) > .add("salary", IntegerType) > val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), > structureSchema) > def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: > String = null): > Array[Column] = { > schema.fields.flatMap(f => { > val colName = if (prefix == null) f.name else (prefix + "." + f.name) > val colnameSelect = if (prefix == null) f.name else (prefixSelect + "." + > f.name) > f.dataType match { > case st: StructType => flattenSchema(st, colName, colnameSelect) > case _ => > Array(col(colName).as(colnameSelect)) > } > }) > } > val flattenColumns = flattenSchema(df.schema) > val flattenedDf = df.select(flattenColumns: _*){code} > Now using the string indexer on the DOT notation. 
> > {code:java} > val si = new > StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee") > val pipeline = new Pipeline().setStages(Array(si)) > pipeline.fit(flattenedDf).transform(flattenedDf).show() {code} > The above code fails > {code:java} > xception in thread "main" org.apache.spark.sql.AnalysisException: Cannot > resolve column name "location.longitude" among (location.longitude, > location.latitude, salary); did you mean to quote the `location.longitude` > column? > at > org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258) > at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250) > . {code} > This points to the same failure as when we try to select dot notation columns > in a spark dataframe, which is solved using BACKTICKS *`column.name`.* > [https://stackoverflow.com/a/51430335/11688337] > > *so next* > I use the back ticks while defining stringIndexer > {code:java} > val si = new > StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee") > {code} > In this case *it again fails* (with a diff reason) in the stringIndexer code > itself > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Input column > `location.longitude` does not exist. > at > org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > {code} > > This blocks me to use feature transformation functions on nested columns. > Any help in solving this problem will be highly appreciated. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48461) Replace NullPointerExceptions with proper error classes in AssertNotNull expression
[ https://issues.apache.org/jira/browse/SPARK-48461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48461: --- Labels: pull-request-available (was: ) > Replace NullPointerExceptions with proper error classes in AssertNotNull > expression > --- > > Key: SPARK-48461 > URL: https://issues.apache.org/jira/browse/SPARK-48461 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Priority: Major > Labels: pull-request-available > > [Code location > here|https://github.com/apache/spark/blob/f5d9b809881552c0e1b5af72b2a32caa25018eb3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L1929] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48281) Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE)
[ https://issues.apache.org/jira/browse/SPARK-48281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48281: --- Assignee: Uroš Bojanić > Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE) > - > > Key: SPARK-48281 > URL: https://issues.apache.org/jira/browse/SPARK-48281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48281) Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE)
[ https://issues.apache.org/jira/browse/SPARK-48281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48281. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46589 [https://github.com/apache/spark/pull/46589] > Alter string search logic for: instr, substring_index (UTF8_BINARY_LCASE) > - > > Key: SPARK-48281 > URL: https://issues.apache.org/jira/browse/SPARK-48281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48462) Refactor HiveQuerySuite.scala and HiveTableScanSuite
[ https://issues.apache.org/jira/browse/SPARK-48462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48462: --- Labels: pull-request-available (was: ) > Refactor HiveQuerySuite.scala and HiveTableScanSuite > > > Key: SPARK-48462 > URL: https://issues.apache.org/jira/browse/SPARK-48462 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48462) Refactor HiveQuerySuite.scala and HiveTableScanSuite
Rui Wang created SPARK-48462: Summary: Refactor HiveQuerySuite.scala and HiveTableScanSuite Key: SPARK-48462 URL: https://issues.apache.org/jira/browse/SPARK-48462 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48444) Refactor SQLQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-48444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48444. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46778 [https://github.com/apache/spark/pull/46778] > Refactor SQLQuerySuite > -- > > Key: SPARK-48444 > URL: https://issues.apache.org/jira/browse/SPARK-48444 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48447: --- Labels: pull-request-available (was: ) > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yuchen Liu >Priority: Major > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > We should restrict that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed to prevent customer from > instantiating arbitrary class of objects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48461) Replace NullPointerExceptions with proper error classes in AssertNotNull expression
Daniel created SPARK-48461: -- Summary: Replace NullPointerExceptions with proper error classes in AssertNotNull expression Key: SPARK-48461 URL: https://issues.apache.org/jira/browse/SPARK-48461 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Daniel [Code location here|https://github.com/apache/spark/blob/f5d9b809881552c0e1b5af72b2a32caa25018eb3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L1929] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48460) Spark ORC writer generates incorrect meta information(min, max)
[ https://issues.apache.org/jira/browse/SPARK-48460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr T updated SPARK-48460: Description: We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 and higher versions which contain long strings. Steps to reproduce the issue: 1) Create DF with a string longer than 1024 {code:java} val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;"){code} {code:java} val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;"){code} {code:java} valid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1024| +---++++{code} {code:java} invalid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1025| +---++++{code} 2. Write in ORC format to S3 {code:java} valid.write.format("orc") .option("path", "s3://bucket/test/test_orc/") .option("compression", "zlib") .mode("overwrite") .save(){code} 3. 
Check the ORC metadata with the *hive --orcfiledump* command {code:java} [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code} We can see incorrect statistics for the string column {code:java} Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025{code} {code:java} Processing data file s3://bucket-dev/tets/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 Stripes: Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 2 section ROW_INDEX start: 38 length 19 Stream: column 3 section ROW_INDEX start: 57 length 54 Stream: column 1 section DATA start: 111 length 6 Stream: column 2 section PRESENT start: 117 length 5 Stream: column 2 section DATA start: 122 length 0 Stream: column 2 section LENGTH start: 122 length 0 Stream: column 2 section DICTIONARY_DATA start: 122 length 0 Stream: column 3 section DATA start: 122 length 16 Stream: column 3 section LENGTH start: 138 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DICTIONARY_V2[0] Encoding column 3: DIRECT_V2 File length: 488 bytesPadding length: 0 bytesPadding ratio: 0% User Metadata: 
org.apache.spark.version=3.4.1{code} For DF with a value smaller than 1024, we can see valid statistics {code:java} hive --orcfiledump s3://bucket/test/test_orcProcessing data file s3://bucket/test/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min:
[jira] [Created] (SPARK-48460) Spark ORC writer generates incorrect meta information(mix, max)
Volodymyr T created SPARK-48460: --- Summary: Spark ORC writer generates incorrect meta information(mix, max) Key: SPARK-48460 URL: https://issues.apache.org/jira/browse/SPARK-48460 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.4.3, 3.3.4, 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.3.2, 3.4.2, 3.3.3, 3.2.4, 3.2.3, 3.3.1, 3.2.2, 3.3.0, 3.2.1 Reporter: Volodymyr T We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 and higher versions which contain long strings. Steps to reproduce the issue: 1) Create DF with a string longer than 1024 {code:java} val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;"){code} {code:java} val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;"){code} {code:java} valid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1024| +---++++{code} {code:java} invalid.withColumn("len", length($"string")).show() +---++++ | id|null| string| len| +---++++ | 1|null|A...|1025| +---++++{code} 2. Write in ORC format to S3 {code:java} valid.write.format("orc") .option("path", "s3://bucket/test/test_orc/") .option("compression", "zlib") .mode("overwrite") .save(){code} 3. 
Check the ORC metadata with the *hive --orcfiledump* command {code:java} [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code} We can see incorrect statistics for the string column {code:java} Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025{code} {code:java} Processing data file s3://bucket-dev/tets/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025 Stripes: Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 2 section ROW_INDEX start: 38 length 19 Stream: column 3 section ROW_INDEX start: 57 length 54 Stream: column 1 section DATA start: 111 length 6 Stream: column 2 section PRESENT start: 117 length 5 Stream: column 2 section DATA start: 122 length 0 Stream: column 2 section LENGTH start: 122 length 0 Stream: column 2 section DICTIONARY_DATA start: 122 length 0 Stream: column 3 section DATA start: 122 length 16 Stream: column 3 section LENGTH start: 138 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DICTIONARY_V2[0] Encoding column 3: DIRECT_V2 File length: 488 bytesPadding length: 0 bytesPadding ratio: 0% User Metadata: 
org.apache.spark.version=3.4.1{code} For DF with a value smaller than 1024, we can see valid statistics {code:java} hive --orcfiledump s3://bucket/test/test_orcProcessing data file s3://bucket/test/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]Structure for s3://timmedia-dev/volodymyr/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orcFile Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: 262144Type: struct Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: A
[jira] [Updated] (SPARK-48460) Spark ORC writer generates incorrect meta information(min, max)
[ https://issues.apache.org/jira/browse/SPARK-48460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr T updated SPARK-48460: Summary: Spark ORC writer generates incorrect meta information(min, max) (was: Spark ORC writer generates incorrect meta information(mix, max)) > Spark ORC writer generates incorrect meta information(min, max) > --- > > Key: SPARK-48460 > URL: https://issues.apache.org/jira/browse/SPARK-48460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, 3.4.2, > 3.3.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: Volodymyr T >Priority: Major > > We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 > and higher versions which contain long strings. > Steps to reproduce the issue: > 1) Create DF with a string longer than 1024 > > {code:java} > val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, > lpad('A', 1024, 'A') as string;"){code} > {code:java} > val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, > lpad('A', 1025, 'A') as string;"){code} > {code:java} > valid.withColumn("len", length($"string")).show() > +---++++ | id|null| string| len| > +---++++ | 1|null|A...|1024| > +---++++{code} > > {code:java} > invalid.withColumn("len", length($"string")).show() > +---++++ | id|null| string| len| > +---++++ | 1|null|A...|1025| > +---++++{code} > 2. Write in ORC format to S3 > {code:java} > valid.write.format("orc") > .option("path", "s3://bucket/test/test_orc/") > .option("compression", "zlib") > .mode("overwrite") > .save(){code} > 3. 
Check ORC meta by *hive –orcfiledump* command > {code:java} > [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code} > We can see incorrect statistics for column string > {code:java} > Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: > 1025{code} > {code:java} > Processing data file > s3://bucket-dev/tets/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc > [length: 488]Structure for > s3://timmedia-dev/volodymyr/test_orc/part-0-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orcFile > Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: > 262144Type: struct > Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column > 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: > count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false > bytesOnDisk: 23 min: null max: null sum: 1025 > File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 > hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: count: 0 > hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false bytesOnDisk: > 23 min: null max: null sum: 1025 > Stripes: Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108 Stream: > column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section > ROW_INDEX start: 14 length 24 Stream: column 2 section ROW_INDEX start: 38 > length 19 Stream: column 3 section ROW_INDEX start: 57 length 54 > Stream: column 1 section DATA start: 111 length 6 Stream: column 2 section > PRESENT start: 117 length 5 Stream: column 2 section DATA start: 122 > length 0 Stream: column 2 section LENGTH start: 122 length 0 Stream: > column 2 section DICTIONARY_DATA start: 122 length 0 Stream: column 3 > section DATA start: 122 length 16 Stream: column 3 section LENGTH start: > 138 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 > Encoding column 2: DICTIONARY_V2[0] Encoding column 3: DIRECT_V2 > File length: 488 bytesPadding length: 0 
bytesPadding ratio: 0% > User Metadata: org.apache.spark.version=3.4.1{code} > For DF with a value smaller than 1024, we can see valid statistics > {code:java} > hive --orcfiledump s3://bucket/test/test_orcProcessing data file > s3://bucket/test/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc > [length: 485]Structure for > s3://timmedia-dev/volodymyr/test_orc/part-0-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orcFile > Version: 0.12 with FUTURERows: 1Compression: ZLIBCompression size: > 262144Type: struct > Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column > 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1 Column 2: > count: 0 hasNull: true bytesOnDisk: 5 Column 3: count: 1 hasNull: false > bytesOnDisk: 23 min: > AA
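[Editor's note, not part of the ticket: the two dumps suggest that the writer records string min/max only up to a length limit (apparently 1024 bytes here) and emits no min/max for longer values, while `sum` (total bytes) is still tracked; Hive's concatenation then stumbles over the missing min/max. The following is only an illustrative Python model of that symptom, not the actual ORC writer code, and the 1024-byte threshold is an assumption inferred from the report.]

```python
# Illustrative model of the observed behavior: min/max statistics are
# dropped for string columns containing a value longer than STATS_LIMIT,
# while count and sum (total byte length) are still recorded.
STATS_LIMIT = 1024  # assumed threshold, inferred from the report

def column_stats(values, limit=STATS_LIMIT):
    stats = {"count": len(values), "sum": sum(len(v) for v in values)}
    if all(len(v) <= limit for v in values):
        stats["min"], stats["max"] = min(values), max(values)
    else:
        stats["min"] = stats["max"] = None  # dropped for long strings
    return stats

print(column_stats(["A" * 1024]))  # min/max present, sum: 1024
print(column_stats(["A" * 1025]))  # min/max are None, sum: 1025
```

This mirrors the dumps above: the 1024-character file shows real min/max, the 1025-character file shows `min: null max: null sum: 1025`.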
[jira] [Commented] (SPARK-48361) Correctness: CSV corrupt record filter with aggregate ignored
[ https://issues.apache.org/jira/browse/SPARK-48361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850431#comment-17850431 ] Ted Chester Jenks commented on SPARK-48361: --- Yeah, it all makes sense, but a little less than ideal. For my purposes, this workaround is good enough, so if there is no interest in pursuing this further I am happy to close out. However, I do still believe this could cause issues for others. Perhaps the pruning should be disabled if there is a corrupt record column in use? > Correctness: CSV corrupt record filter with aggregate ignored > - > > Key: SPARK-48361 > URL: https://issues.apache.org/jira/browse/SPARK-48361 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 > Environment: Using spark shell 3.5.1 on M1 Mac >Reporter: Ted Chester Jenks >Priority: Major > > Using corrupt record in CSV parsing for some data cleaning logic, I came > across a correctness bug. > > The following repro can be run with spark-shell 3.5.1. 
> *Create test.csv with the following content:* > {code:java} > test,1,2,three > four,5,6,seven > 8,9 > ten,11,12,thirteen {code} > > > *In spark-shell:* > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > > // define a STRING, DOUBLE, DOUBLE, STRING schema for the data > val schema = StructType(List(StructField("column1", StringType, true), > StructField("column2", DoubleType, true), StructField("column3", DoubleType, > true), StructField("column4", StringType, true))) > > // add a column for corrupt records to the schema > val schemaWithCorrupt = StructType(schema.fields :+ > StructField("_corrupt_record", StringType, true)) > > // read the CSV with the schema, headers, permissive parsing, and the corrupt > record column > val df = spark.read.option("header", "true").option("mode", > "PERMISSIVE").option("columnNameOfCorruptRecord", > "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") > > // define a UDF to count the commas in the corrupt record column > val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else > -1) > > // add a true/false column for whether the number of commas is 3 > val dfWithJagged = df.withColumn("__is_jagged", > when(col("_corrupt_record").isNull, > false).otherwise(countCommas(col("_corrupt_record")) =!= 3)) > dfWithJagged.show(){code} > *Returns:* > {code:java} > +---+---+---++---+---+ > |column1|column2|column3| column4|_corrupt_record|__is_jagged| > +---+---+---++---+---+ > | four| 5.0| 6.0| seven| NULL| false| > | 8| 9.0| NULL| NULL| 8,9| true| > | ten| 11.0| 12.0|thirteen| NULL| false| > +---+---+---++---+---+ {code} > So far so good... 
> > *BUT* > > *If we add an aggregate before we show:* > {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val schema = StructType(List(StructField("column1", StringType, true), > StructField("column2", DoubleType, true), StructField("column3", DoubleType, > true), StructField("column4", StringType, true))) > val schemaWithCorrupt = StructType(schema.fields :+ > StructField("_corrupt_record", StringType, true)) > val df = spark.read.option("header", "true").option("mode", > "PERMISSIVE").option("columnNameOfCorruptRecord", > "_corrupt_record").schema(schemaWithCorrupt).csv("test.csv") > val countCommas = udf((s: String) => if (s != null) s.count(_ == ',') else > -1) > val dfWithJagged = df.withColumn("__is_jagged", > when(col("_corrupt_record").isNull, > false).otherwise(countCommas(col("_corrupt_record")) =!= 3)) > val dfDropped = dfWithJagged.filter(col("__is_jagged") =!= true) > val groupedSum = > dfDropped.groupBy("column1").agg(sum("column2").alias("sum_column2")) > groupedSum.show(){code} > *We get:* > {code:java} > +---+---+ > |column1|sum_column2| > +---+---+ > | 8| 9.0| > | four| 5.0| > | ten| 11.0| > +---+---+ {code} > > *Which is not correct* > > With the addition of the aggregate, the filter down to rows with 3 commas in > the corrupt record column is ignored. This does not happen with any other > operators I have tried - just aggregates so far. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
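[Editor's note, not part of the ticket: the cleaning predicate itself is easy to verify outside Spark. The Spark-free sketch below replays the repro's logic on plain tuples: a record is "jagged" when its corrupt-record text is present and does not contain exactly 3 commas (i.e. it did not split into 4 fields). The expected, correct result keeps only `four` and `ten`; the reported bug is that after `groupBy`/`agg` the `8` row reappears.]

```python
# Spark-free sketch of the report's cleaning predicate: flag rows whose
# corrupt-record text exists and does not contain exactly 3 commas.

def count_commas(s):
    # mirrors the countCommas UDF: -1 for rows that parsed cleanly
    return s.count(",") if s is not None else -1

def is_jagged(corrupt_record):
    return False if corrupt_record is None else count_commas(corrupt_record) != 3

# (column1, _corrupt_record) pairs matching the repro's parsed rows
rows = [("four", None), ("8", "8,9"), ("ten", None)]
kept = [c1 for c1, corrupt in rows if not is_jagged(corrupt)]
print(kept)  # ['four', 'ten'] -- the '8' row must be filtered out
```

Any aggregate computed over `kept` should therefore never see the `8` row, which is what makes the `sum_column2` output above incorrect.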
[jira] [Updated] (SPARK-48423) Unable to write MLPipeline to blob storage using .option attribute
[ https://issues.apache.org/jira/browse/SPARK-48423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chhavi Bansal updated SPARK-48423: -- Priority: Blocker (was: Major) > Unable to write MLPipeline to blob storage using .option attribute > -- > > Key: SPARK-48423 > URL: https://issues.apache.org/jira/browse/SPARK-48423 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, Spark Core >Affects Versions: 3.4.3 >Reporter: Chhavi Bansal >Priority: Blocker > > I am trying to write mllib pipeline with a series of stages set in it to > azure blob storage giving relevant write parameters, but it still complains > of `fs.azure.account.key` not being found in the configuration. > Sharing the code. > {code:java} > val spark = > SparkSession.builder().appName("main").master("local[4]").getOrCreate() > import spark.implicits._ > val df = spark.createDataFrame(Seq( > (0L, "a b c d e spark"), > (1L, "b d") > )).toDF("id", "text") > val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text") > val pipelinee = new Pipeline().setStages(Array(si)) > val pipelineModel = pipelinee.fit(df) > val path = BLOB_STORAGE_PATH > pipelineModel.write > .option("spark.hadoop.fs.azure.account.key..dfs.core.windows.net", > "__").option("fs.azure.account.key..dfs.core.windows.net", > "__").option("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", > > "__").option("fs.azure.account.oauth2.client.id..dfs.core.windows.net", > > "__").option("fs.azure.account.auth.type..dfs.core.windows.net","__").option("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", > > "__").option("fs.azure.account.oauth.provider.type..dfs.core.windows.net", > "__") > .save(path){code} > > The error that i get is > {code:java} > Failure to initialize configuration > Caused by: InvalidConfigurationValueException: Invalid configuration value > detected for fs.azure.account.key > at > 
org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51) > at > org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548) > at > org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449){code} > This shows that even though the key/value of > {code:java} > spark.hadoop.fs.azure.account.key..dfs.core.windows.net {code} > is sent via the option parameter, it is not being set internally. > > This works only if I explicitly set the values with > {code:java} > spark.conf.set(key,value) {code} > which may be problematic for a multi-tenant solution that uses > the same Spark context. > One other observation: > {code:java} > df.write.option(key1,value1).option(key2,value2).save(path) {code} > fails with the same key error, while > {code:java} > map = Map(key1->value1, key2->value2) > df.write.options(map).save(path) {code} > works. > > My ask is that, just as the DataFrame > {code:java} > write.options(Map) {code} > API helps set this configuration, .option(key1, value1) should also work for > writing to Azure Blob Storage. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
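[Editor's note, not part of the ticket: since the reporter finds that passing all settings as one map works while chained `.option(k, v)` calls do not, a practical interim pattern is to collect every per-tenant filesystem setting into a single mapping and hand it over in one `options(...)` call. The plain-Python sketch below only shows the map-building step; `tenant_options`, the account name `myaccount`, and the placeholder values are illustrative, not a Spark or Hadoop API.]

```python
# Sketch: collect per-tenant ABFS settings into one mapping, to be passed
# in a single options(...) call rather than chained option(k, v) calls.
# The helper and key/value names are hypothetical placeholders.

def tenant_options(account, key, extra=None):
    """Build the option map for one tenant's storage account."""
    opts = {f"fs.azure.account.key.{account}.dfs.core.windows.net": key}
    opts.update(extra or {})        # merge any additional auth settings
    return opts

opts = tenant_options(
    "myaccount", "SECRET",
    {"fs.azure.account.auth.type.myaccount.dfs.core.windows.net": "SharedKey"},
)
print(sorted(opts))
```

In Scala this would correspond to the working path from the report, `df.write.options(map).save(path)`; whether `MLWriter` honors per-call options the same way is exactly what the ticket is asking for.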
[jira] [Reopened] (SPARK-48322) Drop internal metadata in `DataFrame.schema`
[ https://issues.apache.org/jira/browse/SPARK-48322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reopened SPARK-48322: --- > Drop internal metadata in `DataFrame.schema` > > > Key: SPARK-48322 > URL: https://issues.apache.org/jira/browse/SPARK-48322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48322) Drop internal metadata in `DataFrame.schema`
[ https://issues.apache.org/jira/browse/SPARK-48322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-48322: - Assignee: (was: Ruifeng Zheng) > Drop internal metadata in `DataFrame.schema` > > > Key: SPARK-48322 > URL: https://issues.apache.org/jira/browse/SPARK-48322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-42965) metadata mismatch for StructField when running some tests.
[ https://issues.apache.org/jira/browse/SPARK-42965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reopened SPARK-42965: --- Assignee: (was: Ruifeng Zheng) > metadata mismatch for StructField when running some tests. > -- > > Key: SPARK-42965 > URL: https://issues.apache.org/jira/browse/SPARK-42965 > Project: Spark > Issue Type: Improvement > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > For some reason, the metadata of `StructField` differs in a few tests > when using Spark Connect, even though the function works properly. > For example, running `python/run-tests --testnames > 'pyspark.pandas.tests.connect.data_type_ops.test_parity_binary_ops > BinaryOpsParityTests.test_add'` complains `AssertionError: > ([InternalField(dtype=int64, struct_field=StructField('bool', LongType(), > False))], [StructField('bool', LongType(), False)])` because the metadata differs > by something like `\{'__autoGeneratedAlias': 'true'}`; the fields have the same > name, type, and nullability, so the function itself works fine. > Therefore, we have temporarily added a branch for Spark Connect in the code > so that we can create InternalFrame properly and provide more pandas APIs in > Spark Connect. If a clear cause is found, we may need to revert it back to > its original state. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48402) Virtualenv section of "Python Package Management" does not work on Windows
[ https://issues.apache.org/jira/browse/SPARK-48402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850347#comment-17850347 ] Wolff Bock von Wuelfingen commented on SPARK-48402: --- Additional notes: the `#environment` suffix can be safely omitted and ignored; it does not work. From what I've gathered from third-party websites, it's supposed to extract the archive contents into a folder of that name. That is not the case. Also, the guide is missing a very crucial part: you have to activate the venv yourself, from your Python script, before you import other libraries. Example: {code:java} # Activate the virtual environment containing our dependencies activate_env = os.path.expanduser("/usr/local/app/.venv/bin/activate_this.py") exec(open(activate_env).read(), dict(__file__=activate_env)) {code} This third-party guide explains it nicely: https://saturncloud.io/blog/shipping-virtual-environments-with-pyspark-a-comprehensive-guide/ > Virtualenv section of "Python Package Management" does not work on Windows > -- > > Key: SPARK-48402 > URL: https://issues.apache.org/jira/browse/SPARK-48402 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.3 >Reporter: Wolff Bock von Wuelfingen >Priority: Minor > > [https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html] > has a section about using virtualenv, including venv-pack. venv-pack does not > currently work on Windows: [https://github.com/jcrist/venv-pack/issues/6] > Either the docs need adjusting to reflect that, or a different tool needs to > be used. I'm far from a Python expert, so I have no idea if the latter is > even possible. > It would be nice to add a hint, to save people some time (maybe suggesting > WSL instead?). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48459) Implement DataFrameQueryContext in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48459: --- Labels: pull-request-available (was: ) > Implement DataFrameQueryContext in Spark Connect > > > Key: SPARK-48459 > URL: https://issues.apache.org/jira/browse/SPARK-48459 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > Implements the same https://github.com/apache/spark/pull/45377 in Spark > Connect -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48459) Implement DataFrameQueryContext in Spark Connect
Hyukjin Kwon created SPARK-48459: Summary: Implement DataFrameQueryContext in Spark Connect Key: SPARK-48459 URL: https://issues.apache.org/jira/browse/SPARK-48459 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon Implements the same https://github.com/apache/spark/pull/45377 in Spark Connect -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48458) Dynamic partition override mode might be ignored in certain scenarios causing data loss
Artem Kupchinskiy created SPARK-48458: - Summary: Dynamic partition override mode might be ignored in certain scenarios causing data loss Key: SPARK-48458 URL: https://issues.apache.org/jira/browse/SPARK-48458 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 2.4.8, 4.0.0 Reporter: Artem Kupchinskiy If an active Spark session is stopped in the middle of an insert into a file system, the session config responsible for partition-overwrite behavior might not be respected. The failure scenario is basically the following: # The Spark context is stopped just before [getting the partition override mode setting|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L69] # Due to the [fallback config usage in case of a stopped Spark context,|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L121] this mode evaluates to static (the default mode in the default SQLConf used as a fallback) # The data is cleared entirely [here|https://github.com/apache/spark/blob/8bbbde7cb3c396bc369c06853ed3a2ec021a2530/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L131], which is literally data loss from the perspective of a user who intended to overwrite the data only partially. This [gist|https://gist.github.com/akupchinskiy/b5f31781d59e5c0e9b172e7de40132cd] reproduces the behavior. On my local machine, it takes 1-3 iterations to see the pre-created data cleared completely. A mitigation for this bug is to use the explicit write parameter `partitionOverwriteMode` instead of relying on the session configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
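The mitigation mentioned in the ticket can be sketched as a per-write option (a PySpark sketch with placeholder data and output path, not the reproduction from the gist). Pinning `partitionOverwriteMode` on the writer itself keeps the behavior independent of the session-scoped configuration that a stopped context falls back on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, "2024-01-01"), (2, "2024-01-02")], ["value", "dt"])

(df.write
   .mode("overwrite")
   .option("partitionOverwriteMode", "dynamic")  # per-write, not session-level conf
   .partitionBy("dt")
   .parquet("/tmp/target_table"))
```

With the per-write option, only the partitions present in `df` are replaced, regardless of what `spark.sql.sources.partitionOverwriteMode` resolves to at execution time.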
[jira] [Commented] (SPARK-47415) Levenshtein (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850313#comment-17850313 ] Uroš Bojanić commented on SPARK-47415: -- [~nikolamand-db] [~nebojsa-db] For more context on this task: we have concluded that this is a hard problem and the current algorithm doesn't extend well to other collations. For this reason, we will now switch over to a pass-through approach: [https://github.com/apache/spark/pull/46788] Since there is little hope of adding actual collation support in the future, this means that the Levenshtein expression will work on string arguments with any given collation, but *will not* take that collation into consideration when calculating the edit distance between two strings. > Levenshtein (all collations) > > > Key: SPARK-47415 > URL: https://issues.apache.org/jira/browse/SPARK-47415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *Levenshtein* built-in string function in > Spark. First confirm the expected behaviour for this function when given > collated strings, then move on to implementation and testing. > Implement the corresponding unit tests and E2E SQL tests to reflect how this > function should be used with collation in Spark SQL, and feel free to use your > chosen Spark SQL editor to experiment with the existing functions to learn > more about how they work. In addition, look into the possible use cases and > implementations of similar functions within other open-source DBMSs, such > as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal of this Jira ticket is to implement the *Levenshtein* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). > > Read more about the ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
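For reference, the edit distance in question is the classic dynamic-programming Levenshtein distance (a minimal sketch below, not Spark's implementation). Under the pass-through approach described in the comment, character comparison stays binary, so for example 'ABC' vs 'abc' has distance 3 even under a case-insensitive collation:

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```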
[jira] [Created] (SPARK-48457) Testing and operational readiness
David Milicevic created SPARK-48457: --- Summary: Testing and operational readiness Key: SPARK-48457 URL: https://issues.apache.org/jira/browse/SPARK-48457 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic We are basically doing this as we are developing the feature. This work item should serve as a checkpoint by the end of M0 to figure out if we have covered everything. Testing is very clearly defined by itself. For the operational readiness part, we are still to figure out what exactly we can do in the case of SQL scripting. It's a really straightforward feature and public documentation should serve well enough for most of the issues we might encounter. But, we should probably think about: * Some KPI indicators. * Telemetry. * Something else? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48456) Performance benchmark
David Milicevic created SPARK-48456: --- Summary: Performance benchmark Key: SPARK-48456 URL: https://issues.apache.org/jira/browse/SPARK-48456 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic Performance parity is officially an M2 requirement, but by the end of M0 I think we should start doing some perf benchmarks to figure out where we stand at the beginning and whether we need to change something right from the start, before we get to work on more complex stuff. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48455) Public documentation
David Milicevic created SPARK-48455: --- Summary: Public documentation Key: SPARK-48455 URL: https://issues.apache.org/jira/browse/SPARK-48455 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic Public documentation is officially Milestone 1 requirement, but I think we should start working on this even during Milestone 0. I guess this shouldn't be anything revolutionary, just a basic doc with SQL Scripting grammar and functions explained properly. We might want to sync with Serge about this to figure out if he has any thoughts before we start working on it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48454: -- Assignee: (was: Apache Spark) > Directly use the parent dataframe class > --- > > Key: SPARK-48454 > URL: https://issues.apache.org/jira/browse/SPARK-48454 > Project: Spark > Issue Type: Improvement > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48454: -- Assignee: Apache Spark > Directly use the parent dataframe class > --- > > Key: SPARK-48454 > URL: https://issues.apache.org/jira/browse/SPARK-48454 > Project: Spark > Issue Type: Improvement > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48416) Support related nested WITH expression
[ https://issues.apache.org/jira/browse/SPARK-48416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48416: -- Assignee: (was: Apache Spark) > Support related nested WITH expression > -- > > Key: SPARK-48416 > URL: https://issues.apache.org/jira/browse/SPARK-48416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Mingliang Zhu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48416) Support related nested WITH expression
[ https://issues.apache.org/jira/browse/SPARK-48416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48416: -- Assignee: Apache Spark > Support related nested WITH expression > -- > > Key: SPARK-48416 > URL: https://issues.apache.org/jira/browse/SPARK-48416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Mingliang Zhu >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48454) Directly use the parent dataframe class
[ https://issues.apache.org/jira/browse/SPARK-48454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48454: --- Labels: pull-request-available (was: ) > Directly use the parent dataframe class > --- > > Key: SPARK-48454 > URL: https://issues.apache.org/jira/browse/SPARK-48454 > Project: Spark > Issue Type: Improvement > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48377) Multiple results API - sqlScript()
[ https://issues.apache.org/jira/browse/SPARK-48377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850298#comment-17850298 ] David Milicevic commented on SPARK-48377: - I took it on myself to write an API proposal (with atomic's help). > Multiple results API - sqlScript() > -- > > Key: SPARK-48377 > URL: https://issues.apache.org/jira/browse/SPARK-48377 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > For now: > * Write an API proposal > ** The API itself should be fine, but we need to figure out what the result > set should look like, i.e. in what format we return multiple DataFrames. > ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as > well. > * Figure out how the API will propagate down the Spark Connect stack > (depends on SPARK-48452 investigation) > > Probably to be separated into multiple subtasks in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48453) Support for PRINT/TRACE statement
David Milicevic created SPARK-48453: --- Summary: Support for PRINT/TRACE statement Key: SPARK-48453 URL: https://issues.apache.org/jira/browse/SPARK-48453 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic This is not defined in the [Ref Spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit#heading=h.4cz970y1mk93], but during the POC we figured out that it might be useful. We still need to figure out the details when we get to it; the propagation to the client and to the UI on the client side might not be trivial, so this needs further investigation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48377) Multiple results API - sqlScript()
[ https://issues.apache.org/jira/browse/SPARK-48377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Milicevic updated SPARK-48377: Description: For now: * Write an API proposal ** The API itself should be fine, but we need to figure out what the result set should look like, i.e. in what format we return multiple DataFrames. ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as well. * Figure out how the API will propagate down the Spark Connect stack (depends on SPARK-48452 investigation) Probably to be separated into multiple subtasks in the future. was: TBD. Probably to be separated into multiple subtasks. > Multiple results API - sqlScript() > -- > > Key: SPARK-48377 > URL: https://issues.apache.org/jira/browse/SPARK-48377 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > For now: > * Write an API proposal > ** The API itself should be fine, but we need to figure out what the result > set should look like, i.e. in what format we return multiple DataFrames. > ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as > well. > * Figure out how the API will propagate down the Spark Connect stack > (depends on SPARK-48452 investigation) > > Probably to be separated into multiple subtasks in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48452) Spark Connect investigation
[ https://issues.apache.org/jira/browse/SPARK-48452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850296#comment-17850296 ] David Milicevic commented on SPARK-48452: - [~milan.dankovic] is working on this. > Spark Connect investigation > --- > > Key: SPARK-48452 > URL: https://issues.apache.org/jira/browse/SPARK-48452 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > > Some notebook modes, VS code extension, etc. execute SQL commands through > Spark Connect. > We need to: > - Figure out exceptions that we are getting in Spark Connect stack for SQL > scripts. > - Understand the Spark Connect stack better, so we can more easily propose > design for new API(s) we are going to introduce in the future. > > For more details, design doc can be found in parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48452) Spark Connect investigation
David Milicevic created SPARK-48452: --- Summary: Spark Connect investigation Key: SPARK-48452 URL: https://issues.apache.org/jira/browse/SPARK-48452 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: David Milicevic Some notebook modes, VS code extension, etc. execute SQL commands through Spark Connect. We need to: - Figure out exceptions that we are getting in Spark Connect stack for SQL scripts. - Understand the Spark Connect stack better, so we can more easily propose design for new API(s) we are going to introduce in the future. For more details, design doc can be found in parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org