[jira] [Created] (SPARK-48578) Add new expressions for UTF8 string validation

2024-06-09 Thread Jira
Uroš Bojanić created SPARK-48578:


 Summary: Add new expressions for UTF8 string validation
 Key: SPARK-48578
 URL: https://issues.apache.org/jira/browse/SPARK-48578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
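
The ticket carries no further description. For illustration only: UTF-8 string validation amounts to checking that a byte sequence is well-formed UTF-8. A minimal, Spark-independent Scala sketch using the standard java.nio decoder (not Spark's actual implementation of the new expressions):

    import java.nio.ByteBuffer
    import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

    // Returns true iff `bytes` is a well-formed UTF-8 byte sequence.
    // REPORT makes the decoder throw on malformed input instead of repairing it.
    def isValidUtf8(bytes: Array[Byte]): Boolean = {
      val decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
      try { decoder.decode(ByteBuffer.wrap(bytes)); true }
      catch { case _: CharacterCodingException => false }
    }

    // isValidUtf8("abc".getBytes("UTF-8"))          // true
    // isValidUtf8(Array(0xC3.toByte, 0x28.toByte))  // false: truncated 2-byte sequence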









[jira] [Created] (SPARK-48577) Replace invalid byte sequences in UTF8Strings

2024-06-09 Thread Jira
Uroš Bojanić created SPARK-48577:


 Summary: Replace invalid byte sequences in UTF8Strings
 Key: SPARK-48577
 URL: https://issues.apache.org/jira/browse/SPARK-48577
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
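
The ticket carries no further description. For illustration only: replacing invalid byte sequences typically means substituting the Unicode replacement character U+FFFD for bytes that do not form well-formed UTF-8. A minimal, Spark-independent Scala sketch using java.nio (not Spark's actual implementation):

    import java.nio.ByteBuffer
    import java.nio.charset.{CodingErrorAction, StandardCharsets}

    // Decodes raw bytes as UTF-8, replacing every malformed or unmappable
    // sequence with the replacement character U+FFFD.
    def makeValidUtf8(bytes: Array[Byte]): String = {
      val decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
      decoder.decode(ByteBuffer.wrap(bytes)).toString
    }

    // makeValidUtf8(Array(0xC3.toByte, 0x28.toByte))  // "\uFFFD(" – invalid pair repaired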









[jira] [Created] (SPARK-48576) Rename UTF8_BINARY_LCASE to UTF8_LCASE

2024-06-09 Thread Jira
Uroš Bojanić created SPARK-48576:


 Summary: Rename UTF8_BINARY_LCASE to UTF8_LCASE
 Key: SPARK-48576
 URL: https://issues.apache.org/jira/browse/SPARK-48576
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić









[jira] [Resolved] (SPARK-48560) Make StreamingQueryListener.spark settable

2024-06-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48560.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46909
[https://github.com/apache/spark/pull/46909]

> Make StreamingQueryListener.spark settable
> --
>
> Key: SPARK-48560
> URL: https://issues.apache.org/jira/browse/SPARK-48560
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Downstream users might already implement StreamingQueryListener.spark.






[jira] [Updated] (SPARK-48575) spark.history.fs.update.interval calling too many directory pollings when spark log dir contains many sparkEvent apps

2024-06-09 Thread Arnaud Nauwynck (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arnaud Nauwynck updated SPARK-48575:

Attachment: EventLogFileReaders.patch

> spark.history.fs.update.interval calling too many directory pollings when 
> spark log dir contains many sparkEvent apps 
> --
>
> Key: SPARK-48575
> URL: https://issues.apache.org/jira/browse/SPARK-48575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.3, 4.0.0, 3.5.1, 3.4.3
>Reporter: Arnaud Nauwynck
>Priority: Critical
> Attachments: EventLogFileReaders.patch
>
>
> In the case of a Spark log dir containing a lot of Spark eventLog sub-dirs 
> (for example 1000), running a supposedly "idle" Spark History Server causes 
> millions of directory listing calls each hour.
> see code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L283]
> example: with ~1000 apps, every 10 seconds (default of 
> "spark.history.fs.update.interval") the Spark History Server performs
> - 1x VirtualFileSystem.listStatus(path), with path = the Spark log dir
> - then 2x for each appSubDirPath (corresponding to one Spark app's eventLogs)
>   => 2 x 1000 x VirtualFileSystem.listStatus(appSubDirPath)
> On a cloud provider (for example Azure), this costs a lot per month, 
> because "List FileSystem" calls cost ~$0.065 per 10,000 ops for the "Hot" tier 
> or $0.0228 per 10,000 ops for "Premium" (cf. 
> https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/ )
> Let's do the multiplications:
> 30 (days per month) * 86400 (sec per day) / 10 (interval in seconds) = 259,200 
> update rounds
> ... * 2001 (listing ops per update) = ~518 million listing calls per month
> ... * 0.0228 / 10,000 = ~1182 USD/month
> Admittedly, the retention conf "spark.history.fs.cleaner.maxAge" (default 
> = 7d) for Spark eventLogs is too long for workflows that run many short Spark 
> apps, and it would be possible to reduce it.
> It is extremely important to reduce these recurring costs.
> Here are several wishes:
> 1/ Fix the "bug" in the Spark History Server that calls 
> VirtualFileSystem.listStatus(appSubDirPath) twice (see the sketch below).
> cf. source code: [first call 
> EventLogFileReaders.scala#L123|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala#L123]
> : it only tests that the dir contains a child file with name prefix 
> "eventLog" and an appStatus file, but the listing is then discarded.
> It creates an instance of RollingEventLogFilesFileReader, and shortly after, 
> the listing is performed again:
> cf. [second call (lazy field) 
> EventLogFileReaders.scala#L224|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala#L224]
> The lazy field "files" is evaluated immediately after object creation, from 
> here:
> [second call from 
> EventLogFileReaders.scala#L252|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala#L252]
> .. [called from FsHistoryProvider.scala#L506 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L506]
> Indeed, it is easily possible to perform only 1 listing per sub-dir 
> (cf. attached patch, changing ~5 lines of code).
> This would divide the cost by 2.
> 2/ In addition to the conf "spark.history.fs.cleaner.maxAge", add another conf 
> param "spark.history.fs.cleaner.maxCount" to limit the number of Spark apps. 
> This could default to ~50.
> This would additionally divide the cost by 10 (in case you have 1000 apps).
> 3/ Change the Spark History Server code to check lazily for updates, only on 
> demand when someone clicks in the Spark History web UI. For example, if the 
> last cached update happened less than "spark.history.fs.update.interval" ago 
> then no update is needed; otherwise the update is performed immediately and 
> cached before returning the response.
> 4/ Change the Spark History Server code to avoid doing a listing on each app 
> sub-dir.
> It is possible to perform a single listing on the top-level "sparkLog" dir, to 
> discover new apps.
> Then, for each app sub-dir: most of them are already finished, and already 
> recompacted by the Spark History Server itself. This info is already stored in 
> the Spark History KVStore db.
> Almost all the sub-dir listings can therefore be completely avoided.
> see [KVStore declaration 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L141],
> 
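
For wish 1/ above, a rough, hedged sketch of the single-listing idea: list the application sub-directory once and hand that listing to the reader, so the reader no longer calls listStatus a second time. The names readerFor and CachedListingReader are illustrative only, not Spark's actual classes; the real change is in the attached patch.

    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    // One listStatus per app sub-dir: the same FileStatus array is used both to
    // decide whether the dir looks like an event-log dir and, later, by the
    // reader itself, so the second listing disappears.
    class CachedListingReader(val appDir: Path, listing: Array[FileStatus]) {
      // Event log files of this app, in order; no further filesystem call needed.
      def eventLogFiles: Seq[FileStatus] =
        listing.filter(_.getPath.getName.startsWith("eventLog"))
          .sortBy(_.getPath.getName).toSeq
    }

    def readerFor(fs: FileSystem, appDir: Path): Option[CachedListingReader] = {
      val entries = fs.listStatus(appDir)            // the only listing call
      // Prefixes follow the description above ("eventLog" plus an appStatus file).
      val hasEventLog = entries.exists(_.getPath.getName.startsWith("eventLog"))
      val hasAppStatus = entries.exists(_.getPath.getName.startsWith("appstatus"))
      if (hasEventLog && hasAppStatus) Some(new CachedListingReader(appDir, entries))
      else None
    }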

[jira] [Updated] (SPARK-48575) spark.history.fs.update.interval calling too many directory pollings when spark log dir contains many sparkEvent apps

2024-06-09 Thread Arnaud Nauwynck (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arnaud Nauwynck updated SPARK-48575:

Attachment: (was: EventLogFileReaders.patch)

> spark.history.fs.update.interval calling too many directory pollings when 
> spark log dir contains many sparkEvent apps 
> --
>
> Key: SPARK-48575
> URL: https://issues.apache.org/jira/browse/SPARK-48575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.3, 4.0.0, 3.5.1, 3.4.3
>Reporter: Arnaud Nauwynck
>Priority: Critical
>

[jira] [Assigned] (SPARK-48012) SPJ: Support Transform Expressions for One Side Shuffle

2024-06-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-48012:


Assignee: Szehon Ho

> SPJ: Support Transform Expressions for One Side Shuffle
> ---
>
> Key: SPARK-48012
> URL: https://issues.apache.org/jira/browse/SPARK-48012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ, if 
> the other side uses KeyGroupedPartitioning. However, the support only covered 
> a KeyGroupedPartitioning without any partition transform (day, year, bucket). 
> It would be useful to add support for partition transforms as well, as many 
> tables are partitioned by those transforms.
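
For context, a hedged illustration of the kind of table layout this targets: a DataSourceV2 table whose partitioning uses transforms such as days() and bucket(). With this change, SPJ can shuffle just the non-key-grouped side to match such a transformed layout. The catalog and table names below are hypothetical, and a V2 catalog (for example Iceberg) must be configured for writeTo(...).partitionedBy to work:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{bucket, col, current_timestamp, days, lit}

    object SpjTransformExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("spj-transform-example").getOrCreate()

        // A fact table partitioned by day(event_ts) and bucket(16, user_id);
        // "cat.db.events" is a placeholder for a table in a configured V2 catalog.
        val events = spark.range(0, 1000).toDF("user_id")
          .withColumn("event_ts", current_timestamp())

        events.writeTo("cat.db.events")
          .partitionedBy(days(col("event_ts")), bucket(lit(16), col("user_id")))
          .create()

        spark.stop()
      }
    }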






[jira] [Resolved] (SPARK-48012) SPJ: Support Transform Expressions for One Side Shuffle

2024-06-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-48012.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46255
[https://github.com/apache/spark/pull/46255]

> SPJ: Support Transform Expressions for One Side Shuffle
> ---
>
> Key: SPARK-48012
> URL: https://issues.apache.org/jira/browse/SPARK-48012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.3
>Reporter: Szehon Ho
>Assignee: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ, if 
> the other side uses KeyGroupedPartitioning. However, the support only covered 
> a KeyGroupedPartitioning without any partition transform (day, year, bucket). 
> It would be useful to add support for partition transforms as well, as many 
> tables are partitioned by those transforms.


