[jira] [Created] (SPARK-48578) Add new expressions for UTF8 string validation
Uroš Bojanić created SPARK-48578:

Summary: Add new expressions for UTF8 string validation
Key: SPARK-48578
URL: https://issues.apache.org/jira/browse/SPARK-48578
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
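The ticket is only a stub, but the intended semantics — checking whether a string's bytes form well-formed UTF-8 — can be sketched in a few lines of Python. This is a minimal illustration of validation semantics only; the name `is_valid_utf8` is an assumption made for the sketch, not an expression name confirmed by this ticket.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert is_valid_utf8("déjà vu".encode("utf-8"))
assert not is_valid_utf8(b"\xc3\x28")  # 0xC3 needs a continuation byte
assert not is_valid_utf8(b"\xff")      # 0xFF can never appear in UTF-8
```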
[jira] [Created] (SPARK-48577) Replace invalid byte sequences in UTF8Strings
Uroš Bojanić created SPARK-48577:

Summary: Replace invalid byte sequences in UTF8Strings
Key: SPARK-48577
URL: https://issues.apache.org/jira/browse/SPARK-48577
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
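The standard Unicode convention is to substitute each malformed byte sequence with the replacement character U+FFFD, which Python's decoder demonstrates directly. Whether Spark's implementation follows exactly this convention is not stated in the ticket; the sketch below only shows the conventional behavior.

```python
# A malformed byte (0xFF) embedded in otherwise valid UTF-8.
raw = b"val\xffid"

# Conventional substitution: each malformed sequence becomes U+FFFD.
fixed = raw.decode("utf-8", errors="replace")
print(fixed)              # 'val\ufffdid'
assert "\ufffd" in fixed
```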
[jira] [Created] (SPARK-48576) Rename UTF8_BINARY_LCASE to UTF8_LCASE
Uroš Bojanić created SPARK-48576:

Summary: Rename UTF8_BINARY_LCASE to UTF8_LCASE
Key: SPARK-48576
URL: https://issues.apache.org/jira/browse/SPARK-48576
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić
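A minimal sketch of how the renamed collation would appear to users, assuming the `COLLATE` syntax from Spark's collation work applies to the new name; the exact syntax and availability depend on the Spark 4.0 build.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collation-demo").getOrCreate()

# Under the lowercase-insensitive collation, 'Spark' and 'SPARK' compare equal.
spark.sql(
    "SELECT 'Spark' COLLATE UTF8_LCASE = 'SPARK' COLLATE UTF8_LCASE AS eq"
).show()
# +----+
# |  eq|
# +----+
# |true|
# +----+
```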
[jira] [Resolved] (SPARK-48560) Make StreamingQueryListener.spark settable
Hyukjin Kwon resolved SPARK-48560.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46909
[https://github.com/apache/spark/pull/46909]

> Make StreamingQueryListener.spark settable
>
> Key: SPARK-48560
> URL: https://issues.apache.org/jira/browse/SPARK-48560
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Downstream users might already implement StreamingQueryListener.spark.
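The compatibility concern is the general Python property pattern: if a base class exposes `spark` as a read-only property, any downstream subclass that assigns `self.spark = ...` breaks with an AttributeError. The sketch below illustrates that pattern in isolation; it is not the actual PySpark source, and the class body is simplified for the example.

```python
class StreamingQueryListener:
    @property
    def spark(self):
        """The SparkSession associated with this listener."""
        return self._spark

    @spark.setter
    def spark(self, session) -> None:
        # Without this setter, `self.spark = ...` in a subclass would
        # raise AttributeError: can't set attribute.
        self._spark = session


class MyListener(StreamingQueryListener):
    def __init__(self, session):
        self.spark = session  # works only because the setter exists


listener = MyListener("session-placeholder")
assert listener.spark == "session-placeholder"
```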
[jira] [Updated] (SPARK-48575) spark.history.fs.update.interval causes too many directory polls when the Spark log dir contains many event-log apps
Arnaud Nauwynck updated SPARK-48575:
Attachment: EventLogFileReaders.patch

> spark.history.fs.update.interval causes too many directory polls when the
> Spark log dir contains many event-log apps
>
> Key: SPARK-48575
> URL: https://issues.apache.org/jira/browse/SPARK-48575
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.3, 4.0.0, 3.5.1, 3.4.3
> Reporter: Arnaud Nauwynck
> Priority: Critical
> Attachments: EventLogFileReaders.patch
>
> When the Spark log dir contains many event-log sub-dirs (say 1000), a
> supposedly idle Spark History server issues millions of directory listing
> calls each hour.
> See the code:
> [FsHistoryProvider.scala#L283|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L283]
> Example: with ~1000 apps, every 10 seconds (the default of
> "spark.history.fs.update.interval") the Spark History server performs:
> - 1x VirtualFileSystem.listStatus(path) with path = the Spark log dir
> - then 2x for each appSubDirPath (one sub-dir per Spark app's event logs)
>   => 2 x 1000 x VirtualFileSystem.listStatus(appSubDirPath)
> On a cloud provider (e.g. Azure), this costs a lot per month, because
> filesystem list calls are priced at ~$0.065 per 10,000 ops for the "Hot"
> tier or $0.0228 per 10,000 ops for "Premium" (cf.
> https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/).
> Let's do the multiplication (checked in the sketch after this message):
> 30 (days/month) * 86400 (sec/day) / 10 (interval sec) = 259,200 updates
> ... * 2001 (listing ops per update) = ~519 million listing calls per month
> ... * $0.0228 / 10,000 = ~$1,183/month
> Admittedly, the retention conf "spark.history.fs.cleaner.maxAge" (default
> 7d) is too long for workflows that run many short Spark apps, and it could
> be reduced. Still, it is extremely important to reduce these recurring
> costs. Here are several wishes:
> 1/ Fix the "bug" in Spark History that calls
> VirtualFileSystem.listStatus(appSubDirPath) twice.
> Cf. the source code: the [first call,
> EventLogFileReaders.scala#L123|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala#L123],
> only tests that the dir contains a child file with name prefix "eventLog"
> and an appStatus file; the listing is then discarded. It creates an
> instance of RollingEventLogFilesFileReader, and shortly after, the listing
> is performed again: cf. the [second call (lazy field),
> EventLogFileReaders.scala#L224|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala#L224].
> The lazy field "files" is evaluated immediately after object creation,
> from [EventLogFileReaders.scala#L252|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala#L252],
> called from [FsHistoryProvider.scala#L506|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L506].
> It is easily possible to perform only one listing per sub-dir (cf. the
> attached patch, changing ~5 lines of code). This would divide the cost
> by 2.
> 2/ In addition to the conf "spark.history.fs.cleaner.maxAge", add another
> conf param "spark.history.fs.cleaner.maxCount" to limit the number of
> Spark apps, defaulting to ~50. This would additionally divide the cost by
> 10 (in case you have 1000 apps).
> 3/ Change the Spark History code to check lazily for updates, only on
> demand when someone clicks in the Spark History web UI. For example, if
> the last cached update is newer than "spark.history.fs.update.interval",
> no update is needed; otherwise the update is performed immediately and
> cached before returning the response.
> 4/ Change the Spark History code to avoid listing each app sub-dir. It is
> possible to perform a single listing on the top-level Spark log dir to
> discover new apps. Most app sub-dirs are already finished, and already
> recompacted by Spark History itself; this info is already stored in the
> Spark History KVStore db. Almost all of the sub-dir listings can
> therefore be avoided entirely.
> See the [KVStore declaration|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L141],
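The monthly cost estimate above is easy to double-check. A minimal Python sketch using the report's figures (1000 apps, a 10-second interval, 2001 listings per update) and the assumed Azure "Premium" price of $0.0228 per 10,000 list operations:

```python
apps = 1000                          # event-log sub-dirs in the Spark log dir
interval_s = 10                      # spark.history.fs.update.interval default
listings_per_update = 1 + 2 * apps   # 1 top-level + 2 per app sub-dir = 2001

updates_per_month = 30 * 86400 // interval_s         # 259,200
listings_per_month = updates_per_month * listings_per_update
cost_usd = listings_per_month / 10_000 * 0.0228      # "Premium" tier price

print(f"{updates_per_month:,} updates -> {listings_per_month:,} listings")
print(f"~${cost_usd:,.0f}/month")
# 259,200 updates -> 518,659,200 listings
# ~$1,183/month
```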
[jira] [Updated] (SPARK-48575) spark.history.fs.update.interval causes too many directory polls when the Spark log dir contains many event-log apps
Arnaud Nauwynck updated SPARK-48575:
Attachment: (was: EventLogFileReaders.patch)
[jira] [Assigned] (SPARK-48012) SPJ: Support Transform Expressions for One Side Shuffle
Chao Sun reassigned SPARK-48012:
Assignee: Szehon Ho

> SPJ: Support Transform Expressions for One Side Shuffle
>
> Key: SPARK-48012
> URL: https://issues.apache.org/jira/browse/SPARK-48012
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.3
> Reporter: Szehon Ho
> Assignee: Szehon Ho
> Priority: Major
> Labels: pull-request-available
>
> SPARK-41471 allowed Spark to shuffle just one side and still conduct SPJ
> if the other side is KeyGroupedPartitioning. However, the support covered
> only a KeyGroupedPartitioning without any partition transform (day, year,
> bucket). It would be useful to support partition transforms as well, since
> many tables are partitioned by those transforms.
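A sketch of the scenario this ticket targets. The table names are hypothetical, and both config names are assumptions drawn from the earlier SPJ work this ticket extends; check the Spark 4.0 docs for the exact names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spj-demo").getOrCreate()

# Assumed config names from the storage-partitioned-join (SPJ) work.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
spark.conf.set("spark.sql.sources.v2.bucketing.shuffle.enabled", "true")

# Hypothetical V2 tables: `facts` is reported by its data source as
# key-grouped by a transform such as bucket(8, id); `updates` is not
# key-grouped, so it is the only side that needs a shuffle.
spark.sql("""
    SELECT f.id, u.payload
    FROM catalog.db.facts f
    JOIN catalog.db.updates u ON f.id = u.id
""").explain()
# With transform support, the plan shuffles only `updates` into matching
# bucket(8, id) key-grouped partitions instead of shuffling both sides.
```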
[jira] [Resolved] (SPARK-48012) SPJ: Support Transform Expressions for One Side Shuffle
Chao Sun resolved SPARK-48012.
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46255
[https://github.com/apache/spark/pull/46255]