[jira] [Commented] (SPARK-48307) InlineCTE should keep not-inlined relations in the original WithCTE node
[ https://issues.apache.org/jira/browse/SPARK-48307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860739#comment-17860739 ]

ci-cassandra.apache.org commented on SPARK-48307:
-------------------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/47141

> InlineCTE should keep not-inlined relations in the original WithCTE node
> -------------------------------------------------------------------------
>
> Key: SPARK-48307
> URL: https://issues.apache.org/jira/browse/SPARK-48307
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
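For context: a CTE that is referenced more than once is the typical case InlineCTE leaves un-inlined, and the ticket concerns where such kept definitions live in the optimized plan. A minimal sketch of a query exercising that path, assuming a local session (the query is illustrative, not taken from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// `t` is referenced twice, so InlineCTE will generally not inline it; the
// optimized plan should keep its definition in the original WithCTE node.
spark.sql(
  """WITH t AS (SELECT id FROM range(10))
    |SELECT * FROM t JOIN t AS t2 ON t.id = t2.id""".stripMargin
).explain(true)
{code}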
[jira] [Commented] (SPARK-45096) Optimize apt-get install in Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-45096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762527#comment-17762527 ]

ci-cassandra.apache.org commented on SPARK-45096:
-------------------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42842

> Optimize apt-get install in Dockerfile
> ---------------------------------------
>
> Key: SPARK-45096
> URL: https://issues.apache.org/jira/browse/SPARK-45096
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Commented] (SPARK-44540) Remove unused stylesheet and javascript files of jsonFormatter
[ https://issues.apache.org/jira/browse/SPARK-44540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746781#comment-17746781 ]

ci-cassandra.apache.org commented on SPARK-44540:
-------------------------------------------------

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/42145

> Remove unused stylesheet and javascript files of jsonFormatter
> ---------------------------------------------------------------
>
> Key: SPARK-44540
> URL: https://issues.apache.org/jira/browse/SPARK-44540
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.5.0
> Reporter: Kent Yao
> Priority: Major
>
> jsonFormatter.min.css and jsonFormatter.min.js are unused (never referenced).
[jira] [Commented] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746751#comment-17746751 ]

ci-cassandra.apache.org commented on SPARK-44454:
-------------------------------------------------

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/42033

> HiveShim getTablesByType support fallback
> ------------------------------------------
>
> Key: SPARK-44454
> URL: https://issues.apache.org/jira/browse/SPARK-44454
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.1
> Reporter: dzcxzl
> Priority: Minor
>
> When we use a newer Hive client to communicate with an older Hive metastore, we may encounter Invalid method name: 'get_tables_by_type'.
>
> {code:java}
> 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views
> 23/07/17 12:45:24,489 [main] ERROR log: Got exception: org.apache.thrift.TApplicationException Invalid method name: 'get_tables_by_type'
> org.apache.thrift.TApplicationException: Invalid method name: 'get_tables_by_type'
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433)
>   at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>   at com.sun.proxy.$Proxy23.getTables(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344)
>   at com.sun.proxy.$Proxy23.getTables(Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040)
>   at org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407)
> {code}
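The eventual fix lives in HiveShim, but the general shape of the fallback is easy to sketch: try the newer metastore RPC and, when an old metastore rejects the method name, fall back to the legacy listing and filter client-side. The trait below is hypothetical, for illustration only:

{code:scala}
import scala.util.control.NonFatal

// Hypothetical client interface; the real code goes through Hive's
// metastore client classes shown in the stack trace above.
trait MetastoreClient {
  def getTablesByType(db: String, pattern: String, tableType: String): Seq[String]
  def getTables(db: String, pattern: String): Seq[String]
  def getTableType(db: String, table: String): String
}

// Prefer the newer RPC; if the old metastore does not implement it,
// list all tables and filter by type on the client side.
def listViewsWithFallback(client: MetastoreClient, db: String, pattern: String): Seq[String] =
  try {
    client.getTablesByType(db, pattern, "VIRTUAL_VIEW")
  } catch {
    case NonFatal(e) if e.getMessage != null && e.getMessage.contains("Invalid method name") =>
      client.getTables(db, pattern)
        .filter(t => client.getTableType(db, t) == "VIRTUAL_VIEW")
  }
{code}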
[jira] [Commented] (SPARK-44059) Add named argument support for SQL functions
[ https://issues.apache.org/jira/browse/SPARK-44059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742499#comment-17742499 ]

ci-cassandra.apache.org commented on SPARK-44059:
-------------------------------------------------

User 'learningchess2003' has created a pull request for this issue:
https://github.com/apache/spark/pull/41864

> Add named argument support for SQL functions
> ---------------------------------------------
>
> Key: SPARK-44059
> URL: https://issues.apache.org/jira/browse/SPARK-44059
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core, SQL
> Affects Versions: 3.5.0
> Reporter: Richard Yu
> Priority: Major
>
> Today, there is increasing demand for named-argument functions, especially as we continue to introduce longer and longer parameter lists in our SQL functions. In these functions, many arguments can have default values, making it wasteful to require that callers spell out every argument. This is an umbrella ticket to track the smaller subtasks to be completed for implementing this feature.
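Spark 3.5's `mask` function is one early adopter of the `name => value` syntax this umbrella ticket tracks; the example below assumes that function and a local session, and is a sketch of the intended usage rather than a spec:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Only the overridden argument is spelled out; the remaining parameters
// (upperChar, digitChar, otherChar) keep their defaults.
spark.sql("SELECT mask('AbCD123-@$#', lowerChar => 'q')").show()
{code}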
[jira] [Commented] (SPARK-44217) Allow custom precision for fp approx equality
[ https://issues.apache.org/jira/browse/SPARK-44217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742495#comment-17742495 ]

ci-cassandra.apache.org commented on SPARK-44217:
-------------------------------------------------

User 'asl3' has created a pull request for this issue:
https://github.com/apache/spark/pull/41947

> Allow custom precision for fp approx equality
> ----------------------------------------------
>
> Key: SPARK-44217
> URL: https://issues.apache.org/jira/browse/SPARK-44217
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Amanda Liu
> Priority: Major
>
> SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Commented] (SPARK-44395) Update table function arguments to require parentheses around identifier after the TABLE keyword
[ https://issues.apache.org/jira/browse/SPARK-44395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742496#comment-17742496 ]

ci-cassandra.apache.org commented on SPARK-44395:
-------------------------------------------------

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/41965

> Update table function arguments to require parentheses around identifier after the TABLE keyword
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-44395
> URL: https://issues.apache.org/jira/browse/SPARK-44395
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Daniel
> Priority: Major
>
> Per the SQL standard, `TABLE identifier` should actually be passed as `TABLE(identifier)`.
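The shape of the syntax change, shown with hypothetical names (`my_tvf` and `my_table` are placeholders; Spark's table-valued functions are the consumers of TABLE arguments):

{code:scala}
// Illustration only: the before/after surface syntax.
val before = "SELECT * FROM my_tvf(TABLE my_table)"   // previously accepted
val after  = "SELECT * FROM my_tvf(TABLE(my_table))"  // per the SQL standard
{code}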
[jira] [Commented] (SPARK-43995) Implement UDFRegistration
[ https://issues.apache.org/jira/browse/SPARK-43995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742490#comment-17742490 ]

ci-cassandra.apache.org commented on SPARK-43995:
-------------------------------------------------

User 'vicennial' has created a pull request for this issue:
https://github.com/apache/spark/pull/41953

> Implement UDFRegistration
> --------------------------
>
> Key: SPARK-43995
> URL: https://issues.apache.org/jira/browse/SPARK-43995
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
>
> Reference file - [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala]
>
> API to be implemented:
> * {noformat}
> def register(name: String, udf: UserDefinedFunction): UserDefinedFunction{noformat}
> ** [Reference|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L112-L123]
> * {noformat}
> def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction{noformat}
> ** From [0 to 22 arguments|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L125-L642]
> * {noformat}
> def register(name: String, f: UDF0[_], returnType: DataType): Unit{noformat}
> ** From [0 to 22 arguments|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L735-L1076]
>
> We currently do not support UDAFs, so the relevant UDAF APIs may be skipped, as may the python/pyspark-related APIs (in the context of the Scala client).
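A short illustration of the existing sql/core behaviour the Connect Scala client is expected to mirror, assuming a local (non-Connect) session:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Register a Scala closure under a SQL-callable name, then use it in SQL.
spark.udf.register("plusOne", (x: Int) => x + 1)
spark.sql("SELECT plusOne(41)").show() // 42
{code}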
[jira] [Commented] (SPARK-44325) Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
[ https://issues.apache.org/jira/browse/SPARK-44325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740836#comment-17740836 ]

ci-cassandra.apache.org commented on SPARK-44325:
-------------------------------------------------

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/41884

> Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-44325
> URL: https://issues.apache.org/jira/browse/SPARK-44325
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Vinod KC
> Priority: Major
>
> Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
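For readers new to the API: a PartitionEvaluator factors the per-partition compute logic out of the operator so it can be created once per task. A toy evaluator showing the shape of the API (signatures as of Spark 3.5; this is a sketch, not the SortMergeJoinExec implementation):

{code:scala}
import org.apache.spark.{PartitionEvaluator, PartitionEvaluatorFactory}

// A minimal evaluator that doubles its input, standing in for the join's
// per-partition compute logic.
class DoubleEvaluatorFactory extends PartitionEvaluatorFactory[Int, Int] {
  override def createEvaluator(): PartitionEvaluator[Int, Int] =
    new PartitionEvaluator[Int, Int] {
      override def eval(partitionIndex: Int, inputs: Iterator[Int]*): Iterator[Int] =
        inputs.head.map(_ * 2)
    }
}

// Usage on an RDD: rdd.mapPartitionsWithEvaluator(new DoubleEvaluatorFactory)
{code}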
[jira] [Commented] (SPARK-44278) Implement a GRPC server interceptor that cleans up thread local properties
[ https://issues.apache.org/jira/browse/SPARK-44278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739681#comment-17739681 ]

ci-cassandra.apache.org commented on SPARK-44278:
-------------------------------------------------

User 'heyihong' has created a pull request for this issue:
https://github.com/apache/spark/pull/41831

> Implement a GRPC server interceptor that cleans up thread local properties
> ---------------------------------------------------------------------------
>
> Key: SPARK-44278
> URL: https://issues.apache.org/jira/browse/SPARK-44278
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Yihong He
> Priority: Major
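The general shape of such an interceptor in grpc-java, sketched with a hypothetical `RequestThreadLocals` holder (the actual Connect implementation differs):

{code:scala}
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor}

// Hypothetical per-request thread-local state, for illustration.
object RequestThreadLocals {
  private val props = new ThreadLocal[Map[String, String]]
  def clear(): Unit = props.remove()
}

// Clears whatever a previous request may have left on this pooled worker
// thread before the next call is handled.
class CleanupInterceptor extends ServerInterceptor {
  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    RequestThreadLocals.clear()
    next.startCall(call, headers)
  }
}
{code}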
[jira] [Commented] (SPARK-44210) Strengthen type checking and better comply with Connect specifications for `levenshtein` function
[ https://issues.apache.org/jira/browse/SPARK-44210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738211#comment-17738211 ]

ci-cassandra.apache.org commented on SPARK-44210:
-------------------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41724

> Strengthen type checking and better comply with Connect specifications for `levenshtein` function
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-44210
> URL: https://issues.apache.org/jira/browse/SPARK-44210
> Project: Spark
> Issue Type: Improvement
> Components: Connect, SQL
> Affects Versions: 3.5.0
> Reporter: BingKun Pan
> Priority: Minor
[jira] [Commented] (SPARK-41599) Memory leak in FileSystem.CACHE when submitting apps to secure cluster using InProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-41599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738209#comment-17738209 ]

ci-cassandra.apache.org commented on SPARK-41599:
-------------------------------------------------

User 'risyomei' has created a pull request for this issue:
https://github.com/apache/spark/pull/41692

> Memory leak in FileSystem.CACHE when submitting apps to secure cluster using InProcessLauncher
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-41599
> URL: https://issues.apache.org/jira/browse/SPARK-41599
> Project: Spark
> Issue Type: Bug
> Components: Deploy, YARN
> Affects Versions: 3.1.2
> Reporter: Maciej Smolenski
> Priority: Major
> Attachments: InProcLaunchFsIssue.scala, SPARK-41599-fixes-to-limit-FileSystem-CACHE-size-when-using-InProcessLauncher.diff
>
> When submitting a Spark application in a kerberos environment, the credentials of the 'current user' (UserGroupInformation.getCurrentUser()) are modified. FileSystem.CACHE entries contain the 'current user' (with user credentials) as a key.
> Submitting many Spark applications using InProcessLauncher therefore causes FileSystem.CACHE to grow without bound. Eventually the process exits with an OutOfMemory error.
> Code for reproduction is attached.
>
> Output from running 'jmap -histo' on the reproduction JVM shows that the number of FileSystem$Cache$Key instances increases over time:
> time: #instances class
> 1671533274: 2 org.apache.hadoop.fs.FileSystem$Cache$Key
> 167155: 11 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533395: 21 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533455: 30 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533515: 39 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533576: 48 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533636: 57 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533696: 66 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533757: 75 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533817: 84 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533877: 93 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533937: 102 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671533998: 111 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534058: 120 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534118: 135 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534178: 140 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534239: 150 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534299: 159 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534359: 168 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534419: 177 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534480: 186 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534540: 195 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534600: 204 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534661: 213 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534721: 222 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534781: 231 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534841: 240 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534902: 249 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671534962: 257 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535022: 264 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535083: 273 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535143: 282 org.apache.hadoop.fs.FileSystem$Cache$Key
> 1671535203: 291 org.apache.hadoop.fs.FileSystem$Cache$Key
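One common workaround while such a fix is pending: bypass the cache for the scheme in question and close the instance yourself, so nothing accumulates under per-UGI keys. A sketch using standard Hadoop configuration (this is a mitigation pattern, not the fix in the attached diff):

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()
// Standard Hadoop switch: fs.<scheme>.impl.disable.cache makes
// FileSystem.get return a fresh, uncached instance for that scheme.
conf.setBoolean("fs.hdfs.impl.disable.cache", true)

val fs = FileSystem.get(URI.create("hdfs:///"), conf)
try {
  // ... perform the submission work that needs the filesystem ...
} finally {
  fs.close() // safe to close: this instance is not shared via the cache
}
{code}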
[jira] [Commented] (SPARK-44145) Callback prior to execution
[ https://issues.apache.org/jira/browse/SPARK-44145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738210#comment-17738210 ]

ci-cassandra.apache.org commented on SPARK-44145:
-------------------------------------------------

User 'jdesjean' has created a pull request for this issue:
https://github.com/apache/spark/pull/41748

> Callback prior to execution
> ----------------------------
>
> Key: SPARK-44145
> URL: https://issues.apache.org/jira/browse/SPARK-44145
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Jean-Francois Desjeans Gauthier
> Priority: Major
>
> Commands are eagerly executed after the analysis phase, while other queries are executed after planning. Users of Spark need to understand the time spent prior to execution, and currently they must understand the difference between these two modes. Add a callback, invoked after query planning is completed, that can be used for this purpose.
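For contrast, the hook that exists today fires only after execution; registration with the real QueryExecutionListener API is shown below, while the pre-execution callback the ticket asks for would be a sibling of these methods:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  // Fires only after a query finishes, so planning time is observable
  // only indirectly; the proposed callback would fire once planning completes.
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName finished in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})
{code}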
[jira] [Commented] (SPARK-44128) Upgrade netty from 4.1.92 to 4.1.93
[ https://issues.apache.org/jira/browse/SPARK-44128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738065#comment-17738065 ]

ci-cassandra.apache.org commented on SPARK-44128:
-------------------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41681

> Upgrade netty from 4.1.92 to 4.1.93
> ------------------------------------
>
> Key: SPARK-44128
> URL: https://issues.apache.org/jira/browse/SPARK-44128
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.5.0
> Reporter: BingKun Pan
> Priority: Minor
[jira] [Commented] (SPARK-43470) Add operating system, Java, Python version information to application log
[ https://issues.apache.org/jira/browse/SPARK-43470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737312#comment-17737312 ]

ci-cassandra.apache.org commented on SPARK-43470:
-------------------------------------------------

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/41144

> Add operating system, Java, Python version information to application log
> ---------------------------------------------------------------------------
>
> Key: SPARK-43470
> URL: https://issues.apache.org/jira/browse/SPARK-43470
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Vinod KC
> Priority: Minor
>
> Include the operating system, Java, and Python version information in the application log. This will provide useful context and aid in troubleshooting and debugging any issues that may arise, particularly when Spark runs across heterogeneous environments (systems with varying operating systems and Java versions).
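The JVM already exposes everything needed for the OS and Java lines; a sketch of the kind of one-time log line the ticket describes, using plain system properties:

{code:scala}
// Standard JVM system properties; the Python version would come from the
// configured worker, which is environment-specific and omitted here.
val os   = s"${sys.props("os.name")} ${sys.props("os.version")} (${sys.props("os.arch")})"
val java = s"Java ${sys.props("java.version")} (${sys.props("java.vendor")})"
println(s"Running Spark on $os, $java")
{code}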
[jira] [Commented] (SPARK-43523) Memory leak in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729480#comment-17729480 ]

ci-cassandra.apache.org commented on SPARK-43523:
-------------------------------------------------

User 'aminebag' has created a pull request for this issue:
https://github.com/apache/spark/pull/41423

> Memory leak in Spark UI
> ------------------------
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.4.4, 3.4.0
> Reporter: Amine Bagdouri
> Priority: Major
> Attachments: spark_shell_oom.log, spark_ui_memory_leak.zip
>
> We have a distributed Spark application running on Azure HDInsight using Spark version 2.4.4.
> After a few days of active processing, we noticed that the GC CPU time ratio of the driver was close to 100%. We suspected a memory leak, so we produced a heap dump and analyzed it using Eclipse Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
> * The estimated retained heap size of String objects (~5M instances) is 3.3 GB. It seems that most of these instances correspond to Spark events.
> * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
> * The number of LiveJob objects with status "RUNNING" is 18K, knowing that there shouldn't be more than 16 live running jobs since we use a fixed-size thread pool of 16 threads to run Spark queries.
> * The number of LiveTask objects is 485K.
> * The AsyncEventQueue instance associated with the AppStatusListener has a value of 854 for the dropped events count and a value of 10001 for the total events count, knowing that the dropped events counter is reset every minute and that the queue's default capacity is 10000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the root cause of this leak:
> * AppStatusListener is notified of Spark events using a bounded queue in AsyncEventQueue.
> * AppStatusListener updates its state (kvstore, liveTasks, liveStages, liveJobs, ...) based on the received events. For example, onTaskStart adds a task to the liveTasks map and onTaskEnd removes the task from the liveTasks map.
> * When the rate of events is very high, the bounded queue in AsyncEventQueue fills up, some events are dropped, and they don't make it to AppStatusListener.
> * Dropped events that signal the end of a processing unit prevent the state of AppStatusListener from being cleaned. For example, a dropped onTaskEnd event will prevent the task from being removed from the liveTasks map, and the task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After launching many Spark queries with this config, we observed that the number of active jobs in Spark UI increased rapidly and remained high even though all submitted queries had completed. We also noticed that some executor task counters in Spark UI were negative, which confirms that the AppStatusListener state does not accurately reflect reality and that it can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest enforcing another configurable limit on the number of total objects in AppStatusListener's maps and kvstore. This should limit the leak in the case of high event rates, but AppStatusListener stats will remain inaccurate.
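The suggested fix amounts to capping the live-object maps. The simplest illustration of such a cap is an access-ordered LinkedHashMap that evicts its eldest entry; the real AppStatusListener would need eviction that also repairs its counters, so treat this as a sketch of the idea only:

{code:scala}
import java.util.{LinkedHashMap => JLinkedHashMap}

// A hard cap on entries so that dropped "end" events cannot grow the map
// without bound. Hypothetical, not Spark code.
class CappedMap[K, V](maxEntries: Int)
    extends JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
  override protected def removeEldestEntry(
      eldest: java.util.Map.Entry[K, V]): Boolean = size() > maxEntries
}

object CappedMapDemo extends App {
  val liveTasks = new CappedMap[Long, String](maxEntries = 2)
  liveTasks.put(1L, "task-1")
  liveTasks.put(2L, "task-2")
  liveTasks.put(3L, "task-3")
  println(liveTasks) // task-1 has been evicted: {2=task-2, 3=task-3}
}
{code}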
[jira] [Commented] (SPARK-43376) Improve reuse subquery with table cache
[ https://issues.apache.org/jira/browse/SPARK-43376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729336#comment-17729336 ]

ci-cassandra.apache.org commented on SPARK-43376:
-------------------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/41454

> Improve reuse subquery with table cache
> ----------------------------------------
>
> Key: SPARK-43376
> URL: https://issues.apache.org/jira/browse/SPARK-43376
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.5.0
>
> AQE cannot reuse a subquery if it is pushed into InMemoryTableScan.
[jira] [Commented] (SPARK-43783) Enable FeatureTests.test_standard_scaler for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729335#comment-17729335 ]

ci-cassandra.apache.org commented on SPARK-43783:
-------------------------------------------------

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/41456

> Enable FeatureTests.test_standard_scaler for pandas 2.0.0.
> -----------------------------------------------------------
>
> Key: SPARK-43783
> URL: https://issues.apache.org/jira/browse/SPARK-43783
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 3.5.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Fix `FeatureTests.test_standard_scaler` in `python/pyspark/mlv2/tests/test_feature.py`
[jira] [Commented] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720780#comment-17720780 ]

ci-cassandra.apache.org commented on SPARK-43413:
-------------------------------------------------

User 'jchen5' has created a pull request for this issue:
https://github.com/apache/spark/pull/41094

> IN subquery ListQuery has wrong nullability
> --------------------------------------------
>
> Key: SPARK-43413
> URL: https://issues.apache.org/jira/browse/SPARK-43413
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Jack Chen
> Priority: Major
>
> IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand side, but the right-hand side of an IN subquery, the ListQuery, is always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN expression returns NULL when nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, rewriting the expression to non_nullable_col IN (select nullable_col). This is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases it doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and that rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed.
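A quick way to see why the ListQuery side must be nullable, runnable in spark-shell: an IN predicate over a NULL-only list evaluates to NULL, not FALSE, so both the predicate and its negation filter out every row:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The predicate is NULL (unknown), not TRUE, so no rows survive the filter.
spark.sql(
  "SELECT * FROM VALUES (1) AS t(c) " +
  "WHERE c IN (SELECT col FROM VALUES (CAST(NULL AS INT)) AS s(col))").show()

// NOT NULL is still NULL, so this also returns no rows; if the IN expression
// really were non-nullable (i.e. FALSE here), NOT would flip it to TRUE and
// the row would come back.
spark.sql(
  "SELECT * FROM VALUES (1) AS t(c) " +
  "WHERE NOT c IN (SELECT col FROM VALUES (CAST(NULL AS INT)) AS s(col))").show()
{code}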
[jira] [Commented] (SPARK-43284) _metadata.file_path regression
[ https://issues.apache.org/jira/browse/SPARK-43284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719609#comment-17719609 ]

ci-cassandra.apache.org commented on SPARK-43284:
-------------------------------------------------

User 'databricks-david-lewis' has created a pull request for this issue:
https://github.com/apache/spark/pull/40947

> _metadata.file_path regression
> -------------------------------
>
> Key: SPARK-43284
> URL: https://issues.apache.org/jira/browse/SPARK-43284
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: David Lewis
> Assignee: David Lewis
> Priority: Major
> Fix For: 3.4.1, 3.5.0
>
> As part of the [SparkPath refactor|https://issues.apache.org/jira/browse/SPARK-41970] the behavior of `_metadata.file_path` was inadvertently changed. In Spark 3.4+ it now returns a non-encoded path string, as opposed to a URL-encoded path string.
> This ticket is to fix that regression.
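A small way to observe the column in question; the directory name below contains a space so the encoded and non-encoded forms differ. A sketch assuming a local session and a scratch path:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val dir = "/tmp/metadata demo" // the space makes the encoding difference visible
spark.range(3).write.mode("overwrite").parquet(dir)

// _metadata is the hidden file-source metadata column; the regression was in
// whether file_path comes back URL-encoded ("metadata%20demo") or not.
spark.read.parquet(dir)
  .select("id", "_metadata.file_path")
  .show(truncate = false)
{code}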
[jira] [Commented] (SPARK-42999) Impl Dataset#foreach, foreachPartitions
[ https://issues.apache.org/jira/browse/SPARK-42999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709519#comment-17709519 ]

ci-cassandra.apache.org commented on SPARK-42999:
-------------------------------------------------

User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/40628

> Impl Dataset#foreach, foreachPartitions
> ----------------------------------------
>
> Key: SPARK-42999
> URL: https://issues.apache.org/jira/browse/SPARK-42999
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Zhen Li
> Priority: Major
>
> Implement the missing methods in the Scala client Dataset API.
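For reference, the sql/core semantics the Connect client needs to mirror; both methods run the given function on executors for their side effects (explicit parameter types avoid overload ambiguity with the Java-friendly variants):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val ds = spark.range(5)
ds.foreach((n: java.lang.Long) => println(s"row: $n"))
ds.foreachPartition((rows: Iterator[java.lang.Long]) =>
  println(s"partition size: ${rows.size}"))
{code}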
[jira] [Commented] (SPARK-43019) Move Ordering to PhysicalDataType
[ https://issues.apache.org/jira/browse/SPARK-43019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709209#comment-17709209 ]

ci-cassandra.apache.org commented on SPARK-43019:
-------------------------------------------------

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/40651

> Move Ordering to PhysicalDataType
> ----------------------------------
>
> Key: SPARK-43019
> URL: https://issues.apache.org/jira/browse/SPARK-43019
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
[jira] [Commented] (SPARK-43041) Restore constructors of exceptions for compatibility in connector API
[ https://issues.apache.org/jira/browse/SPARK-43041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709208#comment-17709208 ]

ci-cassandra.apache.org commented on SPARK-43041:
-------------------------------------------------

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/40679

> Restore constructors of exceptions for compatibility in connector API
> ----------------------------------------------------------------------
>
> Key: SPARK-43041
> URL: https://issues.apache.org/jira/browse/SPARK-43041
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Anton Okolnychyi
> Priority: Blocker
> Fix For: 3.4.0
>
> Thanks [~aokolnychyi] for raising the issue as shown below:
> {quote}
> I have a question about changes to exceptions used in the public connector API, such as NoSuchTableException and TableAlreadyExistsException.
> I consider those as part of the public Catalog API (TableCatalog uses them in method definitions). However, it looks like PR #37887 has changed them in an incompatible way. Old constructors accepting Identifier objects got removed. The only way to construct such exceptions is either by passing database and table strings or a Scala Seq. Shall we add back the old constructors to avoid breaking connectors?
> {quote}
> We should restore the constructors of those exceptions to preserve compatibility in the connector API.
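The compatibility surface at issue, sketched under the assumption that the restored constructor takes a connector Identifier as it did before 3.4 (check the actual signature against the PR):

{code:scala}
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
import org.apache.spark.sql.connector.catalog.Identifier

val ident = Identifier.of(Array("db"), "tbl")

// Connector code written against older Sparks constructs the exception
// directly from the Identifier instead of unpacking it into strings.
val e = new NoSuchTableException(ident)
{code}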
[jira] [Commented] (SPARK-41628) Support async query execution
[ https://issues.apache.org/jira/browse/SPARK-41628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708024#comment-17708024 ]

ci-cassandra.apache.org commented on SPARK-41628:
-------------------------------------------------

User 'Hisoka-X' has created a pull request for this issue:
https://github.com/apache/spark/pull/40649

> Support async query execution
> ------------------------------
>
> Key: SPARK-41628
> URL: https://issues.apache.org/jira/browse/SPARK-41628
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Priority: Major
>
> Today, query execution is completely synchronous; add an asynchronous API that allows clients to submit a query and poll for the result.
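A generic submit-then-poll shape, sketched with a plain Future; the actual Connect design works at the protocol level (operation handles over gRPC), so everything named below is a hypothetical illustration:

{code:scala}
import java.util.UUID
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

final case class QueryHandle(id: String, result: Future[Seq[String]])

// Submit starts the work and returns immediately with a handle.
def submit(sql: String)(run: String => Seq[String]): QueryHandle =
  QueryHandle(UUID.randomUUID().toString, Future(run(sql)))

// Poll is non-blocking: None while running, Some(rows) once finished.
def poll(handle: QueryHandle): Option[Seq[String]] =
  handle.result.value.flatMap(_.toOption)

val h = submit("SELECT 1")(_ => Seq("1"))
println(poll(h)) // likely None at first, Some(List(1)) after completion
{code}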
[jira] [Commented] (SPARK-43011) array_insert should fail with 0 index
[ https://issues.apache.org/jira/browse/SPARK-43011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707932#comment-17707932 ]

ci-cassandra.apache.org commented on SPARK-43011:
-------------------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40641

> array_insert should fail with 0 index
> --------------------------------------
>
> Key: SPARK-43011
> URL: https://issues.apache.org/jira/browse/SPARK-43011
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
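A one-line illustration, runnable in spark-shell; after this change the call below is expected to raise an error rather than silently picking an interpretation:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Spark SQL arrays are 1-based, so index 0 is ambiguous between "before the
// first element" and "at the first element"; this call should now fail.
spark.sql("SELECT array_insert(array(1, 2, 3), 0, 99)").show()
{code}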