[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20029

`addJar` is cross-session.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20029

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85736/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20029

Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20029

**[Test build #85736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85736/testReport)** for PR 20029 at commit [`2b1e166`](https://github.com/apache/spark/commit/2b1e166f4f43b300d272fc7ce1d9d7997f7ae3cd).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20029

> The hiveClient created for the resourceLoader is only used to addJar, which is, in turn, to add Jar to the shared IsolatedClientLoader. Then we can just use the shared hive client for this purpose.

Shouldn't `addJar` be session-based? At least it seems to be in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources

Although it looks like `addJar` in `SessionResourceLoader` for `SessionState` isn't session-based either, so at least the behavior seems consistent.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20029

**[Test build #85736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85736/testReport)** for PR 20029 at commit [`2b1e166`](https://github.com/apache/spark/commit/2b1e166f4f43b300d272fc7ce1d9d7997f7ae3cd).
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20029

ok to test
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20029

lgtm!
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20029

By [this line](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78), yes.
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20029

> The hiveClient created for the resourceLoader is only used to addJar, which is, in turn, to add Jar to the shared IsolatedClientLoader. Then we can just use the shared hive client for this purpose.

@liufengdb does it mean that we are creating more than one SparkSession in the thriftserver?
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20029

@zuotingbing I took a close look at the related code and think the issue you raised is valid:

1. The hiveClient created for the [resourceLoader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45) is only used to addJar, which is, in turn, to add a jar to the shared [`IsolatedClientLoader`](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L817). Then we can just use the shared hive client for this purpose.
2. Another possible reason to use a new hive client is to run [this hive statement](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818). But I think it is just a leftover from old Spark and should be removed. So overall it is fine to use the shared `client` from `HiveExternalCatalog` without creating a new hive client.
3. Currently, there is no way to clean up the resources created by a [new session of SQLContext/SparkSession](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78). I couldn't understand the design tradeoff behind [this](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L716) (@srowen). So it is not easy to remove the temp dirs when a session is closed.
4. To what extent does Spark need these scratch dirs? Could we make this step optional, if it is not used in all deployment modes?
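The leak pattern described in points 1 and 3 above can be illustrated with a small self-contained sketch. `Client`, `LeakySession`, `SharedCatalog`, and `FixedSession` are hypothetical stand-ins for `HiveClient`, the Thrift server session, and `HiveExternalCatalog`; this is not Spark code, only the shape of the bug and the proposed fix:

```scala
// Sketch of the leak: each session lazily creates its own client and
// close() never releases it, so clients accumulate per connection.
object LeakSketch {
  // Counts clients that have been created but not closed.
  var openClients = 0

  class Client {
    openClients += 1
    def close(): Unit = openClients -= 1
  }

  // Leaky variant: a fresh client per session (analogous to
  // externalCatalog.client.newSession() in the resourceLoader).
  class LeakySession {
    lazy val resourceLoaderClient = new Client
    def addJar(path: String): Unit = { resourceLoaderClient; () } // forces creation
    def close(): Unit = () // the client created above is left behind
  }

  // Fixed variant: sessions borrow the single client owned by the catalog,
  // so closing a session leaves nothing extra to clean up.
  class SharedCatalog { val client = new Client }
  class FixedSession(catalog: SharedCatalog) {
    def addJar(path: String): Unit = { catalog.client; () }
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    (1 to 3).foreach { _ =>
      val s = new LeakySession
      s.addJar("a.jar")
      s.close()
    }
    println(s"leaked clients: $openClients") // prints "leaked clients: 3"
  }
}
```

With the shared-catalog variant, any number of sessions can open and close while the open-client count stays constant, which is the behavior this PR aims for.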
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20029

According to your analysis, what is never closed is the client used to interact with the metastore. This might be a problem we are not aware of in normal SQL applications, since we have only one client in those cases. What your fix does is avoid creating a client for each `HiveSessionBuilder`, thus:

1. this would mean that we are creating more than one `SessionBuilder`, i.e. more than one `SparkSession`, which is not true as far as I know.
2. every session would share the same client to connect to the metastore, which is wrong IMHO.

Please let me know if I misunderstood or got something wrong.
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20029

Could you please check this PR? Thanks @liufengdb
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20029

cc @liufengdb
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20029

```scala
override protected lazy val resourceLoader: HiveSessionResourceLoader = {
  val client: HiveClient = externalCatalog.client.newSession()
  new HiveSessionResourceLoader(session, client)
}
```

Is it necessary to create a new HiveClient here, since there is already a hiveClient in externalCatalog? If it is necessary, we need to supply a method to close the hiveClient created here, and in this method we also need to clean up the scratch dirs (`hdfsSessionPath` and `localSessionPath`) which are created by HiveClientImpl.
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20029

It seems that each time we connect to the Thrift server through beeline, `SessionState.start(state)` is called twice: once in `HiveSessionImpl.open`, and once in `HiveClientImpl.newSession()` for `sql("use default")`. When the beeline connection is closed, only the HiveSession is closed via `HiveSessionImpl.close()`, but the object created by `HiveClientImpl.newSession()` is left over.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20029

> we can find the object org.apache.spark.sql.hive.client.HiveClientImpl and org.apache.hadoop.hive.ql.session.SessionState keep increasing

Can you check the GC root and explain why they are increasing? The fix doesn't look correct to me, as we should create a new session.
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20029

I'm asking you to respond to https://github.com/apache/spark/pull/19989#issuecomment-351985114
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20029

Thanks @srowen, so whom could I ping to make sure this change has no side effects?
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20029

This is indeed the primary change in the diff against master. https://github.com/apache/spark/pull/19989 raised some concerns about whether this affects correctness, though.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20029

Can one of the admins verify this patch?