[GitHub] spark issue #20823: [SPARK-23674] Add Spark ML Listener for Tracking ML Pipe...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/20823 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22550: [SPARK-25501] Kafka delegation token support
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/22550 Closing this one, since another PR is working on it.
[GitHub] spark pull request #22550: [SPARK-25501] Kafka delegation token support
Github user merlintang closed the pull request at: https://github.com/apache/spark/pull/22550
[GitHub] spark issue #22598: [SPARK-25501][SS] Add kafka delegation token support.
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/22598 @gaborgsomogyi thanks for your PR; I am going through the details and testing on my local machine.
[GitHub] spark pull request #22550: [SPARK-25501] Kafka delegation token support
GitHub user merlintang opened a pull request: https://github.com/apache/spark/pull/22550 [SPARK-25501] Kafka delegation token support
## What changes were proposed in this pull request?
Kafka is adding delegation token support, and Spark needs to obtain Kafka delegation tokens the same way it does for Hive, HDFS, and HBase.
## How was this patch tested?
Manually checked.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/merlintang/spark kafka-dtoken Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22550.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22550 commit c59ea5eaffc9889074226cf96a0e704672cdb290 Author: Mingjie Tang Date: 2018-09-25T21:20:30Z [RMP-11860][SPARK-25501] Kafka Delegation Token Support for Spark commit 7202ff968fa9a330e112a4958e38fd7f36e53341 Author: Mingjie Tang Date: 2018-09-26T00:31:35Z update with kafka config
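For context on the mechanism the PR description refers to: a Kafka delegation token (KIP-48) is obtained by an authenticated client and then used for SASL/SCRAM authentication. A hedged sketch of the client-side configuration, based on Kafka's documented ScramLoginModule options; the placeholder token values are illustrative, and this is not what the PR itself implements:

```properties
# Hypothetical client config using a previously obtained delegation token.
# tokenauth=true tells the SCRAM login module to authenticate with the
# token ID and HMAC instead of a user name and password.
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="<token-id>" \
  password="<token-hmac>" \
  tokenauth=true;
```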
[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/21455 @gabor. These fields are important for us to understand the Spark Kafka streaming data, such as the topic name; we can use this information to track system status. On Tue, Jun 26, 2018 at 4:52 AM Gabor Somogyi wrote: > Why is it required at all? Making things visible without proper reason is not a good idea.
[GitHub] spark issue #20823: [SPARK-23674] Add Spark ML Listener for Tracking ML Pipe...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/20823 @holdenk can you look at this PR? thanks in advance.
[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/21455 @jerryshao Actually, we cannot use reflection to get this field information.
[GitHub] spark pull request #21504: [SPARK-24479][SS] Added config for registering st...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/21504#discussion_r193911087
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala ---
@@ -55,6 +56,11 @@ class StreamingQueryManager private[sql] (sparkSession: SparkSession) extends Lo @GuardedBy("awaitTerminationLock") private var lastTerminatedQuery: StreamingQuery = null
+  sparkSession.sparkContext.conf.get(STREAMING_QUERY_LISTENERS).foreach { classNames =>
+    Utils.loadExtensions(classOf[StreamingQueryListener], classNames,
+      sparkSession.sparkContext.conf).foreach(addListener)
+  }
+
--- End diff --
Two comments here: 1. We need to log the registration here. 2. We need to wrap this in a try/catch; registration can fail, and an uncaught failure would break the job.
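The two review points can be sketched as follows. This is a minimal, self-contained stand-in, not Spark's actual StreamingQueryManager code; `factories` is a hypothetical substitute for `Utils.loadExtensions` so the log-and-catch pattern is visible on its own:

```scala
trait QueryListener

class QuietListener extends QueryListener

class NoisyListener extends QueryListener {
  // Simulates a listener whose construction fails at registration time.
  throw new IllegalStateException("listener constructor failed")
}

object ListenerRegistration {
  // Hypothetical stand-in for reflective loading via Utils.loadExtensions.
  val factories: Map[String, () => QueryListener] = Map(
    "quiet" -> (() => new QuietListener),
    "noisy" -> (() => new NoisyListener)
  )

  var registered: List[QueryListener] = Nil
  var log: List[String] = Nil

  def register(names: Seq[String]): Unit = names.foreach { name =>
    try {
      val listener = factories(name)()
      registered ::= listener
      log ::= s"Registered streaming query listener: $name"  // point 1: log it
    } catch {
      case e: Exception =>  // point 2: one bad listener must not break the job
        log ::= s"Failed to register listener $name, skipping: ${e.getMessage}"
    }
  }
}
```

Calling `ListenerRegistration.register(Seq("quiet", "noisy"))` registers the first listener and records a failure line for the second instead of propagating the exception.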
[GitHub] spark issue #20823: [SPARK-23674] Add Spark ML Listener for Tracking ML Pipe...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/20823 @jmwdpk can you update this PR? There is a conflict. I have updated it here: https://github.com/merlintang/spark/commits/SPARK-23674
[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/21455 @jerryshao can you review this minor update?
[GitHub] spark pull request #21455: [SPARK-24093][DStream][Minor]Make some fields of ...
GitHub user merlintang opened a pull request: https://github.com/apache/spark/pull/21455 [SPARK-24093][DStream][Minor] Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
## What changes were proposed in this pull request?
This PR makes the relevant fields of KafkaStreamWriter and InternalRowMicroBatchWriter visible outside of the classes.
## How was this patch tested?
Manual tests.
Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/merlintang/spark Spark-24093 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21455.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21455 commit 6233528063996dabe780d5b04f874f22846e40d4 Author: Mingjie Tang Date: 2018-05-29T19:49:17Z [SPARK-24093][DStream][Minor] Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @jerryshao can you backport this to branch 2.2 as well? Thanks.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @jerryshao and @steveloughran thanks for your comments and review.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @steveloughran can you review the added system test cases?
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 My local test passes. I will set up a system test and update this soon; sorry about the delay. On Tue, Jan 2, 2018 at 3:42 PM, Marcelo Vanzin wrote: > Any updates?
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 Sorry for the delay on the testing function; I will update it soon. On Thu, Dec 14, 2017 at 12:55 PM, UCB AMPLab wrote: > Can one of the admins verify this patch?
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 I have added a test case for the URI comparison based on Steve's comments. I have tested this in my local VM and it passes. Meanwhile, for hdfs://namenode1/path1 and hdfs://namenode1:8020/path2, the default HDFS port can be resolved, so they also match. Below is the test case:

test("compare URI for filesystem") {
  // case 1
  var srcUri = new URI("file:///file1")
  var dstUri = new URI("file:///file2")
  assert(Client.compareUri(srcUri, dstUri) == true)
  // case 2
  srcUri = new URI("file:///c:file1")
  dstUri = new URI("file://c:file2")
  assert(Client.compareUri(srcUri, dstUri) == true)
  // case 3
  srcUri = new URI("file://host/file1")
  dstUri = new URI("file://host/file2")
  assert(Client.compareUri(srcUri, dstUri) == true)
  // case 4
  srcUri = new URI("wasb://bucket1@user")
  dstUri = new URI("wasb://bucket1@user/")
  assert(Client.compareUri(srcUri, dstUri) == true)
  // case 5
  srcUri = new URI("hdfs:/path1")
  dstUri = new URI("hdfs:/path2")
  assert(Client.compareUri(srcUri, dstUri) == true)
  // case 6
  srcUri = new URI("file:///file1")
  dstUri = new URI("file://host/file2")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 7
  srcUri = new URI("file://host/file1")
  dstUri = new URI("file:///file2")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 8
  srcUri = new URI("file://host/file1")
  dstUri = new URI("file://host2/file2")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 9
  srcUri = new URI("wasb://bucket1@user")
  dstUri = new URI("wasb://bucket2@user/")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 10
  srcUri = new URI("wasb://bucket1@user")
  dstUri = new URI("wasb://bucket1@user2/")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 11
  srcUri = new URI("s3a://user@pass:bucket1/")
  dstUri = new URI("s3a://user2@pass2:bucket1/")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 12
  srcUri = new URI("hdfs://namenode1/path1")
  dstUri = new URI("hdfs://namenode1:8080/path2")
  assert(Client.compareUri(srcUri, dstUri) == false)
  // case 13
  srcUri = new URI("hdfs://namenode1:8020/path1")
  dstUri = new URI("hdfs://namenode1:8080/path2")
  assert(Client.compareUri(srcUri, dstUri) == false)
}
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @jerryshao yes, hdfs://us...@nn1.com:8020 and hdfs://us...@nn1.com:8020 would be considered two different filesystems, since the authority information should be taken into consideration. That is why we need to compare the authority when checking whether two filesystems are the same.
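The comparison under discussion can be sketched as below. `sameFileSystem` is a hypothetical stand-in for `Client.compareUri`, reduced to just the scheme and authority checks; the real method also resolves default ports and other corner cases, so URIs that differ only in an omitted default port are treated as different here.

```scala
import java.net.URI

object FsCompare {
  def sameFileSystem(src: URI, dst: URI): Boolean = {
    // Schemes are compared case-insensitively; a missing scheme counts as "".
    val sameScheme = Option(src.getScheme).getOrElse("")
      .equalsIgnoreCase(Option(dst.getScheme).getOrElse(""))
    // The authority carries user info, host, and port, so a differing user
    // or port means a different filesystem.
    val sameAuthority = (src.getAuthority, dst.getAuthority) match {
      case (null, null)                     => true
      case (a, b) if a == null || b == null => false
      case (a, b)                           => a.equalsIgnoreCase(b)
    }
    sameScheme && sameAuthority
  }
}
```

For example, `hdfs://user1@nn1:8020/p` and `hdfs://user2@nn1:8020/p` share a scheme but not an authority, so they are rejected.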
[GitHub] spark pull request #19885: [SPARK-22587] Spark job fails if fs.defaultFS and...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/19885#discussion_r154827513
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -1428,6 +1428,12 @@ private object Client extends Logging { return false }
+val srcAuthority = srcUri.getAuthority()
+val detAuthority = dstUri.getAuthority()
+if (srcAuthority != detAuthority || (srcAuthority != null && !srcAuthority.equalsIgnoreCase(detAuthority))) {
--- End diff --
Thanks all, I will update the PR soon.
[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/19885 @jerryshao can you review this patch?
[GitHub] spark pull request #19885: [SPARK-22587] Spark job fails if fs.defaultFS and...
GitHub user merlintang opened a pull request: https://github.com/apache/spark/pull/19885 [SPARK-22587] Spark job fails if fs.defaultFS and application jar are different url
## What changes were proposed in this pull request?
The filesystem comparison does not consider the authority of the URI. Therefore, we add the authority to the comparison: two filesystems with different authorities cannot be the same FS.
Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/merlintang/spark EAR-7377 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19885.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19885 commit 3675f0a41fc0715d3d7122bbff3ab6d8fbe057c9 Author: Mingjie Tang Date: 2017-12-04T23:31:31Z SPARK-22587 Spark job fails if fs.defaultFS and application jar are different url
[GitHub] spark issue #16165: [SPARK-8617] [WEBUI] HistoryServer: Include in-progress ...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16165 @markhamstra Thanks all. By the way: what if there are many redundant in-progress files on disk that impact system performance? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #16165: [SPARK-8617] [WEBUI] HistoryServer: Include in-progress ...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16165 @vanzin sorry, I meant 2.1.1.
[GitHub] spark issue #16165: [SPARK-8617] [WEBUI] HistoryServer: Include in-progress ...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16165 Should we backport this into 2.1? @vanzin
[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/17092 @Yunni I tested this patch locally and it works, but I have one idea to improve it. We can discuss it in another ticket.
[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/17092 @Yunni ok, let us discuss the further optimization step in another ticket. The current patch LGTM.
[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni thanks. Where I mention L, it is the number of hash tables. This way, the memory usage would be O(L*N), and the approximate NN search cost in one partition is O(L*N'), where N is the size of the input dataset and N' is the number of data points in one partition. Right?
[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni OK, if we want to move this quicker, we can keep the current AND-OR implementation. (2)(3) You mention that you explode the inner table (dataset). Does that mean for each tuple of the inner table (say t_i) and multiple hash functions (say h_0, h_1, ..., h_l) you create multiple rows like (h_0, t_i), (h_1, t_i), ..., (h_l, t_i)? Am I correct?
[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni Yes, we can use AND-OR to increase the probability by raising numHashTables and numHashFunctions. For further user extension, if users have a hash function with a lower probability, OR-AND could be used. (1) We do not need to change Array[Vector], numHashTables, or numHashFunctions; we need to change the function that computes the hash distance (i.e., hashDistance), as well as the sameBucket function in approxNearestNeighbors. (3) For the similarity join, I have one question: if you do a join based on the hashed values of the input tuples, the join key would be Array(Vector). Am I right? If so, does this meet OR-amplification? Please clarify if I am wrong. (4) For the index part, I think it would work. It is pretty similar to the routing table idea in GraphX: we can create another data frame with the same partitioner as the input data frame, so the newly created data frame would contain the index for the input tables without disturbing the original data frame. (5) The other major concern is memory overhead. Can we reduce the memory usage for the output hash value, i.e., Array(Vector)? Users have reported that the current approach consumes extensive memory, so one option is to use bits to represent the hashed value for MinHash; the other is to use a sparse vector. What do you think?
[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 @Yunni I agree that the current NN search and join use AND-OR. We can discuss how to use OR-AND for those two searches as well. The OR-AND option is useful when the effective threshold is low; please refer to the tables on pages 31 and 33 of http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf. You can notice that when p is lower, OR-AND can amplify the hash family probability from 0.0985 to 0.5440.
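The two amplification schemes being compared can be written down directly. A sketch of the standard collision-probability formulas under the usual (r, b) parameterization, not Spark code; the exact figures quoted above depend on the parameters chosen in the slides:

```scala
object Amplification {
  // AND-OR: all r functions must agree within at least one of the b tables.
  def andOr(p: Double, r: Int, b: Int): Double =
    1.0 - math.pow(1.0 - math.pow(p, r), b)

  // OR-AND: in each of the r groups, at least one of the b functions agrees.
  def orAnd(p: Double, r: Int, b: Int): Double =
    math.pow(1.0 - math.pow(1.0 - p, b), r)
}
```

For example, with p = 0.4 and r = b = 4, `andOr` gives roughly 0.0985 (matching the first figure quoted above) while `orAnd` gives roughly 0.574, illustrating how OR-AND boosts a low base probability.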
[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16965 It seems this patch provides AND-OR amplification. Can we provide an option for users to choose OR-AND amplification as well?
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang closed the pull request at: https://github.com/apache/spark/pull/15819
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819 Many thanks, Xiao. I learned a lot.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94906952
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -216,5 +219,37 @@ class VersionsSuite extends SparkFunSuite with Logging { "as 'COMPACT' WITH DEFERRED REBUILD") client.reset() }
+
+test(s"$version: CREATE TABLE AS SELECT") {
+  withTable("tbl") {
+    sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
+    assert(sqlContext.table("tbl").collect().toSeq == Seq(Row(1)))
+  }
+}
+
+test(s"$version: Delete the temporary staging directory and files after each insert") {
+  import sqlContext.implicits._
--- End diff --
Thanks, Xiao, I have reverted that and tested locally.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94727237
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -216,5 +219,37 @@ class VersionsSuite extends SparkFunSuite with Logging { "as 'COMPACT' WITH DEFERRED REBUILD") client.reset() }
+
+test(s"$version: CREATE TABLE AS SELECT") {
+  withTable("tbl") {
+    sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
+    assert(sqlContext.table("tbl").collect().toSeq == Seq(Row(1)))
+  }
+}
+
+test(s"$version: Delete the temporary staging directory and files after each insert") {
+  withTempDir { tmpDir =>
+    withTable("tab", "tbl") {
+      sqlContext.sql(
+        s"""
+           |CREATE TABLE tab(c1 string)
--- End diff --
Thanks, it is updated.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94727256
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -216,5 +219,37 @@ class VersionsSuite extends SparkFunSuite with Logging { "as 'COMPACT' WITH DEFERRED REBUILD") client.reset() }
+
+test(s"$version: CREATE TABLE AS SELECT") {
+  withTable("tbl") {
+    sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
+    assert(sqlContext.table("tbl").collect().toSeq == Seq(Row(1)))
+  }
+}
+
+test(s"$version: Delete the temporary staging directory and files after each insert") {
+  withTempDir { tmpDir =>
+    withTable("tab", "tbl") {
+      sqlContext.sql(
+        s"""
+           |CREATE TABLE tab(c1 string)
+           |location '${tmpDir.toURI.toString}'
+         """.stripMargin)
+
+      import sqlContext.implicits._
--- End diff --
Done.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94727246
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -54,6 +63,63 @@ case class InsertIntoHiveTable( @transient private lazy val hiveContext = new Context(sc.hiveconf) @transient private lazy val catalog = sc.catalog
+  @transient var createdTempDir: Option[Path] = None
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+    val rand: Random = new Random
+    val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
+    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
+    executionId
--- End diff --
Done! Thanks, Xiao.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819 @gatorsmile can you retest the patch so we can merge? Sorry to ping you multiple times; several users are asking about this.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94361979

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -216,5 +218,37 @@ class VersionsSuite extends SparkFunSuite with Logging {
       "as 'COMPACT' WITH DEFERRED REBUILD")
     client.reset()
   }
+
+    test(s"$version: CREATE TABLE AS SELECT") {
+      withTable("tbl") {
+        sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
+        assert(sqlContext.table("tbl").collect().toSeq == Seq(Row(1)))
+      }
+    }
+
+    test(s"$version: Delete the temporary staging directory and files after each insert") {
+      withTempDir { tmpDir =>
+        withTable("tab", "tbl") {
+          sqlContext.sql(
+            s"""
+               |CREATE TABLE tab(c1 string)
+               |location '${tmpDir.toURI.toString}'
+             """.stripMargin)
+
+          sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
--- End diff --

Sorry, Xiao: one of my best friends is named Tao. :) It is updated. Thanks again.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94359244

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
(same hunk as above, ending with:)
+          sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
--- End diff --

Thanks, Tao. I created a DataFrame and then registered a temp table as follows:

    val df = sqlContext.createDataFrame((1 to 2).map(i => (i, "a"))).toDF("key", "value")
    df.select("value").repartition(1).registerTempTable("tbl")

It works, but it looks a bit hacky. What do you think?
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94351849

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -54,6 +63,63 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog

+  @transient var createdTempDir: Option[Path] = None
+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+    val rand: Random = new Random
+    val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
+    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
+    return executionId
--- End diff --

Done.
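The `executionId` helper reviewed in the diff above just builds a timestamped, randomized name for the staging directory. A minimal sketch of the same naming scheme, written here in plain Java for illustration (the real code is the Scala shown in the diff):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

public class ExecutionIdSketch {
    // Mirrors the helper in the diff: "hive_<timestamp>_<random>".
    // Note: Math.abs(Long.MIN_VALUE) is negative; the Spark code shares this edge case.
    static String executionId() {
        Random rand = new Random();
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS");
        return "hive_" + format.format(new Date()) + "_" + Math.abs(rand.nextLong());
    }

    public static void main(String[] args) {
        System.out.println(executionId());
    }
}
```

The timestamp plus random suffix makes concurrent inserts land in distinct staging directories.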
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94351862

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
(same hunk as above, ending with:)
+          sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
--- End diff --

Is the temporary view supported in 1.6.x? I used the HiveContext to create the view, but it does not work. Since this is a small test case, the created table here should be fine. Please advise. Thanks so much, Tao.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@gatorsmile I have backported the test case from #16399 with a small modification: "INSERT OVERWRITE TABLE tab SELECT '$i'" triggers an issue on the Hive side (e.g., https://issues.apache.org/jira/browse/HIVE-12200), so I instead create a temp table and insert the data by selecting from it. Please double-check and verify.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

Yes, let me backport the test cases that check the staging file.

On Thu, Dec 29, 2016 at 10:11 PM, Xiao Li wrote:
> Is that possible to backport the test cases in #16399
> <https://github.com/apache/spark/pull/16399>?
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

Thanks, Wenchen. I have backported the code of #16399 here and tested it locally. Can you review and verify?

On Sun, Dec 25, 2016 at 11:04 PM, Wenchen Fan wrote:
> #16399 <https://github.com/apache/spark/pull/16399> has been merged, feel
> free if you wanna backport to 1.6
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@gatorsmile Great! Thanks so much; I have been pinged multiple times about this bug. :)
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@cloud-fan @gatorsmile I have backported the code from #16134; can you verify it and backport this to Spark 1.6.x?
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@gatorsmile One more customer has run into this issue on Spark 1.6.x. I backported the code from #16134 here and tested it manually. Please verify.
[GitHub] spark issue #16134: [SPARK-18703] [SQL] Drop Staging Directories and Data Fi...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16134

This patch is related to #15819 for Spark 1.6. In #15819 I can now add the code from this patch (#16134), and then we can fix the staging-file issue in Spark 1.6.x.

On Thu, Dec 15, 2016 at 12:54 PM, Reynold Xin wrote:
> sounds good to backport into 2.x branches. We can also backport into 1.6
> if it is easy.
[GitHub] spark issue #16134: [SPARK-18703] [SQL] Drop Staging Directories and Data Fi...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/16134

+1 for backporting to Spark 1.6.x.

On Thu, Dec 15, 2016 at 8:14 AM, Xiao Li wrote:
> The staging directory and files will not be removed when users hitting
> abnormal termination of JVM. In addition, if the JVM does not stop, these
> temporary files could still consume a lot of spaces. Thus, I think we need
> to backport it. However, I am not sure whether we should backport it to all
> the previous versions (2.1, 2.0 and 1.6)
>
> @rxin <https://github.com/rxin> Could you please make a decision?
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@cloud-fan @gatorsmile This patch is related to #16134, and it seems #16134 will be merged soon. Meanwhile, should we backport #16104 into 1.6.x? Please advise. Otherwise, should I just backport #16134 and #12770 to Spark 1.6.x?
[GitHub] spark pull request #16134: [SPARK-18703] [SQL] Drop Staging Directories and ...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/16134#discussion_r92244682

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -328,6 +332,15 @@ case class InsertIntoHiveTable(
         holdDDLTime)
     }

+    // Attempt to delete the staging directory and the inclusive files. If failed, the files are
+    // expected to be dropped at the normal termination of VM since deleteOnExit is used.
+    try {
+      createdTempDir.foreach { path => path.getFileSystem(hadoopConf).delete(path, true) }
+    } catch {
+      case NonFatal(e) =>
+        logWarning(s"Unable to delete staging directory: $stagingDir.\n" + e)
+    }
+
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
--- End diff --

Should we delete the staging files before or after invalidateCache? Does it matter? Logically, we should invalidate the cache first and then remove the intermediate dataset, so that the cache can be recovered from the files on disk. Am I right? Please clarify.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

Great. Once #16134 <https://github.com/apache/spark/pull/16134> is done, we can backport them together.

On Tue, Dec 13, 2016 at 12:18 AM, Wenchen Fan wrote:
> yea, I think we should backport a complete staging dir cleanup
> functionality to 1.6, let's wait for #16134
> <https://github.com/apache/spark/pull/16134>
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@gatorsmile What is the status of this patch? It is a backport, so can you merge it into 1.6.x? More than one user has run into this issue on Spark 1.6.x.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

Do you exit the spark shell? I have tested this: the staging files are removed after we exit the spark shell under Spark 2.0.x. Meanwhile, the staging files are used by Hive to write data, and if a Hive insert fails midway, the staging files can still be used.

On Tue, Dec 6, 2016 at 5:09 PM, lichenglin wrote:
> here is some result for du -h --max-depth=1 .
> 3.3G ./.hive-staging_hive_2016-12-06_18-17-48_899_1400956608265117052-5
> 13G ./.hive-staging_hive_2016-12-06_15-43-35_928_6647980494630196053-5
> 8.6G ./.hive-staging_hive_2016-12-06_17-05-51_951_8422682528744006964-5
> 9.7G ./.hive-staging_hive_2016-12-06_17-14-44_748_6947381677226271245-5
> 9.2G ./day=2016-12-01
> 8.5G ./day=2016-11-19
>
> I run a sql like insert overwrite db.table partition(day='2016-12-06')
> select * from tmpview everyday
> each sql create a "hive-staging folder".
>
> Can I delete the folders manually??
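The behavior discussed above hinges on the JVM's deleteOnExit hook: a path marked for deletion is only removed at normal JVM termination, so a long-running spark-shell (or a killed JVM) leaves its staging directories behind. A minimal sketch in plain Java (the `.hive-staging_demo` prefix is made up for illustration; Spark actually calls the analogous `FileSystem.deleteOnExit` on HDFS paths):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class StagingDirDemo {
    public static void main(String[] args) throws IOException {
        // Create a directory analogous to Hive's .hive-staging_hive_... dirs.
        File staging = Files.createTempDirectory(".hive-staging_demo").toFile();

        // deleteOnExit only removes the path at *normal* JVM termination,
        // so the directory keeps consuming space while the shell is alive.
        staging.deleteOnExit();

        // Until the JVM exits, the directory still exists.
        System.out.println(staging.exists());
    }
}
```

This is why an explicit delete after each insert (as in #16134) is needed in addition to deleteOnExit.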
[GitHub] spark issue #13670: [SPARK-15951] Change Executors Page to use datatables to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/13670

@kishorvpatil You provided the allexecutors endpoint, which returns information about both dead and active executors. The monitoring documentation (http://spark.apache.org/docs/latest/monitoring.html) describes /applications/[app-id]/executors as "A list of all executors for the given application." For the 2.1 version we had better document more clearly what each endpoint means:
/applications/[app-id]/executors xxx
/applications/[app-id]/allexecutors
This confuses people; our tests have already run into this issue.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

@cloud-fan This is related to this PR in 2.0.x: https://github.com/apache/spark/pull/12770
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

OK.

On Sun, Dec 4, 2016 at 6:25 PM, Reynold Xin wrote:
> We have stopped making new releases for 1.5 so it makes no sense to
> backport.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

This bug affects 1.5.x as well as 1.6.x; please backport to 1.5.x as well.

On Sun, Dec 4, 2016 at 6:20 PM, Reynold Xin wrote:
> If it is a bug fix and low risk, sure.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

It is updated.

On Sun, Dec 4, 2016 at 11:23 AM, Xiao Li wrote:
> @merlintang <https://github.com/merlintang> Could you please add
> [Branch-1.6] in your PR title?
[GitHub] spark issue #15819: [SPARK-18372][SQL].Staging directory fail to be removed
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819

Yes, exactly. This patch is only for Spark 1.x. What I proposed here is to use the Spark 2.0.x code to fix the bug in Spark 1.x; you can see this in my previous replies. I do not want to change the code, since that would make 1.x and 2.x diverge significantly.

On Sun, Dec 4, 2016 at 10:08 AM, Xiao Li wrote:
> *@gatorsmile* commented on this pull request.
>
> In sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
> <https://github.com/apache/spark/pull/15819>:
>
> (quoting the getStagingDir hunk)
>
> Almost all the codes in this PR are copied from the existing master. This
> PR is just for branch 1.6

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r88778830

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog

+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+    val rand: Random = new Random
+    val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
+    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
+    return executionId
+  }
+
+  private def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
+    val inputPathUri: URI = inputPath.toUri
+    val inputPathName: String = inputPathUri.getPath
+    val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
+    val stagingPathName: String =
+      if (inputPathName.indexOf(stagingDir) == -1) {
+        new Path(inputPathName, stagingDir).toString
+      } else {
+        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
+      }
+    val dir: Path =
+      fs.makeQualified(
+        new Path(stagingPathName + "_" + executionId + "-" + TaskRunner.getTaskRunnerID))
+    logDebug("Created staging dir = " + dir + " for path = " + inputPath)
+    try {
+      if (!FileUtils.mkdir(fs, dir, true, hadoopConf)) {
+        throw new IllegalStateException("Cannot create staging directory '" + dir.toString + "'")
+      }
+      fs.deleteOnExit(dir)
+    } catch {
+      case e: IOException =>
+        throw new RuntimeException(
--- End diff --

The reason we use this code: (1) the old version needed the Hive package to create the staging directory; in the Hive code, the staging directory is stored in a hash map, and these staging directories are removed when the session is closed. However, our Spark code never triggers the Hive session close, so these directories are never removed. (2) The pushed code simulates the Hive way of creating the staging directory inside Spark rather than relying on Hive, so the staging directory does get removed. (3) I will fix the return-type issue. Thanks for your comments, @srowen.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
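The staging-path naming that point (2) simulates can be sketched as a pure string function. This is a simplified illustration in plain Java (the real code uses Hadoop's `Path` and `FileSystem`, and appends the executionId and task-runner ID; the example paths are made up):

```java
public class StagingPathSketch {
    // stagingDir defaults to Hive's hive.exec.stagingdir value, ".hive-staging".
    static String stagingPathName(String inputPathName, String stagingDir) {
        int idx = inputPathName.indexOf(stagingDir);
        if (idx == -1) {
            // Not already a staging path: nest the staging dir under the input path.
            return inputPathName + "/" + stagingDir;
        } else {
            // Already inside a staging path: truncate back to the staging-dir root.
            return inputPathName.substring(0, idx + stagingDir.length());
        }
    }

    public static void main(String[] args) {
        // Both a plain table path and an existing staging path resolve to the same root.
        System.out.println(stagingPathName("/warehouse/tab", ".hive-staging"));
        System.out.println(stagingPathName("/warehouse/tab/.hive-staging_hive_2016", ".hive-staging"));
    }
}
```

Because the directory is created inside Spark, Spark can call `deleteOnExit` (and later an explicit delete) on it without depending on Hive's session-close cleanup.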
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r88778781

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog

+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+    val rand: Random = new Random
+    val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
+    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
--- End diff --

Yes, it is. I did it this way because I want the code to be exactly the same as the Spark 2.0.x version.
[GitHub] spark issue #15819: [SPARK-18372][SQL].Staging directory fail to be removed
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819 @cloud-fan @rxin could you review this code? Several customers are complaining about the empty staging files that Hive generates in HDFS.
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
Github user merlintang commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r88345264

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -54,6 +61,61 @@ case class InsertIntoHiveTable(
   @transient private lazy val hiveContext = new Context(sc.hiveconf)
   @transient private lazy val catalog = sc.catalog

+  val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR)
+
+  private def executionId: String = {
+    val rand: Random = new Random
+    val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
+    val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
+    return executionId
--- End diff --

Hi @fidato13, this is ok, since this part of the code is reused from Spark 2.0.2.
[GitHub] spark issue #15819: [SPARK-18372][SQL].Staging directory fail to be removed
Github user merlintang commented on the issue: https://github.com/apache/spark/pull/15819 Actually, I do not have a unit test, but the code below (the same as we posted in the JIRA) can reproduce this bug:

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)")
    sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' INTO TABLE T1")
    sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)")
    val sparktestdf = sqlContext.table("T1")
    val dfw = sparktestdf.write
    dfw.insertInto("T2")
    val sparktestcopypydfdf = sqlContext.sql("""SELECT * from T2 """)
    sparktestcopypydfdf.show

Our customers and we ourselves have also reproduced this bug manually on Spark 1.6.x and 1.5.x. As for a unit test: because we do not know how to find the Hive directory for the related table from a test case, we cannot check the computed directory at the end. The fix reuses three functions from 2.0.2 to create the staging directory, which resolves the bug.

On Wed, Nov 9, 2016 at 10:26 PM, Wenchen Fan wrote:
> do you have a unit test to reproduce this bug?
>
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/15819#issuecomment-259611432>, or mute the thread
> <https://github.com/notifications/unsubscribe-auth/ABXY-YcT4gOF3RyXk0YhQTVZpHYVDSHRks5q8rj6gaJpZM4KtFSt>
> .
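The key idea behind the fix described above is that Spark creates the staging directory itself and registers it for deletion, instead of leaving cleanup to a Hive session close that never happens. A minimal sketch, using the local filesystem as a stand-in for HDFS (the helper name here is illustrative, not the exact 2.0.2 function; the real fix registers the HDFS path via Hadoop's FileSystem.deleteOnExit):

```scala
import java.io.File

// Illustrative sketch: create the staging directory under the table's
// directory and register it for removal at JVM shutdown, so it is
// cleaned up even if no Hive session close is ever triggered.
def getStagingDir(tableDir: File, stagingPrefix: String, executionId: String): File = {
  val dir = new File(tableDir, stagingPrefix + executionId)
  if (!dir.isDirectory && !dir.mkdirs()) {
    sys.error("Cannot create staging directory: " + dir)
  }
  dir.deleteOnExit() // removed when the JVM exits, mirroring fs.deleteOnExit
  dir
}
```

With this scheme the directory's lifetime is tied to the Spark process rather than to Hive's session-close hooks.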
[GitHub] spark pull request #15819: [SPARK-18372][SQL].Staging directory fail to be r...
GitHub user merlintang opened a pull request: https://github.com/apache/spark/pull/15819 [SPARK-18372][SQL].Staging directory fail to be removed

## What changes were proposed in this pull request?

This fix is related to the bug https://issues.apache.org/jira/browse/SPARK-18372. InsertIntoHiveTable generates a .staging directory, but this directory fails to be removed in the end.

## How was this patch tested?

Manual tests.

Author: Mingjie Tang

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/merlintang/spark branch-1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15819.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15819

commit ac65375a64c2a8a2fe019dc0e2c031f413df74b8
Author: Mingjie Tang
Date: 2016-11-09T00:41:32Z

    SPARK-18372