[GitHub] spark pull request: [SPARK-4057] Use -agentlib instead of -Xdebug ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2904#issuecomment-60196011 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22063/consoleFull) for PR 2904 at commit [`26b4af8`](https://github.com/apache/spark/commit/26b4af8ffc82aca784df6c4b4fd38e9083babc54). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-3795] Heuristics for dynamically s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2746#issuecomment-60196383 @sryza just so I understand. I tell YARN I want 10 executors to be pending. Then say YARN grants me two executors. Does it internally decrement the pending number to 8 (and can I read back that state?). Or could we just infer that it has decremented the counter based on getting new executors? How would it work?
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2903#issuecomment-60196521 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22064/consoleFull) for PR 2903 at commit [`b031640`](https://github.com/apache/spark/commit/b0316405074a617b1573bdd1c8285fc043835f82). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2901#issuecomment-60196729 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22058/consoleFull) for PR 2901 at commit [`444f100`](https://github.com/apache/spark/commit/444f10018326ca47676b46f5801eb7ee83b62241). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DateType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2901#issuecomment-60196733 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22058/ Test PASSed.
[GitHub] spark pull request: [BUILD] Fixed resolver for scalastyle plugin a...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2877
[GitHub] spark pull request: [SPARK-4058] [PySpark] Log file name is hard c...
GitHub user sarutak opened a pull request: https://github.com/apache/spark/pull/2905 [SPARK-4058] [PySpark] Log file name is hard coded even though there is a variable '$LOG_FILE' In the script 'python/run-tests', the log file name is held in the variable 'LOG_FILE' and that variable is used throughout run-tests, but the script still contains some hard-coded log file names. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sarutak/spark SPARK-4058 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2905.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2905 commit 7710490e2c38e202c29e35445a77f1a070fbd678 Author: Kousuke Saruta saru...@oss.nttdata.co.jp Date: 2014-10-23T06:15:04Z Fixed python/run-tests not to use hard-coded log file name
[GitHub] spark pull request: [WIP][SPARK-3795] Heuristics for dynamically s...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2746#issuecomment-60197171 So yeah it internally decrements the pending number to 8. The app can and is expected to infer YARN has decremented the counter. Maybe TMI, but for getting a grasp on it, it might be helpful to understand the race conditions this approach exposes - i.e. there are situations where YARN can overallocate. For example imagine you requested 10 and then you decide you want 11. YARN just got 2 for you and decremented its counter to 8. You might tell YARN you want 11 before finding out about the 2 YARN is giving to you, which means you would overwrite the 8 with 11. In the brief period before you can go back to YARN and tell it you only want 9 now, it could conceivably give you 11 containers, for a total of 13, which is more than you ever asked for. The app is expected to handle these situations and release allocated containers that it doesn't need.
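[Editor's note] A minimal sketch of the "release containers you didn't ask for" handling described above, using the standard Hadoop `AMRMClient` API; `targetNumExecutors` and `runningExecutors` are hypothetical bookkeeping fields for illustration, not part of this PR:

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.hadoop.yarn.api.records.Container
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Sketch only: on each heartbeat, accept containers up to the current target and
// hand any surplus straight back to YARN instead of launching executors on them.
def onHeartbeat(
    amClient: AMRMClient[ContainerRequest],
    targetNumExecutors: Int,
    runningExecutors: mutable.Set[Container]): Unit = {
  val response = amClient.allocate(0.1f)  // heartbeat; also returns newly granted containers
  for (container <- response.getAllocatedContainers.asScala) {
    if (runningExecutors.size < targetNumExecutors) {
      runningExecutors += container       // launch an executor on it (omitted)
    } else {
      // Over-allocated because the pending count was overwritten mid-flight:
      // release the container rather than run an unwanted executor.
      amClient.releaseAssignedContainer(container.getId)
    }
  }
}
```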
[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2901#issuecomment-60197221 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22059/ Test PASSed.
[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2901#issuecomment-60197217 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22059/consoleFull) for PR 2901 at commit [`f760d8e`](https://github.com/apache/spark/commit/f760d8e6344a7bbfa49dbfb9324cf5b0cdba9223). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DateType(PrimitiveType):`
[GitHub] spark pull request: [SPARK-4058] [PySpark] Log file name is hard c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2905#issuecomment-60197430 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22065/consoleFull) for PR 2905 at commit [`7710490`](https://github.com/apache/spark/commit/7710490e2c38e202c29e35445a77f1a070fbd678). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2087#issuecomment-60198186 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull) for PR 2087 at commit [`23010b8`](https://github.com/apache/spark/commit/23010b850b28fccd9b33b0352c4bc2cb5f5dd45c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2087#issuecomment-60198189 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22061/ Test PASSed.
[GitHub] spark pull request: [Spark-4041][SQL]attributes names in table sca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2884#issuecomment-60198477 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/411/consoleFull) for PR 2884 at commit [`3ff3a80`](https://github.com/apache/spark/commit/3ff3a8094f0d5c6aa50a53ac6b08345c1c7a3f69). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2903#issuecomment-60198488 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22062/consoleFull) for PR 2903 at commit [`272e41e`](https://github.com/apache/spark/commit/272e41e6ce363a4c6386a9aff7c11a03df525281). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-3795] Heuristics for dynamically s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2746#issuecomment-60198472 Yep - that's exactly what I was wondering about. If YARN doesn't expose the internal counter it seems like there is a race (actually even if it does expose it, there still is a minor race where you could read it and then reset it but it changes in the middle). I guess we just live with it...
[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2520#issuecomment-60198448 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22066/consoleFull) for PR 2520 at commit [`f5400bd`](https://github.com/apache/spark/commit/f5400bd1d06198d9b4ad02b8974957174c9668cb). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2903#issuecomment-60198490 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22062/ Test PASSed.
[GitHub] spark pull request: [SPARK-4032] Deprecate YARN alpha support in S...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2878#issuecomment-60198553 Yeah, maybe the output here is too noisy for it to be noticeable. I agree having something in the Client itself is a good idea.
[GitHub] spark pull request: [SPARK-4037][SQL] Removes the SessionState ins...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/2887#discussion_r19261442
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suite.scala ---
@@ -150,10 +150,12 @@ class HiveThriftServer2Suite extends FunSuite with Logging {
     val dataFilePath = Thread.currentThread().getContextClassLoader.getResource("data/files/small_kv.txt")
-    val queries = Seq(
-      "CREATE TABLE test(key INT, val STRING)",
-      s"LOAD DATA LOCAL INPATH '$dataFilePath' OVERWRITE INTO TABLE test",
-      "CACHE TABLE test")
+    val queries =
+      s"""SET spark.sql.shuffle.partitions=3;
--- End diff --
This SET command is used as a regression test of SPARK-4037.
[GitHub] spark pull request: [Spark-4041][SQL]attributes names in table sca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2884#issuecomment-60198722 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/411/consoleFull) for PR 2884 at commit [`3ff3a80`](https://github.com/apache/spark/commit/3ff3a8094f0d5c6aa50a53ac6b08345c1c7a3f69). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60200516 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22060/ Test FAILed.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60200512 **[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22060/consoleFull)** for PR 2615 at commit [`897ec60`](https://github.com/apache/spark/commit/897ec603b3e07cb9ce4dda1fea4abdf30466493e) after a configured wait of `120m`.
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60201254 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22067/consoleFull) for PR 2882 at commit [`3881706`](https://github.com/apache/spark/commit/38817069e66cc8c161cc2a8033873a3342cff4e2). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4019] [SPARK-3740] Fix MapStatus compre...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2866#issuecomment-60201281 @JoshRosen thanks for doing this. There is a chance that a normal hashset is much slower than a bitmap. Can you test that? It might make a lot more sense to use an uncompressed bitmap to track after deserialization instead.
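[Editor's note] A rough illustration of the trade-off being discussed; the class and field names below are hypothetical, not the actual `MapStatus` code:

```scala
import java.util.BitSet
import scala.collection.mutable

// Tracking which of `numBlocks` shuffle blocks are empty: a BitSet stores one bit
// per block id and answers membership with a single word lookup.
class EmptyBlockTracker(numBlocks: Int) {
  private val emptyBlocks = new BitSet(numBlocks)
  def markEmpty(reduceId: Int): Unit = emptyBlocks.set(reduceId)
  def isEmpty(reduceId: Int): Boolean = emptyBlocks.get(reduceId)
}

// The HashSet alternative boxes every id and pays hashing plus pointer chasing on
// each lookup, which is the slowdown being asked about.
val emptySet = mutable.HashSet[Int]()
emptySet += 7
val blockIsEmpty = emptySet.contains(7)
```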
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2903#issuecomment-60201398 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22064/consoleFull) for PR 2903 at commit [`b031640`](https://github.com/apache/spark/commit/b0316405074a617b1573bdd1c8285fc043835f82). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `throw new SparkException("Failed to load class to register with Kryo", e)` * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])`
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2903#issuecomment-60201403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22064/ Test PASSed.
[GitHub] spark pull request: [Spark-4041][SQL]attributes names in table sca...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2884#issuecomment-60201488 test failed due to streaming compile error, can you retest this?
[GitHub] spark pull request: [SPARK-4058] [PySpark] Log file name is hard c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2905#issuecomment-60202537 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22065/consoleFull) for PR 2905 at commit [`7710490`](https://github.com/apache/spark/commit/7710490e2c38e202c29e35445a77f1a070fbd678). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4058] [PySpark] Log file name is hard c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2905#issuecomment-60202542 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22065/ Test PASSed.
[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2520#issuecomment-60203866 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22066/consoleFull) for PR 2520 at commit [`f5400bd`](https://github.com/apache/spark/commit/f5400bd1d06198d9b4ad02b8974957174c9668cb). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2520#issuecomment-60203876 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22066/ Test PASSed.
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60208450 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22067/consoleFull) for PR 2882 at commit [`3881706`](https://github.com/apache/spark/commit/38817069e66cc8c161cc2a8033873a3342cff4e2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LogInfo(startTime: Long, endTime: Long, path: String)`
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60208456 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22067/ Test PASSed.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/2615#discussion_r19265089
--- Diff: dev/change-version-to-2.10.sh ---
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+find -name 'pom.xml' -exec sed -i 's|\(artifactId.*\)_2.11|\1_2.10|g' {} \;
--- End diff --
I tried that, unfortunately in effective pom(s) that stays as is (i.e. $scala.version is not changed to 2.10).
[GitHub] spark pull request: [Spark-4041][SQL]attributes names in table sca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2884#issuecomment-60209325 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/412/consoleFull) for PR 2884 at commit [`3ff3a80`](https://github.com/apache/spark/commit/3ff3a8094f0d5c6aa50a53ac6b08345c1c7a3f69). * This patch merges cleanly.
[GitHub] spark pull request: [Spark-4041][SQL]attributes names in table sca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2884#issuecomment-60209693 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/412/consoleFull) for PR 2884 at commit [`3ff3a80`](https://github.com/apache/spark/commit/3ff3a8094f0d5c6aa50a53ac6b08345c1c7a3f69). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60213859 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22068/consoleFull) for PR 2882 at commit [`9514dc8`](https://github.com/apache/spark/commit/9514dc833c9c30be12eeb64fb4580c2e6f1adb4f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
GitHub user yu-iskw opened a pull request: https://github.com/apache/spark/pull/2906 [SPARK-2429] [MLlib] Hierarchical Implementation of KMeans I want to add a divisive hierarchical clustering algorithm implementation to MLlib. It doesn't support distance metrics other than the Euclidean distance metric yet; it would be nice to add them in a follow-up issue. Could you review it? Thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/yu-iskw/spark hierarchical Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2906.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2906
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60213891 @JoshRosen @harishreedharan addressed all your comments, and also simplified the writer code. I did some further cleanups, and also added two new unit tests that test the writer and manager with corrupted writes.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-60214129 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3812] [BUILD] Adapt maven build to publ...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/2673#issuecomment-60214941 @ScrapCodes @pwendell This patch will cause a `maven-assembly-plugin` error: `./make-distribution.sh -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1 -Phadoop-2.3 -Pyarn -Pnetlib-lgpl` followed by `du -sh dist/lib/*` gives:
```
4.0K    dist/lib/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.1.jar
928K    dist/lib/spark-examples-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.1.jar
```
[GitHub] spark pull request: [SPARK-4057] Use -agentlib instead of -Xdebug ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2904#issuecomment-60215591 +1 I can confirm that `-Xdebug` went away in Java 5 I think and this is the modern invocation of the debugger.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user rnowling commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19267797
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * this class is used for the model of the hierarchical clustering
+ *
+ * @param clusterTree a cluster as a tree node
+ * @param trainTime the milliseconds for executing a training
+ * @param predictTime the milliseconds for executing a prediction
+ * @param isTrained if the model has been trained, the flag is true
+ */
+class HierarchicalClusteringModel private (
+    val clusterTree: ClusterTree,
+    var trainTime: Int,
+    var predictTime: Int,
+    var isTrained: Boolean) extends Serializable {
+
+  def this(clusterTree: ClusterTree) = this(clusterTree, 0, 0, false)
+
+  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
+
+  def getCenters(): Array[Vector] = getClusters().map(_.center)
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(vector: Vector): Int = {
+    // TODO Supports distance metrics other Euclidean distance metric
+    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+    this.clusterTree.assignClusterIndex(metric)(vector)
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
+    val startTime = System.currentTimeMillis() // to measure the execution time
+
+    // TODO Supports distance metrics other Euclidean distance metric
+    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+    val centers = getClusters().map(_.center.toBreeze)
+    val treeRoot = this.clusterTree
+    val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
+    data.sparkContext.broadcast(closestClusterIndexFinder)
+    val predicted = data.map(point => (closestClusterIndexFinder(point), point))
--- End diff --
I don't think you're using the broadcast variable correctly: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
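[Editor's note] For reference, the broadcast pattern the programming guide describes looks roughly like this: the handle returned by `broadcast` must be captured and dereferenced with `.value` inside the closure; the call in the diff above discards the handle, so the closure still serializes the plain function. A minimal sketch, assuming `data: RDD[Vector]` and `closestClusterIndexFinder: Vector => Int` as in the diff:

```scala
// Capture the broadcast handle and read it via .value inside the task closure.
val bcFinder = data.sparkContext.broadcast(closestClusterIndexFinder)
val predicted = data.map(point => (bcFinder.value(point), point))
```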
[GitHub] spark pull request: [Spark-4041][SQL]attributes names in table sca...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/2884#issuecomment-60216081 Hm, the failure was caused by a known Jenkins configuration issue.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user rnowling commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19267891
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -91,6 +99,58 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
         return KMeansModel([c.toArray() for c in centers])
+class HierarchicalClusteringModel(ClusteringModel):
--- End diff --
The predict method seems to be O(kN) but you can do assignment in O(N log k) time with the tree, right? (N is the number of data points, k is the number of cluster centers).
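[Editor's note] A hedged sketch of what O(log k)-per-point assignment by descending the cluster tree could look like; `ClusterNode`, `left`, `right`, `center`, and `index` are illustrative names, not the API proposed in this PR:

```scala
import breeze.linalg.{DenseVector => BDV, norm}

// Illustrative binary cluster tree: each internal node keeps its two children, so a
// point can be routed toward the closer child's center at every level.
case class ClusterNode(
    center: BDV[Double],
    index: Int,
    left: Option[ClusterNode],
    right: Option[ClusterNode])

def assign(node: ClusterNode, point: BDV[Double]): Int = (node.left, node.right) match {
  case (Some(l), Some(r)) =>
    // Descend toward the closer child; with a balanced tree this visits O(log k) nodes
    // per point instead of comparing against all k leaf centers.
    if (norm(point - l.center) <= norm(point - r.center)) assign(l, point)
    else assign(r, point)
  case _ => node.index  // leaf: its index is the cluster assignment
}
```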
[GitHub] spark pull request: [SPARK-3954][Streaming] promote the speed of c...
Github user surq commented on the pull request: https://github.com/apache/spark/pull/2811#issuecomment-60217711 Has anyone taken notice of this patch?
[GitHub] spark pull request: MLlib, exposing special rdd functions to the p...
GitHub user numbnut opened a pull request: https://github.com/apache/spark/pull/2907 MLlib, exposing special rdd functions to the public You can merge this pull request into a Git repository by running: $ git pull https://github.com/numbnut/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2907.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2907 commit b3d8945d6fa0bc28b90a8409ced29fd78b34e752 Author: Niklas Wilcke 1wil...@informatik.uni-hamburg.de Date: 2014-10-23T09:43:27Z expose mllib specific rdd functions to the public
[GitHub] spark pull request: [Spark-4060] [MLlib] exposing special rdd func...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2907#issuecomment-60218336 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4061] We cannot use EOL character in th...
GitHub user sarutak opened a pull request: https://github.com/apache/spark/pull/2908 [SPARK-4061] We cannot use EOL character in the operand of LIKE predicate. We cannot use an EOL character like \n or \r in the operand of a LIKE predicate, so the following condition is never true. -- someStr is 'hoge\nfuga' where someStr LIKE 'hoge_fuga' You can merge this pull request into a Git repository by running: $ git pull https://github.com/sarutak/spark spark-sql-like-match-modification Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2908.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2908 commit 38f66519ae95ec5d41705fc499e2cd658de4 Author: Kousuke Saruta saru...@oss.nttdata.co.jp Date: 2014-10-23T10:07:14Z Fixed LIKE predicate so that we can use an EOL character in an operand
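[Editor's note] A sketch of the underlying symptom, not the PR's exact code: Spark SQL compiles LIKE patterns into a regular expression, and by default `.` in a Java/Scala regex does not match a line terminator unless DOTALL is enabled, so a `_` translated to `.` never matches `\n`:

```scala
val someStr = "hoge\nfuga"

// '_' in a LIKE pattern is typically translated to '.', which skips the newline:
val withoutDotall = "hoge.fuga".r.findFirstIn(someStr)   // None
// With DOTALL ("(?s)") the same pattern matches across the EOL character:
val withDotall = "(?s)hoge.fuga".r.findFirstIn(someStr)  // Some("hoge\nfuga")
```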
[GitHub] spark pull request: [SPARK-4061] We cannot use EOL character in th...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2908#issuecomment-60218997 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22069/consoleFull) for PR 2908 at commit [`38f6651`](https://github.com/apache/spark/commit/38f66519ae95ec5d41705fc499e2cd658de4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60219275 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22068/ Test FAILed.
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60219269 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22068/consoleFull) for PR 2882 at commit [`9514dc8`](https://github.com/apache/spark/commit/9514dc833c9c30be12eeb64fb4580c2e6f1adb4f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LogInfo(startTime: Long, endTime: Long, path: String)`
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60220879 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22070/consoleFull) for PR 2886 at commit [`df9d98f`](https://github.com/apache/spark/commit/df9d98fe6703f6cc37fb0187fa55d140f37bb50e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3900][YARN] ApplicationMaster's shutdow...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/2755#issuecomment-60220899 /CC @tgravescs
[GitHub] spark pull request: specify unidocGenjavadocVersion of 0.8
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2893#issuecomment-60221337 This is for SPARK-3359. LGTM, thank you. This gets past some errors, and turns up more, which I'll comment on in the JIRA. But this is a step forward.
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user tsliwowicz commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60221362 @andrewor14 - thanks for the comments. I believe I fixed them all. Let me know!
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60221739 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22072/consoleFull) for PR 2886 at commit [`094d508`](https://github.com/apache/spark/commit/094d508fed9aa57beb60d7a571cbe7c1e3b334c1). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60222452 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22071/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user tsliwowicz commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60222733 The failure looks like a build-infrastructure issue rather than something caused by my fix, I think. A local Maven build works fine for me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4061] We cannot use EOL character in th...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2908#issuecomment-60223794 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22069/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4061] We cannot use EOL character in th...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2908#issuecomment-60223791 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22069/consoleFull) for PR 2908 at commit [`38f6651`](https://github.com/apache/spark/commit/38f66519ae95ec5d41705fc499e2cd658de4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60227754 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22070/consoleFull) for PR 2886 at commit [`df9d98f`](https://github.com/apache/spark/commit/df9d98fe6703f6cc37fb0187fa55d140f37bb50e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60227762 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22070/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60228517 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22072/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2886#issuecomment-60228510 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22072/consoleFull) for PR 2886 at commit [`094d508`](https://github.com/apache/spark/commit/094d508fed9aa57beb60d7a571cbe7c1e3b334c1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Clarify docstring for Pyspark's foreachPartiti...
Github user tdhopper commented on the pull request: https://github.com/apache/spark/pull/2895#issuecomment-60234425 Oh. Now that I look at master, @JoshRosen, I see that it's already been fixed by @davis [here](https://github.com/apache/spark/commit/1789cd46e38d1426deb6a4b14bddcbb8c751f585). The fix just isn't in 1.1. I guess we should close this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-732][SPARK-3628][CORE][RESUBMIT] make i...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/2524#issuecomment-60237457 ping --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3900][YARN] ApplicationMaster's shutdow...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2755#issuecomment-60237621 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3900][YARN] ApplicationMaster's shutdow...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2755#issuecomment-60237859 Changes look good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3904] [SQL] add constant objectinspecto...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/2762#issuecomment-60242669 Thank you @liancheng, I've updated the code accordingly. You're right that the conversion is not very efficient; we probably need to add some Expression nodes for the data conversion. Let's do that in a follow-up. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3900][YARN] ApplicationMaster's shutdow...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/2755#issuecomment-60250844 Hm... test wouldn't start... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4061][SQL] We cannot use EOL character ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/2908#issuecomment-60256774 Good catch! Would you mind adding a unit test for this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
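For what it's worth, a minimal sketch of what such a test could look like, assuming the `Like` and `Literal` expression constructors quoted in the diff discussed below and a null-row `eval`; the suite name and placement are illustrative, not part of the PR:

```scala
import org.scalatest.FunSuite
import org.apache.spark.sql.catalyst.expressions.{Like, Literal}

// Illustrative test only: checks that a LIKE pattern still matches when the
// value contains an EOL character, which is what SPARK-4061 is about.
class LikeWithEolSuite extends FunSuite {
  test("LIKE '%...%' matches values containing a newline") {
    val value = Literal("first line\nsecond line")
    assert(Like(value, Literal("%second%")).eval(null) == true)
  }
}
```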
[GitHub] spark pull request: [SPARK-4061][SQL] We cannot use EOL character ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/2908#discussion_r19284542 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

```diff
@@ -103,21 +103,21 @@ case class Like(left: Expression, right: Expression)
   // replace the _ with .{1} exactly match 1 time of any character
   // replace the % with .*, match 0 or more times with any character
   override def escape(v: String) = {
-    val sb = new StringBuilder()
-    var i = 0;
+    val sb = new StringBuilder("(?s)")
+    var i = 0
     while (i < v.length) {
       // Make a special case for "\\_" and "\\%"
-      val n = v.charAt(i);
+      val n = v.charAt(i)
       if (n == '\\' && i + 1 < v.length && (v.charAt(i + 1) == '_' || v.charAt(i + 1) == '%')) {
         sb.append(v.charAt(i + 1))
         i += 1
       } else {
         if (n == '_') {
-          sb.append(".");
+          sb.append(".")
         } else if (n == '%') {
-          sb.append(".*");
+          sb.append(".*")
         } else {
-          sb.append(Pattern.quote(Character.toString(n)));
+          sb.append(Pattern.quote(Character.toString(n)))
         }
       }
```

--- End diff --

I have mixed feelings about this... This function is not on the critical path, so I'd like to refactor it in a more functional and readable (but less efficient) way, for example:

```scala
override def escape(v: String) = "(?s)" + (' ' +: v.init).zip(v).flatMap {
  case (prefix, '_') => if (prefix == '\\') "_" else "."
  case (prefix, '%') => if (prefix == '\\') "%" else ".*"
  case (_, ch) => Character.toString(ch)
}.mkString
```

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
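As a side note on why the `"(?s)"` prefix matters here (an illustration, not from the thread): without DOTALL, `.` and `.*` cannot cross a newline, so an escaped `%...%` pattern fails on multi-line values:

```scala
import java.util.regex.Pattern

// "%def%" escapes to ".*" + quoted("def") + ".*"; only the "(?s)" variant
// matches a value that contains a newline.
val body = ".*" + Pattern.quote("def") + ".*"
println("abc\ndef".matches(body))           // false: '.' does not match '\n'
println("abc\ndef".matches("(?s)" + body))  // true: DOTALL lets '.*' span lines
```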
[GitHub] spark pull request: [SPARK-2663] [SQL] Support the Grouping Set
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1567#issuecomment-60257141 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22073/consoleFull) for PR 1567 at commit [`76f474e`](https://github.com/apache/spark/commit/76f474e41a172d5128f99c9ae71c7b802b9114fa). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work ...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/2909 SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work with Java 8 This follows https://github.com/apache/spark/pull/2893 , but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-3359 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2909.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2909 commit f62c347e2df9d7e63653c2bf42004e86f7a80b27 Author: Sean Owen so...@cloudera.com Date: 2014-10-23T15:55:22Z Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
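For readers unfamiliar with the underlying problem, a hand-made illustration (not taken from the PR) of the kind of doc-comment issue that javadoc 8's doclint turns into a hard error once genjavadoc output is fed to it, such as a bare `<` being read as a malformed HTML tag:

```scala
/** Returns -1 if x < 0, otherwise 1. */    // javadoc 8 typically rejects the bare '<'
def signBad(x: Int): Int = if (x < 0) -1 else 1

/** Returns -1 if x &lt; 0, otherwise 1. */ // escaping the character keeps javadoc 8 happy
def signOk(x: Int): Int = if (x < 0) -1 else 1
```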
[GitHub] spark pull request: SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2909#issuecomment-60262260 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22074/consoleFull) for PR 2909 at commit [`f62c347`](https://github.com/apache/spark/commit/f62c347e2df9d7e63653c2bf42004e86f7a80b27). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2903#issuecomment-60265674 LGTM. Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288355 --- Diff: docs/mllib-clustering.md --- @@ -153,3 +157,152 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap section of the Spark Quick Start guide. Be sure to also include *spark-mllib* to your build file as a dependency. + + +### Hierarchical Clustering + +MLlib supports +[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithm which seeks to build a hierarchy of clusters. +Strategies for hierarchical clustering generally fall into two types. +One is the agglomerative clustering which is a bottom up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. +The other is the divisive clustering which is a top down approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. +The MLlib implementation only includes a divisive hierarchical clustering algorithm. + +The implementation in MLlib has the following parameters: + +* *k* is the number of maximum desired clusters. +* *subIterations* is the maximum number of iterations to split a cluster to its 2 sub clusters. +* *numRetries* is the maximum number of retries if a splitting doesn't work as expected. +* *epsilon* determines the saturate threshold to consider the splitting to have converged. + + + +### Hierarchical Clustering Example + +div class=codetabs + +div data-lang=scala markdown=1 +The following code snippets can be executed in `spark-shell`. + +In the following example after loading and parsing data, +we use the hierarchical clustering object to cluster the sample data into three clusters. +The number of desired clusters is passed to the algorithm. +Hoerver, even though the number of clusters is less than *k* in the middle of the clustering, --- End diff -- Horever - However, and 'not be splitted' - 'not be split' --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
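To make the documented parameters concrete, here is a rough spark-shell sketch pieced together from the configuration class quoted later in this thread; the API belongs to the PR under review, not to a released MLlib, so names and defaults are assumptions:

```scala
import org.apache.spark.mllib.clustering.{HierarchicalClustering, HierarchicalClusteringConf}
import org.apache.spark.mllib.linalg.Vectors

// Toy data: three well-separated groups of 2-D points.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(5.0, 5.0), Vectors.dense(5.1, 4.9),
  Vectors.dense(9.0, 0.0), Vectors.dense(9.1, 0.2)))

val conf = new HierarchicalClusteringConf()
  .setNumClusters(3)     // k: maximum number of desired clusters
  .setSubIterations(20)  // max iterations when bisecting one cluster
  .setNumRetries(5)      // max retries if a split does not work as expected
  .setEpsilon(1e-4)      // convergence threshold for a split

val model = new HierarchicalClustering(conf).run(data)
```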
[GitHub] spark pull request: [SPARK-2663] [SQL] Support the Grouping Set
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1567#issuecomment-60266002 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22073/consoleFull) for PR 1567 at commit [`76f474e`](https://github.com/apache/spark/commit/76f474e41a172d5128f99c9ae71c7b802b9114fa). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class GroupingSet(bitmasks: Seq[Int], ` * `case class Cube(groupByExprs: Seq[Expression],` * `case class Rollup(groupByExprs: Seq[Expression],` * `case class VirtualColumn(name: String, dataType: DataType = StringType, nullable: Boolean = false)` * `case class GroupingSetExpansion(` * `case class GroupingSetExpansion(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2663] [SQL] Support the Grouping Set
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1567#issuecomment-60266012 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22073/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4052][SQL] Use scala.collection.Map for...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/2899#discussion_r19288406 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -18,6 +18,7 @@ package org.apache.spark.sql.hive.execution import scala.collection.JavaConversions._ +import scala.collection.Map --- End diff -- I think it's better to use `scala.collection.Map` explicitly in the code below, and add comment to explain. Another reason that makes putting this line here dangerous is that imports can be easily reorganized automatically by IDEs, which are sometimes not smart enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
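A sketch of the explicit form liancheng is suggesting; the signature is illustrative rather than the actual InsertIntoHiveTable code:

```scala
// Spell out scala.collection.Map at the use site so an IDE reordering imports
// cannot silently change which Map is meant (a bare `Map` resolves to
// scala.collection.immutable.Map via Predef), and callers may pass either
// mutable or immutable maps.
def insertInto(partition: scala.collection.Map[String, Option[String]]): Unit = {
  partition.foreach { case (key, value) =>
    println(s"$key=${value.getOrElse("__DEFAULT__")}")  // placeholder body
  }
}
```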
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288604 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { --- End diff -- This may be my Scala ignorance, but if the constructor params aren't private, don't you get setters for free? I see you're going for a fluent style, and that makes sense, but I don't know of the other conf-like or algo-like classes do this. Pretty minor and I could be wrong but consider whether it's worth the code and consistency issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
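To make the comparison concrete (a sketch, not code from the PR): plain `var` constructor parameters already give you generated getters and setters, while the fluent style needs hand-written `setX` methods returning `this.type`:

```scala
// Option 1: public var parameters -- Scala generates the accessors, so
// `conf.numClusters = 10` works without any extra code.
class SimpleConf(var numClusters: Int, var epsilon: Double)

// Option 2: fluent builder style as in the PR -- allows chaining,
// at the cost of writing each setter by hand.
class FluentConf(private var numClusters: Int, private var epsilon: Double) {
  def setNumClusters(k: Int): this.type = { numClusters = k; this }
  def setEpsilon(eps: Double): this.type = { epsilon = eps; this }
}

val chained = new FluentConf(20, 1e-6).setNumClusters(10).setEpsilon(1e-4)
```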
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288634 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { +this.numClusters = numClusters +this + } + + def getNumClusters(): Int = this.numClusters + + def setSubIterations(iterations: Int): this.type = { +this.subIterations = iterations +this + } + + def setNumRetries(numRetries: Int): this.type = { +this.numRetries = numRetries +this + } + + def getNumRetries(): Int = this.numRetries + + def getSubIterations(): Int = this.subIterations + + def setEpsilon(epsilon: Double): this.type = { +this.epsilon = epsilon +this + } + + def getEpsilon(): Double = this.epsilon + + def setRandomSeed(seed: Int): this.type = { +this.randomSeed = seed +this + } + + def getRandomSeed(): Int = this.randomSeed + + def setRandomRange(range: Double): this.type = { +this.randomRange = range +this + } +} + + +/** + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm. 
+ * + * @param conf the configuration class for the hierarchical clustering + */ +class HierarchicalClustering(val conf: HierarchicalClusteringConf) +extends Serializable with Logging { + + /** + * Constructs with the default configuration + */ + def this() = this(new HierarchicalClusteringConf()) + + /** + * Trains a hierarchical clustering model with the given configuration + * + * @param data training points + * @return a model for hierarchical clustering + */ + def run(data: RDD[Vector]): HierarchicalClusteringModel = { +validateData(data) +logInfo(sRun with ${conf.toString}) --- End diff -- Trivial but can this be just `$conf`? and similarly for other format strings --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
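The simplification being asked about, shown for completeness (any class with a useful toString behaves the same way):

```scala
case class Conf(numClusters: Int, epsilon: Double)
val conf = Conf(20, 1e-6)
println(s"Run with ${conf.toString}")  // as written in the PR
println(s"Run with $conf")             // equivalent: interpolation calls toString implicitly
```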
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288713 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { +this.numClusters = numClusters +this + } + + def getNumClusters(): Int = this.numClusters + + def setSubIterations(iterations: Int): this.type = { +this.subIterations = iterations +this + } + + def setNumRetries(numRetries: Int): this.type = { +this.numRetries = numRetries +this + } + + def getNumRetries(): Int = this.numRetries + + def getSubIterations(): Int = this.subIterations + + def setEpsilon(epsilon: Double): this.type = { +this.epsilon = epsilon +this + } + + def getEpsilon(): Double = this.epsilon + + def setRandomSeed(seed: Int): this.type = { +this.randomSeed = seed +this + } + + def getRandomSeed(): Int = this.randomSeed + + def setRandomRange(range: Double): this.type = { +this.randomRange = range +this + } +} + + +/** + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm. 
+ * + * @param conf the configuration class for the hierarchical clustering + */ +class HierarchicalClustering(val conf: HierarchicalClusteringConf) +extends Serializable with Logging { + + /** + * Constructs with the default configuration + */ + def this() = this(new HierarchicalClusteringConf()) + + /** + * Trains a hierarchical clustering model with the given configuration + * + * @param data training points + * @return a model for hierarchical clustering + */ + def run(data: RDD[Vector]): HierarchicalClusteringModel = { +validateData(data) +logInfo(sRun with ${conf.toString}) + +val startTime = System.currentTimeMillis() // to measure the execution time +val clusterTree = ClusterTree.fromRDD(data) // make the root node +val model = new HierarchicalClusteringModel(clusterTree) +val statsUpdater = new ClusterTreeStatsUpdater() + +var node: Option[ClusterTree] = Some(model.clusterTree) +statsUpdater(node.get) + +// If the followed conditions are satisfied, and then stop the training. +// 1. There is no splittable cluster +// 2. The number of the splitted clusters is greater than that of given clusters +// 3. The total variance of all clusters increases, when a cluster is splitted +var totalVariance = Double.MaxValue +var newTotalVariance =
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288686 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { +this.numClusters = numClusters +this + } + + def getNumClusters(): Int = this.numClusters + + def setSubIterations(iterations: Int): this.type = { +this.subIterations = iterations +this + } + + def setNumRetries(numRetries: Int): this.type = { +this.numRetries = numRetries +this + } + + def getNumRetries(): Int = this.numRetries + + def getSubIterations(): Int = this.subIterations + + def setEpsilon(epsilon: Double): this.type = { +this.epsilon = epsilon +this + } + + def getEpsilon(): Double = this.epsilon + + def setRandomSeed(seed: Int): this.type = { +this.randomSeed = seed +this + } + + def getRandomSeed(): Int = this.randomSeed + + def setRandomRange(range: Double): this.type = { +this.randomRange = range +this + } +} + + +/** + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm. 
+ * + * @param conf the configuration class for the hierarchical clustering + */ +class HierarchicalClustering(val conf: HierarchicalClusteringConf) +extends Serializable with Logging { + + /** + * Constructs with the default configuration + */ + def this() = this(new HierarchicalClusteringConf()) + + /** + * Trains a hierarchical clustering model with the given configuration + * + * @param data training points + * @return a model for hierarchical clustering + */ + def run(data: RDD[Vector]): HierarchicalClusteringModel = { +validateData(data) +logInfo(sRun with ${conf.toString}) + +val startTime = System.currentTimeMillis() // to measure the execution time +val clusterTree = ClusterTree.fromRDD(data) // make the root node +val model = new HierarchicalClusteringModel(clusterTree) +val statsUpdater = new ClusterTreeStatsUpdater() + +var node: Option[ClusterTree] = Some(model.clusterTree) +statsUpdater(node.get) + +// If the followed conditions are satisfied, and then stop the training. +// 1. There is no splittable cluster +// 2. The number of the splitted clusters is greater than that of given clusters +// 3. The total variance of all clusters increases, when a cluster is splitted +var totalVariance = Double.MaxValue +var newTotalVariance =
[GitHub] spark pull request: [SPARK-4055][MLlib] Inconsistent spelling 'MLl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2903 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288793 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { +this.numClusters = numClusters +this + } + + def getNumClusters(): Int = this.numClusters + + def setSubIterations(iterations: Int): this.type = { +this.subIterations = iterations +this + } + + def setNumRetries(numRetries: Int): this.type = { +this.numRetries = numRetries +this + } + + def getNumRetries(): Int = this.numRetries + + def getSubIterations(): Int = this.subIterations + + def setEpsilon(epsilon: Double): this.type = { +this.epsilon = epsilon +this + } + + def getEpsilon(): Double = this.epsilon + + def setRandomSeed(seed: Int): this.type = { +this.randomSeed = seed +this + } + + def getRandomSeed(): Int = this.randomSeed + + def setRandomRange(range: Double): this.type = { +this.randomRange = range +this + } +} + + +/** + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm. 
+ * + * @param conf the configuration class for the hierarchical clustering + */ +class HierarchicalClustering(val conf: HierarchicalClusteringConf) +extends Serializable with Logging { + + /** + * Constructs with the default configuration + */ + def this() = this(new HierarchicalClusteringConf()) + + /** + * Trains a hierarchical clustering model with the given configuration + * + * @param data training points + * @return a model for hierarchical clustering + */ + def run(data: RDD[Vector]): HierarchicalClusteringModel = { +validateData(data) +logInfo(sRun with ${conf.toString}) + +val startTime = System.currentTimeMillis() // to measure the execution time +val clusterTree = ClusterTree.fromRDD(data) // make the root node +val model = new HierarchicalClusteringModel(clusterTree) +val statsUpdater = new ClusterTreeStatsUpdater() + +var node: Option[ClusterTree] = Some(model.clusterTree) +statsUpdater(node.get) + +// If the followed conditions are satisfied, and then stop the training. +// 1. There is no splittable cluster +// 2. The number of the splitted clusters is greater than that of given clusters +// 3. The total variance of all clusters increases, when a cluster is splitted +var totalVariance = Double.MaxValue +var newTotalVariance =
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19288871 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { +this.numClusters = numClusters +this + } + + def getNumClusters(): Int = this.numClusters + + def setSubIterations(iterations: Int): this.type = { +this.subIterations = iterations +this + } + + def setNumRetries(numRetries: Int): this.type = { +this.numRetries = numRetries +this + } + + def getNumRetries(): Int = this.numRetries + + def getSubIterations(): Int = this.subIterations + + def setEpsilon(epsilon: Double): this.type = { +this.epsilon = epsilon +this + } + + def getEpsilon(): Double = this.epsilon + + def setRandomSeed(seed: Int): this.type = { +this.randomSeed = seed +this + } + + def getRandomSeed(): Int = this.randomSeed + + def setRandomRange(range: Double): this.type = { +this.randomRange = range +this + } +} + + +/** + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm. 
+ * + * @param conf the configuration class for the hierarchical clustering + */ +class HierarchicalClustering(val conf: HierarchicalClusteringConf) +extends Serializable with Logging { + + /** + * Constructs with the default configuration + */ + def this() = this(new HierarchicalClusteringConf()) + + /** + * Trains a hierarchical clustering model with the given configuration + * + * @param data training points + * @return a model for hierarchical clustering + */ + def run(data: RDD[Vector]): HierarchicalClusteringModel = { +validateData(data) +logInfo(sRun with ${conf.toString}) + +val startTime = System.currentTimeMillis() // to measure the execution time +val clusterTree = ClusterTree.fromRDD(data) // make the root node +val model = new HierarchicalClusteringModel(clusterTree) +val statsUpdater = new ClusterTreeStatsUpdater() + +var node: Option[ClusterTree] = Some(model.clusterTree) +statsUpdater(node.get) + +// If the followed conditions are satisfied, and then stop the training. +// 1. There is no splittable cluster +// 2. The number of the splitted clusters is greater than that of given clusters +// 3. The total variance of all clusters increases, when a cluster is splitted +var totalVariance = Double.MaxValue +var newTotalVariance =
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19289138 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala --- @@ -0,0 +1,549 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector = BDV, Vector = BV, norm = breezeNorm} +import org.apache.spark.Logging +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.random.XORShiftRandom + +/** + * the configuration for a hierarchical clustering algorithm + * + * @param numClusters the number of clusters you want + * @param subIterations the number of iterations at digging + * @param epsilon the threshold to stop the sub-iterations + * @param randomSeed uses in sampling data for initializing centers in each sub iterations + * @param randomRange the range coefficient to generate random points in each clustering step + */ +class HierarchicalClusteringConf( + private var numClusters: Int, + private var subIterations: Int, + private var numRetries: Int, + private var epsilon: Double, + private var randomSeed: Int, + private[mllib] var randomRange: Double) extends Serializable { + + def this() = this(20, 5, 20, 10E-6, 1, 0.1) + + def setNumClusters(numClusters: Int): this.type = { +this.numClusters = numClusters +this + } + + def getNumClusters(): Int = this.numClusters + + def setSubIterations(iterations: Int): this.type = { +this.subIterations = iterations +this + } + + def setNumRetries(numRetries: Int): this.type = { +this.numRetries = numRetries +this + } + + def getNumRetries(): Int = this.numRetries + + def getSubIterations(): Int = this.subIterations + + def setEpsilon(epsilon: Double): this.type = { +this.epsilon = epsilon +this + } + + def getEpsilon(): Double = this.epsilon + + def setRandomSeed(seed: Int): this.type = { +this.randomSeed = seed +this + } + + def getRandomSeed(): Int = this.randomSeed + + def setRandomRange(range: Double): this.type = { +this.randomRange = range +this + } +} + + +/** + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm. 
+ * + * @param conf the configuration class for the hierarchical clustering + */ +class HierarchicalClustering(val conf: HierarchicalClusteringConf) +extends Serializable with Logging { + + /** + * Constructs with the default configuration + */ + def this() = this(new HierarchicalClusteringConf()) + + /** + * Trains a hierarchical clustering model with the given configuration + * + * @param data training points + * @return a model for hierarchical clustering + */ + def run(data: RDD[Vector]): HierarchicalClusteringModel = { +validateData(data) +logInfo(sRun with ${conf.toString}) + +val startTime = System.currentTimeMillis() // to measure the execution time +val clusterTree = ClusterTree.fromRDD(data) // make the root node +val model = new HierarchicalClusteringModel(clusterTree) +val statsUpdater = new ClusterTreeStatsUpdater() + +var node: Option[ClusterTree] = Some(model.clusterTree) +statsUpdater(node.get) + +// If the followed conditions are satisfied, and then stop the training. +// 1. There is no splittable cluster +// 2. The number of the splitted clusters is greater than that of given clusters +// 3. The total variance of all clusters increases, when a cluster is splitted +var totalVariance = Double.MaxValue +var newTotalVariance =
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2906#discussion_r19289245 --- Diff: mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering; + +import com.google.common.collect.Lists; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.Vectors; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +import java.io.Serializable; +import java.util.List; + +import static org.junit.Assert.assertEquals; + +public class JavaHierarchicalClusteringSuite implements Serializable { +private transient JavaSparkContext sc; --- End diff -- Looks like this is using 4-space indent but should be 2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-60268305 I just gave this a quick read-through, and the structure makes sense. I left several small comments. I see the chunks of logic I would expect, but did not evaluate it in detail. The existence of some tests suggests this probably basically works :) I am wondering about performance too as this relies on Scala idioms in many places; it might be worth a quick look with jprofiler if you can to see if there are any easy-win optimizations. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3911] [SQL] HiveSimpleUdf can not be op...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/2771#discussion_r19289432

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -74,4 +76,30 @@ class QueryTest extends FunSuite {
        .stripMargin)
    }
  }
+
+  // The following copy is copied from org.apache.spark.sql.catalyst.plans.PlanTest
--- End diff --

How about making `QueryTest` inherit from `PlanTest` instead? Just like what we did in another `PlanTest` in `sql/core`.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
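[Editor's note] A small, self-contained sketch of the pattern liancheng is suggesting: share test helpers through inheritance instead of copy-pasting them. `PlanTestLike`, `QueryTestLike`, and `comparePlans` here are illustrative stand-ins, not the real `PlanTest` / `QueryTest` classes from the Spark tree.

    // The trait carries the helper that would otherwise be duplicated in every suite.
    trait PlanTestLike {
      protected def comparePlans(expected: String, actual: String): Boolean =
        expected.trim == actual.trim
    }

    // The suite inherits the helper instead of carrying its own copy.
    class QueryTestLike extends PlanTestLike {
      def checkSamePlan(expected: String, actual: String): Unit =
        assert(comparePlans(expected, actual), s"plans differ:\n$expected\n$actual")
    }

    object PlanTestLikeDemo extends App {
      new QueryTestLike().checkSamePlan("Project [a]", "Project [a] ")
    }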
[GitHub] spark pull request: [SPARK-4026][Streaming] Write ahead log manage...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2882#issuecomment-60274484 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22075/consoleFull) for PR 2882 at commit [`d29fddd`](https://github.com/apache/spark/commit/d29fddd880fd7efec8ed05017a12600bcb2aa829). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3883 SSL support for HttpServer and Akka
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2739#issuecomment-60275209

Hi @jacek-lewandowski,

Now that I finally noticed you built this on top of branch-1.1, some of the choices you made make a lot more sense. (I always assume people are working on master, since it's generally preferable to add new features to master first.)

One huge difference in master, which led to a lot of my comments, is SPARK-2098. That fix added the ability of all daemons - including Master and Worker - to read the spark-defaults.conf file. So, if you build on top of that, you need zero code dealing with loading config data, and can rely on SparkConf for everything. Then, you could have something like:

    class SSLOptions(conf: SparkConf, module: String)

That would load options like this:

    sslEnabled = conf.getOption(s"spark.$module.ssl.enabled")
      .orElse(conf.getOption("spark.ssl.enabled"))
      .getOrElse(false)

Then you have module-specific configuration and a global fallback. What do you think?

On the subject of distributing the configuration, I think it's sort of OK to rely on that, for the time being, for standalone mode. Long term, it would be better to allow each job to distribute its own configuration, so that it's easy for admins and users to use different certificates for the daemons and for the jobs, for example.

On YARN, I still believe we should not have this requirement - since when using Spark-on-YARN, Spark is kind of a client-side thing and shouldn't require any changes in the cluster. The needed files should be distributed automatically by Spark and made available to executors. That should be doable by disabling certificate validation (so that the hostnames don't matter) or using wildcard certificates (assuming everything is in the same sub-domain). If that's not enough to cover all use cases, we can leave other enhancements for later.

I'm not familiar enough with Mesos to be able to suggest anything.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
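[Editor's note] A small, self-contained sketch of the lookup pattern vanzin describes: a module-specific setting is consulted first, with the global setting as fallback. `Conf` below is a simplified stand-in for SparkConf so the snippet runs on its own; this is not the PR's actual implementation.

    // Simplified stand-in for SparkConf, just enough to show the lookup pattern.
    class Conf(settings: Map[String, String]) {
      def getOption(key: String): Option[String] = settings.get(key)
    }

    // Module-specific "spark.<module>.ssl.*" key first, then the global "spark.ssl.*" fallback.
    class SSLOptions(conf: Conf, module: String) {
      val sslEnabled: Boolean =
        conf.getOption(s"spark.$module.ssl.enabled")
          .orElse(conf.getOption("spark.ssl.enabled"))
          .map(_.toBoolean)
          .getOrElse(false)
    }

    object SSLOptionsDemo extends App {
      val conf = new Conf(Map(
        "spark.akka.ssl.enabled" -> "true",
        "spark.ssl.enabled"      -> "false"))
      println(new SSLOptions(conf, "akka").sslEnabled)        // true: module-specific wins
      println(new SSLOptions(conf, "fileserver").sslEnabled)  // false: falls back to the global key
    }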
[GitHub] spark pull request: SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2909#issuecomment-60276342 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22074/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2909#issuecomment-60276331 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22074/consoleFull) for PR 2909 at commit [`f62c347`](https://github.com/apache/spark/commit/f62c347e2df9d7e63653c2bf42004e86f7a80b27). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4065] Add check for IPython on Windows
GitHub user msjgriffiths opened a pull request: https://github.com/apache/spark/pull/2910

[SPARK-4065] Add check for IPython on Windows

This change employs logic similar to the bash launcher (pyspark) to check if IPYTHON=1, and if so launch ipython with the options in IPYTHON_OPTS. This fix assumes that ipython is available on the system Path and can be invoked with a plain ipython command.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/msjgriffiths/spark pyspark-windows

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2910.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2910

commit f076d3b0c4de62001be449c5ce22cae399bf6bde
Author: Michael Griffiths msjgriffi...@gmail.com
Date: 2014-10-23T17:45:13Z

    [SPARK-4065] Add check for IPython on Windows

    This change employs logic similar to the bash launcher (pyspark) to check if IPYTHON=1, and if so launch ipython with the options in IPYTHON_OPTS. This fix assumes that ipython is available on the system Path and can be invoked with a plain ipython command.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4065] Add check for IPython on Windows
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2910#issuecomment-60278572 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2886#discussion_r19294252

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -325,22 +325,23 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus
   private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
     val time = System.currentTimeMillis()
+
     if (!blockManagerInfo.contains(id)) {
       blockManagerIdByExecutor.get(id.executorId) match {
         case Some(manager) =>
-          // A block manager of the same executor already exists.
-          // This should never happen. Let's just quit.
-          logError("Got two different block manager registrations on " + id.executorId)
-          System.exit(1)
+          // A block manager of the same executor already exists so remove it (assumed dead).
--- End diff --

Actually, what I meant was to add a comma between "exists" and "so". It's OK, I can fix this myself when I merge it.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org