[GitHub] spark pull request: [SPARK-3481][SQL] removes the evil MINOR HACK
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2505#issuecomment-56480427 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20690/consoleFull) for PR 2505 at commit [`4874ec8`](https://github.com/apache/spark/commit/4874ec83912e1b885b4b7e4cdbc0dfbdf5c83a45). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56480392 Are we proposing to introduce HDFS caching tags/idioms directly into TaskSetManager in this PR? That does not look right. We need to generalize this so that any RDD can specify process/host (maybe rack too?) annotations. Once done, HadoopRDD can leverage that. Depending on an underscore not being in the name, etc., is fragile. One option would be to define our own URIs, with the default reverting to host only.
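A minimal sketch of the kind of explicit locality annotation mridulm suggests, instead of fragile naming conventions. The scheme name and parsing rules here are purely illustrative assumptions, not Spark's actual API:

```python
# Hypothetical locality-annotation parser: a preferred location carries an
# explicit scheme (e.g. "hdfs-cache://host"), and a bare string defaults to
# plain host-level locality -- the "default reverting to host only" idea.
def parse_location(loc):
    if "://" in loc:
        scheme, _, rest = loc.partition("://")
        return scheme, rest
    return "host", loc  # default: host-level locality only

print(parse_location("hdfs-cache://node1.example.com"))
print(parse_location("node2.example.com"))
```

This avoids inferring meaning from incidental properties of hostnames, such as whether they contain an underscore.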
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2485#issuecomment-56480298 This looks good to me.
[GitHub] spark pull request: [SPARK-3481][SQL] removes the evil MINOR HACK
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2505#issuecomment-56480117 ok to test
[GitHub] spark pull request: [SPARK-3481][SQL] removes the evil MINOR HACK
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2505#issuecomment-56480056 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3481][SQL] removes the evil MINOR HACK
GitHub user scwf opened a pull request: https://github.com/apache/spark/pull/2505 [SPARK-3481][SQL] removes the evil MINOR HACK A follow-up of https://github.com/apache/spark/pull/2377 and https://github.com/apache/spark/pull/2352; see details there. You can merge this pull request into a Git repository by running: $ git pull https://github.com/scwf/spark patch-6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2505.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2505 commit 4874ec83912e1b885b4b7e4cdbc0dfbdf5c83a45 Author: wangfei Date: 2014-09-23T06:07:48Z removes the evil MINOR HACK
[GitHub] spark pull request: SPARK-3172 and SPARK-3577
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2504#issuecomment-56479827 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20689/consoleFull) for PR 2504 at commit [`c854514`](https://github.com/apache/spark/commit/c854514d81b4830ce1f1109662a713c51e6c8023). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class WriteMetrics extends Serializable `
[GitHub] spark pull request: SPARK-3172 and SPARK-3577
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2504#issuecomment-56479829 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20689/
[GitHub] spark pull request: SPARK-3172 and SPARK-3577
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2504#issuecomment-56479669 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20689/consoleFull) for PR 2504 at commit [`c854514`](https://github.com/apache/spark/commit/c854514d81b4830ce1f1109662a713c51e6c8023). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3172 and SPARK-3577
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/2504 SPARK-3172 and SPARK-3577 The posted patch addresses both SPARK-3172 and SPARK-3577. It renames ShuffleWriteMetrics to WriteMetrics and uses it for tracking all three of shuffle write, spilling on the fetch side, and spilling on the write side (which only occurs during sort-based shuffle). I'll fix and add tests if people think restructuring the metrics in this way makes sense. I'm a little unsure about the name shuffleReadSpillMetrics, as spilling happens during aggregation, not read, but I had trouble coming up with something better. I'm also unsure about what the most useful columns would be to display in the UI - I remember some pushback on adding new columns. Ultimately these metrics will be most helpful if they can inform users whether and how much they need to increase the number of partitions / increase spark.shuffle.memoryFraction. Reporting spill time informs users whether spilling is significantly impacting performance. Reporting memory size can help with understanding how much needs to be done to avoid spilling. @pwendell any thoughts on this? You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark-3172 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2504.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2504 commit c854514d81b4830ce1f1109662a713c51e6c8023 Author: Sandy Ryza Date: 2014-09-23T05:58:18Z SPARK-3172 and SPARK-3577
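A rough Python sketch of the metrics restructuring the PR describes: one reusable WriteMetrics record shared by shuffle writes and both kinds of spill. The field and class names here are illustrative guesses, not the actual Spark implementation:

```python
from dataclasses import dataclass, field

# One generic write-metrics record, reused wherever bytes are written.
@dataclass
class WriteMetrics:
    bytes_written: int = 0
    write_time_ns: int = 0

# Per-task metrics: the same WriteMetrics shape tracks the shuffle write,
# spilling on the fetch/aggregation side, and spilling on the write side
# (the last only occurs during sort-based shuffle).
@dataclass
class TaskMetrics:
    shuffle_write: WriteMetrics = field(default_factory=WriteMetrics)
    shuffle_read_spill: WriteMetrics = field(default_factory=WriteMetrics)
    shuffle_write_spill: WriteMetrics = field(default_factory=WriteMetrics)

m = TaskMetrics()
m.shuffle_read_spill.bytes_written += 4096  # record a spill during aggregation
print(m.shuffle_read_spill.bytes_written)
```

Sharing one record type is what makes it possible to report spill time and spill size uniformly, which is the information users need to decide whether to raise the partition count or spark.shuffle.memoryFraction.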
[GitHub] spark pull request: [SPARK-3649] Remove GraphX custom serializers
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2503#issuecomment-56478252 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20688/consoleFull) for PR 2503 at commit [`a49c2ad`](https://github.com/apache/spark/commit/a49c2ad67f2bf79ae10e9ef696605c64b0c0ed97). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3649] Remove GraphX custom serializers
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2503#issuecomment-56478260 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20688/
[GitHub] spark pull request: add a util method for changing the log level w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2433#issuecomment-56478099 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20687/consoleFull) for PR 2433 at commit [`cdb3bfc`](https://github.com/apache/spark/commit/cdb3bfc1ab74d3b2c3dfec38dc23118bc05ed922). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2017] [SPARK-2016] Web UI responsivenes...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1682#issuecomment-56478100 I've opened [SPARK-3644](https://issues.apache.org/jira/browse/SPARK-3644) as a forum for discussing the design of a REST API; sorry for the delay (got busy with other work / bug fixing).
[GitHub] spark pull request: add a util method for changing the log level w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2433#issuecomment-56478104 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20687/
[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2492#issuecomment-56477603 > Understood, this side-effect is bit dangerous. The third-package could appear in sys.path in any order Are you worried about a user adding a Python module whose name conflicts with a built-in module, thereby shadowing it? I think this is a general Python problem that can occur even without `sys.path` manipulation, which is why it's bad to have top-level modules that have the same name as built-in ones (and also why relative imports can be bad): http://www.evanjones.ca/python-name-clashes.html
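The shadowing problem JoshRosen describes can be reproduced in a few lines of Python; the module name `secrets` here is chosen only for illustration (any standard-library name works):

```python
import os
import sys
import tempfile

# Create a directory containing a module named "secrets", which clashes
# with the standard-library module of the same name.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "secrets.py"), "w") as f:
    f.write("SHADOWED = True\n")

# Prepending to sys.path makes the local file win over the stdlib copy.
sys.path.insert(0, tmp)
sys.modules.pop("secrets", None)  # forget any cached import
import secrets

print(getattr(secrets, "SHADOWED", False))  # the stdlib module was shadowed
```

The same clash happens whenever a user's top-level module name collides with a built-in one, regardless of who inserted the `sys.path` entry, which is the point being made above.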
[GitHub] spark pull request: [SPARK-3649] Remove GraphX custom serializers
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2503#issuecomment-56476058 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20688/consoleFull) for PR 2503 at commit [`a49c2ad`](https://github.com/apache/spark/commit/a49c2ad67f2bf79ae10e9ef696605c64b0c0ed97). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3649] Remove GraphX custom serializers
GitHub user ankurdave opened a pull request: https://github.com/apache/spark/pull/2503 [SPARK-3649] Remove GraphX custom serializers As [reported][1] on the mailing list, GraphX throws

```
java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
	at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
	at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)
```

when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption][2]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. Instead, this PR simply removes the custom serializers. This causes a 10% slowdown for PageRank (494 s to 543 s, PageRank, 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).
[1]: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501 [2]: https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ankurdave/spark SPARK-3649 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2503.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2503 commit a49c2ad67f2bf79ae10e9ef696605c64b0c0ed97 Author: Ankur Dave Date: 2014-09-22T22:05:30Z [SPARK-3649] Remove GraphX custom serializers
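A conceptual Python sketch of the assumption violation described in this PR (not Spark's actual code): a serializer written for whole (key, value) pairs fails when the spill path hands it the key by itself, analogous to the Long-cannot-be-cast-to-Tuple2 error above:

```python
import io
import struct

# Toy "pair serializer" that, like GraphX's custom serializers, assumes it
# always receives a (key, value) tuple and encodes both fields together.
def write_pair(buf, obj):
    key, value = obj  # fails if handed a bare key instead of a pair
    buf.write(struct.pack("<qq", key, value))

buf = io.BytesIO()
write_pair(buf, (42, 7))  # normal shuffle path: whole pair, works fine

# The spill path in sort-based shuffle writes the key and the value
# separately, violating the serializer's assumption.
try:
    write_pair(buf, 42)  # bare key, not a pair
except TypeError as e:
    print("serializer assumption violated:", e)
```

Handling both cases would require the serializer to tag each record with what it contains (the "tag byte" the PR mentions), which is why simply removing the custom serializers is the simpler fix.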
[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2388#issuecomment-56475771 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20686/
[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2388#issuecomment-56475768 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20686/consoleFull) for PR 2388 at commit [`bf84e7b`](https://github.com/apache/spark/commit/bf84e7b87306dbe453077727be4a94fec40da417). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`
[GitHub] spark pull request: add a util method for changing the log level w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2433#issuecomment-56475233 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20687/consoleFull) for PR 2433 at commit [`cdb3bfc`](https://github.com/apache/spark/commit/cdb3bfc1ab74d3b2c3dfec38dc23118bc05ed922). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56473764 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20684/consoleFull) for PR 1290 at commit [`a28aa4a`](https://github.com/apache/spark/commit/a28aa4a7c91b402c95f81aaad254661cdf06607d). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OutputCanvas2D(wd: Int, ht: Int) extends Canvas ` * `class OutputFrame2D( title: String ) extends Frame( title ) ` * `class OutputCanvas3D(wd: Int, ht: Int, shadowFrac: Double) extends Canvas ` * `class OutputFrame3D(title: String, shadowFrac: Double) extends Frame(title) `
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56473769 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20684/
[GitHub] spark pull request: [SPARK-3007][SQL]Add "Dynamic Partition" suppo...
Github user baishuo commented on the pull request: https://github.com/apache/spark/pull/2226#issuecomment-56473456 thanks a lot to @liancheng :)
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56473120 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20683/
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56473117 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20683/consoleFull) for PR 1290 at commit [`b3531d6`](https://github.com/apache/spark/commit/b3531d68dc36832115fd721a1a2efc0f99851661). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2388#issuecomment-56472988 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20686/consoleFull) for PR 2388 at commit [`bf84e7b`](https://github.com/apache/spark/commit/bf84e7b87306dbe453077727be4a94fec40da417). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-3212][SQL] Use logical plan matchi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2501#issuecomment-56472896 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20685/
[GitHub] spark pull request: [WIP][SPARK-3212][SQL] Use logical plan matchi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2501#issuecomment-56472893 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20685/consoleFull) for PR 2501 at commit [`80f26ac`](https://github.com/apache/spark/commit/80f26acffa8e234434fb8e080c499e6cae9fe6e4). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LogicalRDD(output: Seq[Attribute], rdd: RDD[Row])(sqlContext: SQLContext)` * `case class PhysicalRDD(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode ` * `case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode ` * `case class SparkLogicalPlan(alreadyPlanned: SparkPlan)(@transient sqlContext: SQLContext)`
[GitHub] spark pull request: [SPARK-3007][SQL]Add "Dynamic Partition" suppo...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/2226#issuecomment-56472453 LGTM @marmbrus This is finally good to go :)
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56472123 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20680/consoleFull) for PR 2497 at commit [`b3b3e50`](https://github.com/apache/spark/commit/b3b3e50a0c13df08a607f036592f83e566cded39). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56472127 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20680/
[GitHub] spark pull request: [WIP][SPARK-3212][SQL] Use logical plan matchi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2501#issuecomment-56471636 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20685/consoleFull) for PR 2501 at commit [`80f26ac`](https://github.com/apache/spark/commit/80f26acffa8e234434fb8e080c499e6cae9fe6e4). * This patch merges cleanly.
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2502#issuecomment-56471527 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1720][SPARK-1719] Add the value of LD_L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1031#issuecomment-56471514 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20681/consoleFull) for PR 1031 at commit [`f44c221`](https://github.com/apache/spark/commit/f44c221aceb2f246eec335f4b7a1cd6f0c2b0080). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-1720][SPARK-1719] Add the value of LD_L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1031#issuecomment-56471518 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20681/
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user ceys opened a pull request: https://github.com/apache/spark/pull/2502 Merge pull request #1 from apache/master Update from original You can merge this pull request into a Git repository by running: $ git pull https://github.com/ceys/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2502.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2502 commit 3fc6550d41409f26e6c54bac1914ed5cbf80c879 Author: ceys Date: 2014-09-23T02:56:32Z Merge pull request #1 from apache/master Update from original
[GitHub] spark pull request: [WIP][SPARK-3212][SQL] Use logical plan matchi...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/2501 [WIP][SPARK-3212][SQL] Use logical plan matching instead of temporary tables for table caching _Also addresses: SPARK-1379 and SPARK-3641_ This PR introduces a new trait, `CacheManager`, which replaces the previous temporary-table-based caching system. Instead of creating a temporary table, which shadows an existing table but provides a cached representation, the cache manager maintains a separate list of cached data. After optimization, this list is searched for any matching plan fragments. When a matching plan fragment is found, it is replaced with the cached data. There are several advantages to this approach: - Calling .cache() on a SchemaRDD now works as you would expect, and uses the more efficient columnar representation. - It's now possible to provide a list of temporary tables, without having to decide whether a given table is actually just a cached persistent table. (To be done in a follow-up PR) - In some cases cached data may be used even if a cached table was not explicitly requested, because we now look at the logical structure instead of the table name. TODO: - [ ] Finish cleanup of caching-specific pattern matching code - [ ] More test cases for `sameResult` function You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark caching Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2501.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2501 commit 80f26acffa8e234434fb8e080c499e6cae9fe6e4 Author: Michael Armbrust Date: 2014-09-23T02:41:57Z First draft of improved semantics for Spark SQL caching.
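The plan-fragment substitution described in the PR above can be sketched without any Spark dependency. This is a minimal, illustrative model only: the type names (`Scan`, `InMemoryRelation`) and methods (`cacheQuery`, `useCachedData`) mirror the idea, not Spark's actual API, and structural equality stands in for the real `sameResult` check.

```scala
// Toy logical plans: structural equality stands in for Spark's sameResult.
sealed trait LogicalPlan {
  def children: Seq[LogicalPlan]
  def sameResult(other: LogicalPlan): Boolean = this == other
}
case class Scan(table: String) extends LogicalPlan {
  val children: Seq[LogicalPlan] = Nil
}
case class Filter(cond: String, child: LogicalPlan) extends LogicalPlan {
  val children: Seq[LogicalPlan] = Seq(child)
}
// Marker wrapping a plan whose result is held in cached (columnar) form.
case class InMemoryRelation(plan: LogicalPlan) extends LogicalPlan {
  val children: Seq[LogicalPlan] = Nil
}

class CacheManager {
  // Separate list of cached plans, instead of shadowing temporary tables.
  private var cached: List[LogicalPlan] = Nil

  def cacheQuery(plan: LogicalPlan): Unit = cached = plan :: cached

  // After optimization, walk the plan and swap in cached data wherever a
  // fragment matches a cached plan.
  def useCachedData(plan: LogicalPlan): LogicalPlan =
    cached.find(_.sameResult(plan)) match {
      case Some(hit) => InMemoryRelation(hit)
      case None =>
        plan match {
          case Filter(c, child) => Filter(c, useCachedData(child))
          case other            => other
        }
    }
}
```

With this sketch, caching `Scan("t")` and then planning `Filter("x > 1", Scan("t"))` substitutes `InMemoryRelation` for the scan, even though the query never named a cached table explicitly.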
[GitHub] spark pull request: Adds json api for stages, storage and executor...
Github user praveenr019 closed the pull request at: https://github.com/apache/spark/pull/882
[GitHub] spark pull request: Adds json api for stages, storage and executor...
Github user praveenr019 commented on the pull request: https://github.com/apache/spark/pull/882#issuecomment-56470825 Closing this pull request since it's committed on an old branch. Thanks @JoshRosen, would be glad to see this feature in Spark.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56470743 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20684/consoleFull) for PR 1290 at commit [`a28aa4a`](https://github.com/apache/spark/commit/a28aa4a7c91b402c95f81aaad254661cdf06607d). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17889907 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala --- @@ -0,0 +1,430 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.Logging +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Algo._ +import org.apache.spark.mllib.tree.configuration.QuantileStrategy._ +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.impl.{BaggedPoint, TreePoint, DecisionTreeMetadata, TimeTracker} +import org.apache.spark.mllib.tree.impurity.Impurities +import org.apache.spark.mllib.tree.model._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.util.Utils + +/** + * :: Experimental :: + * A class which implements a random forest learning algorithm for classification and regression. 
+ * It supports both continuous and categorical features. + * + * @param strategy The configuration parameters for the random forest algorithm which specify + * the type of algorithm (classification, regression, etc.), feature type + * (continuous, categorical), depth of the tree, quantile calculation strategy, + * etc. + * @param numTrees If 1, then no bootstrapping is used. If > 1, then bootstrapping is done. + * @param featureSubsetStrategy Number of features to consider for splits at each node. + * Supported: "auto" (default), "all", "sqrt", "log2", "onethird". + * If "auto" is set, this parameter is set based on numTrees: + * if numTrees == 1, then featureSubsetStrategy = "all"; + * if numTrees > 1, then featureSubsetStrategy = "sqrt". + * @param seed Random seed for bootstrapping and choosing feature subsets. + */ +@Experimental +private class RandomForest ( +private val strategy: Strategy, +private val numTrees: Int, +featureSubsetStrategy: String, +private val seed: Int) + extends Serializable with Logging { + + strategy.assertValid() + require(numTrees > 0, s"RandomForest requires numTrees > 0, but was given numTrees = $numTrees.") + require(RandomForest.supportedFeatureSubsetStrategies.contains(featureSubsetStrategy), +s"RandomForest given invalid featureSubsetStrategy: $featureSubsetStrategy." 
+ +s" Supported values: ${RandomForest.supportedFeatureSubsetStrategies.mkString(", ")}.") + + /** + * Method to train a decision tree model over an RDD + * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] + * @return RandomForestModel that can be used for prediction + */ + def train(input: RDD[LabeledPoint]): RandomForestModel = { + +val timer = new TimeTracker() + +timer.start("total") + +timer.start("init") + +val retaggedInput = input.retag(classOf[LabeledPoint]) +val metadata = + DecisionTreeMetadata.buildMetadata(retaggedInput, strategy, numTrees, featureSubsetStrategy) +logDebug("algo = " + strategy.algo) +logDebug("numTrees = " + numTrees) +logDebug("seed = " + seed) +logDebug("maxBins = " + metadata.maxBins) +logDebug("featureSubsetStrategy = " + featureSubsetStrategy) +logDebug("numFeaturesPerNode = " + metadata.numFeaturesPerNode) + +// Find the splits and the corresponding bins (interval between the splits) using a sample +// of the input data. +timer.start("findSpli
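The `featureSubsetStrategy` scaladoc quoted in the diff above encodes a small decision rule ("auto" resolves to "all" for a single tree, "sqrt" for a forest). A sketch of that rule follows; the helper name and the per-strategy feature counts for "sqrt", "log2", and "onethird" are assumptions for illustration, not Spark's internal API.

```scala
// Illustrative restatement of the "auto" rule from the RandomForest docs;
// helper name and rounding choices are assumptions, not Spark internals.
def numFeaturesPerNode(featureSubsetStrategy: String,
                       numTrees: Int,
                       numFeatures: Int): Int = {
  // "auto": a single tree considers all features; a forest subsamples.
  val resolved = featureSubsetStrategy match {
    case "auto" => if (numTrees == 1) "all" else "sqrt"
    case other  => other
  }
  resolved match {
    case "all"      => numFeatures
    case "sqrt"     => math.sqrt(numFeatures).ceil.toInt
    case "log2"     => (math.log(numFeatures) / math.log(2)).ceil.toInt
    case "onethird" => (numFeatures / 3.0).ceil.toInt
  }
}
```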
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56470414 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20678/
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56470411 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20678/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: stop, start and destroy require the EC2_REGION
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/2473#discussion_r17889871 --- Diff: docs/ec2-scripts.md --- @@ -137,11 +146,11 @@ cost you any EC2 cycles, but ***will*** continue to cost money for EBS storage. - To stop one of your clusters, go into the `ec2` directory and run -`./spark-ec2 stop <cluster-name>`. +`./spark-ec2 --region=<ec2-region> stop <cluster-name>`. - To restart it later, run -`./spark-ec2 -i <key-file> start <cluster-name>`. +`./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>`. - To ultimately destroy the cluster and stop consuming EBS space, run -`./spark-ec2 destroy <cluster-name>` as described in the previous +`./spark-ec2 --region=<ec2-region> destroy <cluster-name>` as described in the previous --- End diff -- Ah, right. It's set as the default.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56470205 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20682/
[GitHub] spark pull request: stop, start and destroy require the EC2_REGION
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/2473#discussion_r17889800 --- Diff: docs/ec2-scripts.md --- @@ -137,11 +146,11 @@ cost you any EC2 cycles, but ***will*** continue to cost money for EBS storage. - To stop one of your clusters, go into the `ec2` directory and run -`./spark-ec2 stop <cluster-name>`. +`./spark-ec2 --region=<ec2-region> stop <cluster-name>`. - To restart it later, run -`./spark-ec2 -i <key-file> start <cluster-name>`. +`./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>`. - To ultimately destroy the cluster and stop consuming EBS space, run -`./spark-ec2 destroy <cluster-name>` as described in the previous +`./spark-ec2 --region=<ec2-region> destroy <cluster-name>` as described in the previous --- End diff -- it does require it unless the region is us-east.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-56470085 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20683/consoleFull) for PR 1290 at commit [`b3531d6`](https://github.com/apache/spark/commit/b3531d68dc36832115fd721a1a2efc0f99851661). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3007][SQL]Add "Dynamic Partition" suppo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2226#issuecomment-56470082 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20679/consoleFull) for PR 2226 at commit [`e69ce88`](https://github.com/apache/spark/commit/e69ce883ee9d337a81d4aae3a63943937f771e84). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3007][SQL]Add "Dynamic Partition" suppo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2226#issuecomment-56470087 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20679/
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17889650 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala --- @@ -0,0 +1,430 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.Logging +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Algo._ +import org.apache.spark.mllib.tree.configuration.QuantileStrategy._ +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.impl.{BaggedPoint, TreePoint, DecisionTreeMetadata, TimeTracker} +import org.apache.spark.mllib.tree.impurity.Impurities +import org.apache.spark.mllib.tree.model._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.util.Utils + +/** + * :: Experimental :: + * A class which implements a random forest learning algorithm for classification and regression. 
+ * It supports both continuous and categorical features. + * + * @param strategy The configuration parameters for the random forest algorithm which specify + * the type of algorithm (classification, regression, etc.), feature type + * (continuous, categorical), depth of the tree, quantile calculation strategy, + * etc. + * @param numTrees If 1, then no bootstrapping is used. If > 1, then bootstrapping is done. + * @param featureSubsetStrategy Number of features to consider for splits at each node. + * Supported: "auto" (default), "all", "sqrt", "log2", "onethird". + * If "auto" is set, this parameter is set based on numTrees: + * if numTrees == 1, then featureSubsetStrategy = "all"; + * if numTrees > 1, then featureSubsetStrategy = "sqrt". + * @param seed Random seed for bootstrapping and choosing feature subsets. + */ +@Experimental +private class RandomForest ( +private val strategy: Strategy, +private val numTrees: Int, +featureSubsetStrategy: String, +private val seed: Int) + extends Serializable with Logging { + + strategy.assertValid() + require(numTrees > 0, s"RandomForest requires numTrees > 0, but was given numTrees = $numTrees.") + require(RandomForest.supportedFeatureSubsetStrategies.contains(featureSubsetStrategy), +s"RandomForest given invalid featureSubsetStrategy: $featureSubsetStrategy." 
+ +s" Supported values: ${RandomForest.supportedFeatureSubsetStrategies.mkString(", ")}.") + + /** + * Method to train a decision tree model over an RDD + * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] + * @return RandomForestModel that can be used for prediction + */ + def train(input: RDD[LabeledPoint]): RandomForestModel = { + +val timer = new TimeTracker() + +timer.start("total") + +timer.start("init") + +val retaggedInput = input.retag(classOf[LabeledPoint]) +val metadata = + DecisionTreeMetadata.buildMetadata(retaggedInput, strategy, numTrees, featureSubsetStrategy) +logDebug("algo = " + strategy.algo) +logDebug("numTrees = " + numTrees) +logDebug("seed = " + seed) +logDebug("maxBins = " + metadata.maxBins) +logDebug("featureSubsetStrategy = " + featureSubsetStrategy) +logDebug("numFeaturesPerNode = " + metadata.numFeaturesPerNode) + +// Find the splits and the corresponding bins (interval between the splits) using a sample +// of the input data. +timer.start("findSpli
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17889641 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala --- @@ -0,0 +1,430 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.Logging +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Algo._ +import org.apache.spark.mllib.tree.configuration.QuantileStrategy._ +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.impl.{BaggedPoint, TreePoint, DecisionTreeMetadata, TimeTracker} +import org.apache.spark.mllib.tree.impurity.Impurities +import org.apache.spark.mllib.tree.model._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.util.Utils + +/** + * :: Experimental :: + * A class which implements a random forest learning algorithm for classification and regression. 
+ * It supports both continuous and categorical features. + * + * @param strategy The configuration parameters for the random forest algorithm which specify + * the type of algorithm (classification, regression, etc.), feature type + * (continuous, categorical), depth of the tree, quantile calculation strategy, + * etc. + * @param numTrees If 1, then no bootstrapping is used. If > 1, then bootstrapping is done. + * @param featureSubsetStrategy Number of features to consider for splits at each node. + * Supported: "auto" (default), "all", "sqrt", "log2", "onethird". + * If "auto" is set, this parameter is set based on numTrees: + * if numTrees == 1, then featureSubsetStrategy = "all"; + * if numTrees > 1, then featureSubsetStrategy = "sqrt". + * @param seed Random seed for bootstrapping and choosing feature subsets. + */ +@Experimental +private class RandomForest ( +private val strategy: Strategy, +private val numTrees: Int, +featureSubsetStrategy: String, +private val seed: Int) + extends Serializable with Logging { + + strategy.assertValid() + require(numTrees > 0, s"RandomForest requires numTrees > 0, but was given numTrees = $numTrees.") + require(RandomForest.supportedFeatureSubsetStrategies.contains(featureSubsetStrategy), +s"RandomForest given invalid featureSubsetStrategy: $featureSubsetStrategy." 
+ +s" Supported values: ${RandomForest.supportedFeatureSubsetStrategies.mkString(", ")}.") + + /** + * Method to train a decision tree model over an RDD + * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] + * @return RandomForestModel that can be used for prediction + */ + def train(input: RDD[LabeledPoint]): RandomForestModel = { + +val timer = new TimeTracker() + +timer.start("total") + +timer.start("init") + +val retaggedInput = input.retag(classOf[LabeledPoint]) +val metadata = + DecisionTreeMetadata.buildMetadata(retaggedInput, strategy, numTrees, featureSubsetStrategy) +logDebug("algo = " + strategy.algo) +logDebug("numTrees = " + numTrees) +logDebug("seed = " + seed) +logDebug("maxBins = " + metadata.maxBins) +logDebug("featureSubsetStrategy = " + featureSubsetStrategy) +logDebug("numFeaturesPerNode = " + metadata.numFeaturesPerNode) + +// Find the splits and the corresponding bins (interval between the splits) using a sample +// of the input data. +timer.start("findSpli
[GitHub] spark pull request: [SPARK-3653] Respect SPARK_*_MEMORY for cluste...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2500#issuecomment-56469714 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20677/consoleFull) for PR 2500 at commit [`6217b38`](https://github.com/apache/spark/commit/6217b38e5a71e4ef98b82a2968b8da7df5df94a1). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3653] Respect SPARK_*_MEMORY for cluste...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2500#issuecomment-56469716 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20677/
[GitHub] spark pull request: stop, start and destroy require the EC2_REGION
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/2473#discussion_r17889468 --- Diff: docs/ec2-scripts.md --- @@ -48,6 +48,15 @@ by looking for the "Name" tag of the instance in the Amazon EC2 Console. key pair, `<num-slaves>` is the number of slave nodes to launch (try 1 at first), and `<cluster-name>` is the name to give to your cluster. + +For Example: --- End diff -- Minor nit: "For example:" (lower case "E")
[GitHub] spark pull request: stop, start and destroy require the EC2_REGION
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/2473#discussion_r17889459 --- Diff: docs/ec2-scripts.md --- @@ -137,11 +146,11 @@ cost you any EC2 cycles, but ***will*** continue to cost money for EBS storage. - To stop one of your clusters, go into the `ec2` directory and run -`./spark-ec2 stop `. +`./spark-ec2 --region= stop `. - To restart it later, run -`./spark-ec2 -i start `. +`./spark-ec2 -i --region= start `. - To ultimately destroy the cluster and stop consuming EBS space, run -`./spark-ec2 destroy ` as described in the previous +`./spark-ec2 --region= destroy ` as described in the previous --- End diff -- Hmm, are you sure `destroy` requires `ec2-region`? I've been successfully destroying EC2 clusters without specifying it.
[GitHub] spark pull request: [SPARK-1720][SPARK-1719] Add the value of LD_L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1031#issuecomment-56468425 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20681/consoleFull) for PR 1031 at commit [`f44c221`](https://github.com/apache/spark/commit/f44c221aceb2f246eec335f4b7a1cd6f0c2b0080). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56468434 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20680/consoleFull) for PR 2497 at commit [`b3b3e50`](https://github.com/apache/spark/commit/b3b3e50a0c13df08a607f036592f83e566cded39). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56468116 retest this please
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2469#issuecomment-56468138 Thanks for fixing this @vanzin. I will look at it shortly.
[GitHub] spark pull request: [SPARK-3652] [SQL] upgrade spark sql hive vers...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2499#issuecomment-56468081 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/142/consoleFull) for PR 2499 at commit [`6d5d071`](https://github.com/apache/spark/commit/6d5d0710eb2ab1c14208deb158c2f4b018ddbf33). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3652] [SQL] upgrade spark sql hive vers...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2499#issuecomment-56468023 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/142/consoleFull) for PR 2499 at commit [`6d5d071`](https://github.com/apache/spark/commit/6d5d0710eb2ab1c14208deb158c2f4b018ddbf33). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56468001 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20676/
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56467997 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20676/consoleFull) for PR 2494 at commit [`1801fd2`](https://github.com/apache/spark/commit/1801fd2e9518c610b3657c6a9cb9239fedd43847). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class IDF(val minimumOccurence: Long) ` * ` class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable `
[GitHub] spark pull request: [SPARK-3007][SQL]Add "Dynamic Partition" suppo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2226#issuecomment-56467600 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20679/consoleFull) for PR 2226 at commit [`e69ce88`](https://github.com/apache/spark/commit/e69ce883ee9d337a81d4aae3a63943937f771e84). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56467574 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20673/consoleFull) for PR 2497 at commit [`b3b3e50`](https://github.com/apache/spark/commit/b3b3e50a0c13df08a607f036592f83e566cded39). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56467583 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20673/
[GitHub] spark pull request: [SPARK-3007][SQL]Add "Dynamic Partition" suppo...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/2226#discussion_r1720 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala --- @@ -522,6 +523,52 @@ class HiveQuerySuite extends HiveComparisonTest { case class LogEntry(filename: String, message: String) case class LogFile(name: String) + createQueryTest("dynamic_partition", +""" + |DROP TABLE IF EXISTS dynamic_part_table; + |CREATE TABLE dynamic_part_table(intcol INT) PARTITIONED BY (partcol1 INT, partcol2 INT); + | + |SET hive.exec.dynamic.partition.mode=nonstrict; + | + |INSERT INTO TABLE dynamic_part_table PARTITION(partcol1, partcol2) + |SELECT 1, 1, 1 FROM src WHERE key=150; + | + |INSERT INTO TABLE dynamic_part_table PARTITION(partcol1, partcol2) + |SELECT 1, NULL, 1 FROM src WHERE key=150; + | + |INSERT INTO TABLE dynamic_part_table PARTITION(partcol1, partcol2) + |SELECT 1, 1, NULL FROM src WHERE key=150; + | + |INSERT INTO TABLe dynamic_part_table PARTITION(partcol1, partcol2) + |SELECT 1, NULL, NULL FROM src WHERE key=150; + | + |DROP TABLE IF EXISTS dynamic_part_table; +""".stripMargin) --- End diff -- Added a test to validate the dynamic partitioning folder layout by loading each partition from its specific partition folder and checking the contents.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56467045 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20678/consoleFull) for PR 1486 at commit [`9c4933c`](https://github.com/apache/spark/commit/9c4933c6e18db8bf2e0cbd0deb85b46c2ca0d2b2). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3653] Respect SPARK_*_MEMORY for cluste...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2500#issuecomment-56466221 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20677/consoleFull) for PR 2500 at commit [`6217b38`](https://github.com/apache/spark/commit/6217b38e5a71e4ef98b82a2968b8da7df5df94a1). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3653] Respect SPARK_*_MEMORY for cluste...
GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/2500 [SPARK-3653] Respect SPARK_*_MEMORY for cluster mode `SPARK_DRIVER_MEMORY` was only used to start the `SparkSubmit` JVM, which becomes the driver only in client mode but not cluster mode. In cluster mode, this property is simply not propagated to the worker nodes. `SPARK_EXECUTOR_MEMORY` is picked up from `SparkContext`, but in cluster mode the driver runs on one of the worker machines, where this environment variable may not be set. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewor14/spark memory-env-vars Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2500.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2500 commit 6217b38e5a71e4ef98b82a2968b8da7df5df94a1 Author: Andrew Or Date: 2014-09-23T01:06:23Z Respect SPARK_*_MEMORY for cluster mode
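The resolution order the PR description implies can be sketched as follows. This is a minimal illustration, not the actual SparkSubmit code: the method name, the `Option`-based signature, and the `1g` fallback are assumptions for illustration; only the `SPARK_DRIVER_MEMORY` variable name comes from the discussion.

```scala
// Hypothetical sketch of memory resolution for the driver: an explicit
// command-line value wins, then the SPARK_DRIVER_MEMORY environment
// variable, then an assumed default. The point of the PR is that this
// resolution must also happen where the driver actually launches in
// cluster mode, not only inside the SparkSubmit JVM.
object MemoryResolution {
  def resolveDriverMemory(cliOpt: Option[String], env: Map[String, String]): String =
    cliOpt
      .orElse(env.get("SPARK_DRIVER_MEMORY"))
      .getOrElse("1g") // assumed default, for illustration only
}
```

In client mode the SparkSubmit JVM *is* the driver, so reading the environment variable there is enough; in cluster mode the value has to be forwarded to the worker machine that hosts the driver, which is the gap this PR addresses.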
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56465510 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20674/consoleFull) for PR 1486 at commit [`8f9c5d6`](https://github.com/apache/spark/commit/8f9c5d66d7a630ebfee64afee7fa922c22f838ee). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3652] [SQL] upgrade spark sql hive vers...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2499#issuecomment-56465504 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56465520 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20674/
[GitHub] spark pull request: [SPARK-3652] [SQL] upgrade spark sql hive vers...
GitHub user scwf opened a pull request: https://github.com/apache/spark/pull/2499 [SPARK-3652] [SQL] upgrade spark sql hive version to 0.13.1 Spark SQL currently builds against Hive 0.12.0 and does not support 0.13.1 because of API-level changes in the new Hive version. Since Hive is backwards compatible, this PR simply upgrades the Hive version to 0.13.1 (compiling this PR against 0.12.0 will produce errors); I think this is acceptable for users, and we also do not need to support multiple versions of Hive. Notes: 1. The package command is unchanged; `sbt/sbt -Phive assembly` will produce the assembly jar with Hive 0.13.1. 2. This PR uses `org.apache.hive` since there is no shaded `org.spark-project.hive` artifact for 0.13.1. 3. I regenerated the golden answers since some SQL query results changed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/scwf/spark hive-0.13.1-clean Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2499.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2499 commit 5d9de8ec6145d286ca05906d5e1dd1cfd9760e71 Author: scwf Date: 2014-09-21T15:45:00Z update pom to org.apache.hive 0.13.1 version commit c3aa95f9861a541df249f29ddb35b5ad9e6a4751 Author: w00228970 Date: 2014-09-22T04:56:44Z fix errors of hive/hive-thriftserver when update to org.apache.hive 0.13.1 commit 22f648655aa5941e53e65cbeadb097a32e0af8cf Author: w00228970 Date: 2014-09-22T06:00:55Z fix StatisticsSuite error commit f9fdc1ca944e14a986d910b7093da3ae4586cc68 Author: w00228970 Date: 2014-09-22T06:38:01Z loginFromKeyTab when set hive.server2.authentication commit 2afcaa1e6f579b209ecc07d98a990520cdb81350 Author: w00228970 Date: 2014-09-22T08:30:30Z delete invalid set fs.default.name, this will lead to query error since SessionStat.start changed in hive0.13.1 commit a09fc4e37d54fda41c8cbf6afc6d577ece51ec55 Author: w00228970 Date: 
2014-09-22T08:42:28Z fix Operation cancelled commit 8b9309014e4e76560378a543fdddec51c874092c Author: w00228970 Date: 2014-09-22T09:09:08Z regenerate golden answer commit 9bee908fdc4ee947e2c96e8c0e9006f2023eb870 Author: w00228970 Date: 2014-09-22T10:09:39Z ignore stats_empty_partition commit 0b15b748e94fd6afbc19cd4397cf9f74adf9064b Author: w00228970 Date: 2014-09-22T10:11:07Z add logic for case VoidObjectInspector in method inspectorToDataType commit eab2354187ce88c051ffc6c149847b08e532804b Author: w00228970 Date: 2014-09-22T10:39:51Z reset TestHive in CachedTablesuite commit 853632d71bdb16a6792776c951257524d728c8eb Author: w00228970 Date: 2014-09-22T14:34:52Z fix Hivequerysuite commit 6d5d0710eb2ab1c14208deb158c2f4b018ddbf33 Author: w00228970 Date: 2014-09-22T14:59:41Z fix analyze MetastoreRelations
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user rnowling commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56464302 @Ishiihara Thanks for pointing out the style check -- I found and fixed the style error in IDF.scala. Thanks for mentioning options for the minimumOccurence members. I decided to add the `val` keyword rather than a setter. Earlier I had considered several approaches, including making it an optional parameter and adding a Scala-style setter, but I found that neither provided clean Java interoperability. As a result, I settled on the overloaded-constructor approach, which also better matches Scala's emphasis on immutability. Since creating IDFs is inexpensive, I don't think performance will be an issue.
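The overloaded-constructor pattern described above can be sketched as follows. Class and field names mirror those in the QA bot's class listing for this PR (including its spelling of `minimumOccurence`); the bodies are placeholders for illustration, not the MLlib implementation.

```scala
// Sketch of the overloaded-constructor approach: an auxiliary no-arg
// constructor delegates to the primary constructor, so Java callers see
// two ordinary constructors instead of a Scala default parameter (which
// is awkward to call from Java).
class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable {
  def this() = this(0L) // no-arg overload: no frequency filtering
}

class IDF(val minimumOccurence: Long) {
  def this() = this(0L) // no-arg overload: keep all terms
}
```

From Java, `new IDF()` and `new IDF(5L)` both resolve naturally, which is the interoperability the comment is after, and the `val` keeps the field immutable.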
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56464144 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20676/consoleFull) for PR 2494 at commit [`1801fd2`](https://github.com/apache/spark/commit/1801fd2e9518c610b3657c6a9cb9239fedd43847). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56463906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20675/
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56463904 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20675/consoleFull) for PR 2494 at commit [`6897252`](https://github.com/apache/spark/commit/689725201b3fbfa1232f4b5f74dc5002c8950b3f). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class IDF(val minimumOccurence: Long) ` * ` class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable `
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56463855 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20675/consoleFull) for PR 2494 at commit [`6897252`](https://github.com/apache/spark/commit/689725201b3fbfa1232f4b5f74dc5002c8950b3f). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56463517 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20674/consoleFull) for PR 1486 at commit [`8f9c5d6`](https://github.com/apache/spark/commit/8f9c5d66d7a630ebfee64afee7fa922c22f838ee). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2492#issuecomment-56463455 > Maybe my JIRA was misleadingly named; my motivation here is allowing users to specify versions of packages that take precedence over other versions of that same package that might be installed on the system, not in overriding modules included in Python's standard library (although the ability to do that is a side-effect of this change). Understood, but this side-effect is a bit dangerous. Third-party packages can appear anywhere in sys.path, for example: ```python >>> import sys >>> sys.path ['', '//anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7.egg', '//anaconda/lib/python2.7/site-packages/protobuf-2.5.0-py2.7.egg', '//anaconda/lib/python2.7/site-packages/msgpack_python-0.4.2-py2.7-macosx-10.5-x86_64.egg', '//anaconda/lib/python2.7/site-packages/setuptools-3.6-py2.7.egg', '/Users/daviesliu/work/spark/python/lib', '/Users/daviesliu/work/spark/python/lib/py4j-0.8.2.1-src.zip', '/Users/daviesliu/work/spark/python', '//anaconda/lib/python27.zip', '//anaconda/lib/python2.7', '//anaconda/lib/python2.7/plat-darwin', '//anaconda/lib/python2.7/plat-mac', '//anaconda/lib/python2.7/plat-mac/lib-scriptpackages', '//anaconda/lib/python2.7/lib-tk', '//anaconda/lib/python2.7/lib-old', '//anaconda/lib/python2.7/lib-dynload', '//anaconda/lib/python2.7/site-packages', '//anaconda/lib/python2.7/site-packages/PIL', '//anaconda/lib/python2.7/site-packages/runipy-0.1.0-py2.7.egg'] ``` so it's not easy to find a position that comes before the third-party packages but after the standard library modules.
[GitHub] spark pull request: WHITESPACE CHANGE DO NOT MERGE
Github user shaneknapp closed the pull request at: https://github.com/apache/spark/pull/2498
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56463199 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20673/consoleFull) for PR 2497 at commit [`b3b3e50`](https://github.com/apache/spark/commit/b3b3e50a0c13df08a607f036592f83e566cded39). * This patch merges cleanly.
[GitHub] spark pull request: WHITESPACE CHANGE DO NOT MERGE
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2498#issuecomment-56463163 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20672/
[GitHub] spark pull request: WHITESPACE CHANGE DO NOT MERGE
Github user shaneknapp commented on the pull request: https://github.com/apache/spark/pull/2498#issuecomment-56463049 jenkins, test this please
[GitHub] spark pull request: WHITESPACE CHANGE DO NOT MERGE
GitHub user shaneknapp opened a pull request: https://github.com/apache/spark/pull/2498 WHITESPACE CHANGE DO NOT MERGE WHITESPACE CHANGE DO NOT MERGE You can merge this pull request into a Git repository by running: $ git pull https://github.com/shaneknapp/spark sknapptest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2498.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2498 commit c15f44ae06b71ccc0ed629771206760ab1c57797 Author: shane knapp Date: 2014-09-11T15:33:50Z DO NOT MERGE, TESTING ONLY commit 4e0747f7fd60a61f429b3e623072616690769d67 Author: shane knapp Date: 2014-09-23T00:18:03Z WHITESPACE CHANGE DO NOT MERGE
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2497#issuecomment-56462789 Backport of #2469 to branch-1.1. Sending now to speed up the review process, since the original PR doesn't merge cleanly into this branch.
[GitHub] spark pull request: [SPARK-3606] [yarn] Correctly configure AmIpFi...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/2497 [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA (1.1 vers... ...ion). This is a backport of SPARK-3606 to branch-1.1. Some of the code had to be duplicated since branch-1.1 doesn't have the cleanup work that was done to the Yarn codebase. I don't know whether the version issue in yarn/alpha/pom.xml was intentional, but I couldn't compile the code without fixing it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vanzin/spark SPARK-3606-1.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2497.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2497 commit b3b3e50a0c13df08a607f036592f83e566cded39 Author: Marcelo Vanzin Date: 2014-09-19T23:40:43Z [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA (1.1 version). This is a backport of SPARK-3606 to branch-1.1. Some of the code had to be duplicated since branch-1.1 doesn't have the cleanup work that was done to the Yarn codebase. I don't know whether the version issue in yarn/alpha/pom.xml was intentional, but I couldn't compile the code without fixing it.
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2494#discussion_r17886499 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -30,9 +30,20 @@ import org.apache.spark.rdd.RDD * Inverse document frequency (IDF). * The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`, where `m` is the total * number of documents and `d(t)` is the number of documents that contain term `t`. + * + * This implementation supports filtering out terms which do not appear in a minimum number + * of documents (controlled by the variable minimumOccurence). For terms that are not in + * at least `minimumOccurence` documents, the IDF is found as 0, resulting in TF-IDFs of 0. + * + * @param minimumOccurence minimum of documents in which a term + * should appear for filtering + * + * */ @Experimental -class IDF { +class IDF(minimumOccurence: Long) { --- End diff -- You can add a val before minimumOccurence. Alternatively, if you want to set minimumOccurence after new IDF(), you can define a private field and use a setter to set the value.
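The two options the reviewer mentions could be sketched roughly as follows (a simplified illustration, not the actual patch; class and method names beyond `IDF` and `minimumOccurence` are hypothetical):

```scala
// Option 1: promote the constructor parameter to a field with `val`,
// so callers can read the threshold back after construction.
class IDF(val minimumOccurence: Long) {
  def this() = this(0L) // default: no filtering
}

// Option 2: keep a private mutable field and expose a builder-style setter,
// so the threshold can be changed after `new IDFWithSetter()`.
class IDFWithSetter {
  private var minimumOccurence: Long = 0L

  def setMinimumOccurence(value: Long): this.type = {
    require(value >= 0, "minimumOccurence must be non-negative")
    minimumOccurence = value
    this
  }
}
```

Option 1 is the lighter change; option 2 matches the setter idiom used elsewhere in MLlib estimators, at the cost of mutability.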
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886353 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -309,4 +323,42 @@ private[spark] object HadoopRDD { f(inputSplit, firstParent[T].iterator(split, context)) } } + + private[spark] class SplitInfoReflections { +val inputSplitWithLocationInfo = + Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo") +val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo") +val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit") +val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo") +val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo") +val isInMemory = splitLocationInfo.getMethod("isInMemory") +val getLocation = splitLocationInfo.getMethod("getLocation") + } + + private[spark] val SPLIT_INFO_REFLECTIONS = try { --- End diff -- Sorry, I forgot about this one. I added a type annotation to SPLIT_INFO_REFLECTIONS.
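The gist of the requested type annotation is to pin down the inferred type of the whole try/catch expression, so the public type of the field can't silently change if the catch branch is edited. A simplified sketch (the exact annotation in the final patch may differ):

```scala
// Sketch: wrap the reflective lookups in a try and annotate the result
// explicitly as Option[SplitInfoReflections]. On older Hadoop versions
// without the location-info API, Class.forName throws and we fall back
// to None instead of failing the whole job.
private[spark] val SPLIT_INFO_REFLECTIONS: Option[SplitInfoReflections] =
  try {
    Some(new SplitInfoReflections)
  } catch {
    case e: Exception => None // location-info classes not on the classpath
  }
```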
[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2087#issuecomment-56461580 MapReduce doesn't use getPos, but it does look like it might be helpful in some situations. One caveat is that pos only means # bytes for file input formats. For example, for DBInputFormat, it means the number of records. If we choose to use getPos for pre-2.5 Hadoop, my preference would be to make that change in a separate patch.
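The caveat here is that `getPos` returns a "position" whose unit depends on the input format, so it can't be treated as a bytes-read metric unconditionally. A hypothetical illustration (the reader classes below are stand-ins, not Hadoop's actual implementations):

```scala
// Illustration of the getPos unit mismatch. For file-based formats the
// position is a byte offset; for DBInputFormat-style readers it counts
// records, so reporting it as bytes read would misstate input metrics.
trait PosReader { def getPos: Long }

// File-based formats: pos is a byte offset into the split.
class FilePosReader(bytesRead: Long) extends PosReader {
  def getPos: Long = bytesRead
}

// Database-backed formats: pos counts records, not bytes.
class DbPosReader(recordsRead: Long) extends PosReader {
  def getPos: Long = recordsRead
}
```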
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56461303 @rnowling Please run sbt/sbt scalastyle on your local machine to clear out style issues.
[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2495#issuecomment-56461187 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20670/
[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2495#issuecomment-56461185 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20670/consoleFull) for PR 2495 at commit [`d054d33`](https://github.com/apache/spark/commit/d054d33181486e3b90222e5e30b2f20648434673). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56461090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20671/
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886029 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala --- @@ -22,13 +22,35 @@ package org.apache.spark.scheduler * In the latter case, we will prefer to launch the task on that executorID, but our next level * of preference will be executors on the same host if this is not possible. */ -private[spark] -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable { - override def toString: String = "TaskLocation(" + host + ", " + executorId + ")" +private[spark] sealed abstract class TaskLocation(val host: String) { +} + +private [spark] case class ExecutorCacheTaskLocation(override val host: String, +val executorId: String) extends TaskLocation(host) { +} + +private [spark] case class HDFSCachedTaskLocation(override val host: String) +extends TaskLocation(host) { + override def toString = TaskLocation.in_memory_location_tag + host +} + +private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) { + override def toString = host } private[spark] object TaskLocation { - def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId)) + // We identify hosts on which the block is cached with this prefix. Because this prefix contains + // underscores, which are not legal characters in hostnames, there should be no potential for + // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+ val in_memory_location_tag = "_hdfs_cache_" + + def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId) - def apply(host: String) = new TaskLocation(host, None) + def apply(str: String) = { +if (str.startsWith(in_memory_location_tag)) { + new HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length)) --- End diff -- ok
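The parsing scheme under review, distinguishing an HDFS-cached location from a plain hostname by a reserved prefix, can be sketched in isolation (a simplified stand-in for the PR's `TaskLocation.apply`, with the class hierarchy reduced to a tuple):

```scala
// Simplified sketch of the prefix-based dispatch. The "_hdfs_cache_" tag
// contains underscores, which RFC 952/1123 forbid in hostnames, so it can
// never collide with a real host string.
object TaskLocationSketch {
  val inMemoryLocationTag = "_hdfs_cache_"

  // Returns (host, isHdfsCached).
  def parse(str: String): (String, Boolean) = {
    if (str.startsWith(inMemoryLocationTag)) {
      (str.substring(inMemoryLocationTag.length), true) // cached replica on host
    } else {
      (str, false) // ordinary host-level preference
    }
  }
}
```

This is the "define our own URIs" idea mridulm raises elsewhere in the thread: the string encoding round-trips through `toString`/`apply`, with the untagged form defaulting to a host-only location.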
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56461087 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20671/consoleFull) for PR 2494 at commit [`a200bab`](https://github.com/apache/spark/commit/a200babbad7280d3a20f05abb84140b0b8d51b85). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class IDF(minimumOccurence: Long) ` * ` class DocumentFrequencyAggregator(minimumOccurence: Long) extends Serializable `
[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56461020 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20671/consoleFull) for PR 2494 at commit [`a200bab`](https://github.com/apache/spark/commit/a200babbad7280d3a20f05abb84140b0b8d51b85). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user cmccabe commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17886024 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala --- @@ -22,13 +22,35 @@ package org.apache.spark.scheduler * In the latter case, we will prefer to launch the task on that executorID, but our next level * of preference will be executors on the same host if this is not possible. */ -private[spark] -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable { - override def toString: String = "TaskLocation(" + host + ", " + executorId + ")" +private[spark] sealed abstract class TaskLocation(val host: String) { +} + +private [spark] case class ExecutorCacheTaskLocation(override val host: String, +val executorId: String) extends TaskLocation(host) { +} + +private [spark] case class HDFSCachedTaskLocation(override val host: String) +extends TaskLocation(host) { + override def toString = TaskLocation.in_memory_location_tag + host +} + +private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) { + override def toString = host } private[spark] object TaskLocation { - def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId)) + // We identify hosts on which the block is cached with this prefix. Because this prefix contains + // underscores, which are not legal characters in hostnames, there should be no potential for + // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames. + val in_memory_location_tag = "_hdfs_cache_" + + def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId) - def apply(host: String) = new TaskLocation(host, None) + def apply(str: String) = { --- End diff -- added