[GitHub] spark pull request: [SPARK-3952] [Streaming] [PySpark] add Python ...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2808#issuecomment-59600543
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/397/consoleFull) for PR 2808 at commit [`26a7e37`](https://github.com/apache/spark/commit/26a7e379f6135a51478552bfd7c85af449c4bb69).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3916] [Streaming] discover new appended...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2806#issuecomment-59600547
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/398/consoleFull) for PR 2806 at commit [`09561e8`](https://github.com/apache/spark/commit/09561e8f99c6969c071223778cbadeba1b77292e).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-17 Thread scwf
Github user scwf commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r19051903
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 ---
@@ -504,19 +505,41 @@ private[parquet] object FileSystemHelper {
 fs.listStatus(path).map(_.getPath)
   }
 
-/**
- * Finds the maximum taskid in the output file names at the given path.
- */
-  def findMaxTaskId(pathStr: String, conf: Configuration): Int = {
+  /**
+   *  List files with special extension
+   */
+  def listFiles(origPath: Path, conf: Configuration, extension: String): 
Seq[Path] = {
+val fs = origPath.getFileSystem(conf)
+if (fs == null) {
+  throw new IllegalArgumentException(
+s"OrcTableOperations: Path $origPath is incorrectly formatted")
+}
+val path = origPath.makeQualified(fs)
+if (fs.exists(path) && fs.getFileStatus(path).isDir) {
+  fs.listStatus(path).map(_.getPath).filter(p => 
p.getName.endsWith(extension))
--- End diff --

But ```globStatus``` does not list the files under the path; here we need to 
list the *.orc or *.parquet files under this directory.
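
For illustration, here is a minimal sketch (not the PR's code) of the two
approaches under discussion, assuming the standard Hadoop `FileSystem` API:
`listStatus` enumerates a directory's children, which can then be filtered by
extension, while `globStatus` expands a glob pattern, so the wildcard has to
be part of the path itself.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListingSketch {
  // listStatus: enumerate the directory's children, then filter by extension.
  def listByExtension(dir: Path, conf: Configuration, extension: String): Seq[Path] = {
    val fs: FileSystem = dir.getFileSystem(conf)
    fs.listStatus(dir).map(_.getPath).filter(_.getName.endsWith(extension)).toSeq
  }

  // globStatus: expand a glob pattern, so the wildcard is appended to the directory path.
  def listByGlob(dir: Path, conf: Configuration, extension: String): Seq[Path] = {
    val fs: FileSystem = dir.getFileSystem(conf)
    fs.globStatus(new Path(dir, "*" + extension)).map(_.getPath).toSeq
  }
}
```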





[GitHub] spark pull request: [WIP][SPARK-3822] Executor scaling mechanism f...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2840#issuecomment-59598147
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull) for PR 2840 at commit [`d987b3e`](https://github.com/apache/spark/commit/d987b3e9e33165a482189343c324a0babc6ff3f9).
  * This patch **passes all tests**.
  * This patch merges cleanly.
  * This patch adds the following public classes _(experimental)_:
    * `case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase: String)`
    * `case class RequestExecutors(numExecutors: Int) extends CoarseGrainedClusterMessage`
    * `case class KillExecutor(executorId: String) extends CoarseGrainedClusterMessage`
    * `class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)`






[GitHub] spark pull request: [WIP][SPARK-3822] Executor scaling mechanism f...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2840#issuecomment-59598148
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21874/
Test PASSed.





[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59598071
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21873/
Test PASSed.





[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59598067
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21873/consoleFull) for PR 2828 at commit [`fe0e02f`](https://github.com/apache/spark/commit/fe0e02feaa8ac3e01ea7e90240e46a3d5a276864).
  * This patch **passes all tests**.
  * This patch merges cleanly.
  * This patch adds the following public classes _(experimental)_:
    * `case class ReconnectWorker(masterUrl: String) extends DeployMessage`






[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59597958
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21872/consoleFull) for PR 2838 at commit [`8872914`](https://github.com/apache/spark/commit/88729145f6504397cc51a8851ea99e92f7a8938e).
  * This patch **passes all tests**.
  * This patch merges cleanly.
  * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59597959
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21872/
Test PASSed.





[GitHub] spark pull request: [WIP][SPARK-3822] Executor scaling mechanism f...

2014-10-17 Thread PraveenSeluka
Github user PraveenSeluka commented on the pull request:

https://github.com/apache/spark/pull/2840#issuecomment-59597063
  
Hey @andrewor14, one quick comment on the API. Instead of 
`killExecutor(executorId: String)`, it would be better to have 
`killExecutors(executorIds: List[String])`. The auto-scaling code needs to 
release a list of executors, and this would make it a single API call (and a 
single message).
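
As a sketch only (the signatures are taken from the comment above, not from
the PR), the two API shapes being compared would look roughly like this:

```scala
// Hypothetical trait, for illustration of the suggested API change only.
trait ExecutorScalingBackend {
  // Current proposal: one call, and one message to the cluster manager, per executor.
  def killExecutor(executorId: String): Unit

  // Suggested alternative: release a batch of executors in a single call and message.
  def killExecutors(executorIds: List[String]): Unit
}
```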





[GitHub] spark pull request: [WIP][SPARK-3822] Executor scaling mechanism f...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2840#issuecomment-59596681
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull) for PR 2840 at commit [`d987b3e`](https://github.com/apache/spark/commit/d987b3e9e33165a482189343c324a0babc6ff3f9).
  * This patch merges cleanly.





[GitHub] spark pull request: [Spark-3822] Ability to add/delete executors f...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2798#issuecomment-59596617
  
Hey @PraveenSeluka I opened a PR at #2840. Let me know if you have any 
questions or comments. Thanks for your work!





[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59596562
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21873/consoleFull) for PR 2828 at commit [`fe0e02f`](https://github.com/apache/spark/commit/fe0e02feaa8ac3e01ea7e90240e46a3d5a276864).
  * This patch merges cleanly.





[GitHub] spark pull request: [WIP][SPARK-3822] Executor scaling mechanism f...

2014-10-17 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/2840

[WIP][SPARK-3822] Executor scaling mechanism for Yarn

This is part of a broader effort to enable dynamic scaling of executors 
([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is 
intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797.

The logic builds on @PraveenSeluka's work in #2798. It differs from the changes 
there in that the mechanism is implemented within the existing scheduler backend 
framework rather than in new `Actor` classes. It also introduces an abstract 
parent class `YarnSchedulerBackend` to encapsulate the common logic for 
communicating with the Yarn `ApplicationMaster`.

I have tested this on a stable Yarn cluster. This is still WIP because when an 
executor is removed, `SparkContext` and its components react as if it had 
failed, resulting in many scary error messages and eventual timeouts. While it's 
not strictly necessary to fix this for the first-cut implementation of this 
mechanism, it would be good to add logic to distinguish this case if it doesn't 
require too much more work.
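
As a rough illustration of the message flow described above (this is a sketch,
not the implementation in this PR), assume the `RequestExecutors` and
`KillExecutor` messages that the QA bot lists for this PR and an Akka
`ActorRef` pointing at the ApplicationMaster; the backend class and method
names below are placeholders:

```scala
import akka.actor.ActorRef

sealed trait CoarseGrainedClusterMessage
case class RequestExecutors(numExecutors: Int) extends CoarseGrainedClusterMessage
case class KillExecutor(executorId: String) extends CoarseGrainedClusterMessage

// Hypothetical parent backend that forwards scaling requests to the Yarn ApplicationMaster.
abstract class YarnSchedulerBackendSketch(amActor: ActorRef) {
  /** Ask the AM for more executors (best-effort, fire-and-forget). */
  def requestExecutors(numExecutors: Int): Unit = amActor ! RequestExecutors(numExecutors)

  /** Ask the AM to kill a specific executor. */
  def killExecutor(executorId: String): Unit = amActor ! KillExecutor(executorId)
}
```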

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark yarn-scaling-mechanism

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2840.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2840


commit bbee669d414bca292380028070ebd684ddca6c88
Author: Andrew Or 
Date:   2014-10-16T20:33:09Z

Add mechanism in yarn client mode

This commit allows the YarnClientSchedulerBackend to communicate
with the AM to request or kill executors dynamically. This extends
the existing MonitorActor in ApplicationMaster to also accept
messages to scale the number of executors up/down.

TODO: extend this to the yarn cluster backend as well, and put
the common functionality in its own class.

commit 53e81454a65c2543afdfbd602f2fec3b850872c4
Author: Andrew Or 
Date:   2014-10-17T01:04:08Z

Start yarn actor early to listen for AM registration message

Previously the registration message was dropped, and the feature was never
successfully enabled. This is an easy fix, but it took a long time to
understand what was wrong. This is Akka in its finest glory.

commit c4dfaac0c73689b79663e924e875e8ff93d8e5c6
Author: Andrew Or 
Date:   2014-10-17T01:45:30Z

Avoid thrashing when removing executors

Previously, we would immediately add an executor back upon removing it,
simply because we didn't keep the relevant counter updated.

An important TODO at this point is to inform the SparkContext
sooner about the executor successfully being killed. Otherwise
we have to wait for the BlockManager timeout, which may take a
long time.

commit 47466cd9d0f4a4f2f060e82796b25867092f5de0
Author: Andrew Or 
Date:   2014-10-18T01:32:24Z

Refactor common Yarn scheduler backend logic

As of this commit the mechanism is accessible from cluster mode
in addition to just client mode.

commit 7b76d0a1b0724f0c6572b8ffa9a13135d6d63b5f
Author: Andrew Or 
Date:   2014-10-18T02:04:30Z

Expose mechanism in SparkContext as developer API

commit d987b3e9e33165a482189343c324a0babc6ff3f9
Author: Andrew Or 
Date:   2014-10-18T02:33:54Z

Move addWebUIFilters to Yarn scheduler backend

It's only used by Yarn.







[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59596462
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21872/consoleFull) for PR 2838 at commit [`8872914`](https://github.com/apache/spark/commit/88729145f6504397cc51a8851ea99e92f7a8938e).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3877][YARN] Throw an exception when app...

2014-10-17 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/2732#issuecomment-59596417
  
Already updated the docs and the failure message.





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59596407
  
@aarondav Yes, before reusing workers, every Python task would fork a new 
Python worker.





[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-17 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2743#discussion_r19051286
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -63,9 +64,12 @@ private[spark] class PythonRDD(
 val localdir = env.blockManager.diskBlockManager.localDirs.map(
   f => f.getPath()).mkString(",")
 envVars += ("SPARK_LOCAL_DIRS" -> localdir) // it's also used in 
monitor thread
-if (reuse_worker) {
+if (reuseWorker) {
   envVars += ("SPARK_REUSE_WORKER" -> "1")
 }
+if (!memoryLimit.isEmpty) {
+  envVars += ("PYSPARK_WORKER_MEMORY_LIMIT" -> memoryLimit)
--- End diff --

The Python worker cannot access the JVM, so it cannot read the conf. The 
environment variable is used internally; it is not public.

Right now, the only way to pass configuration to a Python worker is through an 
environment variable.





[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-17 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2743#discussion_r19051269
  
--- Diff: python/pyspark/conf.py ---
@@ -57,6 +57,22 @@
 __all__ = ['SparkConf']
 
 
+def _parse_memory(s):
+"""
+Parse a memory string in the format supported by Java (e.g. 1g, 200m) 
and
+return the value in MB
+
+>>> _parse_memory("256m")
+256
+>>> _parse_memory("2g")
+2048
+"""
+units = {'g': 1024, 'm': 1, 't': 1 << 20, 'k': 1.0 / 1024}
+if s[-1] not in units:
+raise ValueError("invalid format: " + s)
+return int(float(s[:-1]) * units[s[-1].lower()])
+
+
--- End diff --

This is used by two modules and is related to the conf, so it's better to put 
it in pyspark.conf.





[GitHub] spark pull request: Minor change in the comment of spark-defaults....

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2709#issuecomment-59596156
  
Hey @dbtsai, can you update this now that #2379 has gone in? In particular, 
this is now used by the Spark daemons too (i.e. Worker, Master, HistoryServer). 
I don't feel strongly one way or the other about commenting that this also 
applies to spark-shell and pyspark. If you're occupied with other things I can 
also take this over if you wish.





[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2743#discussion_r19051237
  
--- Diff: python/pyspark/conf.py ---
@@ -57,6 +57,22 @@
 __all__ = ['SparkConf']
 
 
+def _parse_memory(s):
+"""
+Parse a memory string in the format supported by Java (e.g. 1g, 200m) 
and
+return the value in MB
+
+>>> _parse_memory("256m")
+256
+>>> _parse_memory("2g")
+2048
+"""
+units = {'g': 1024, 'm': 1, 't': 1 << 20, 'k': 1.0 / 1024}
+if s[-1] not in units:
+raise ValueError("invalid format: " + s)
+return int(float(s[:-1]) * units[s[-1].lower()])
+
+
--- End diff --

Any reason to move this?





[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2743#discussion_r19051234
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -63,9 +64,12 @@ private[spark] class PythonRDD(
 val localdir = env.blockManager.diskBlockManager.localDirs.map(
   f => f.getPath()).mkString(",")
 envVars += ("SPARK_LOCAL_DIRS" -> localdir) // it's also used in 
monitor thread
-if (reuse_worker) {
+if (reuseWorker) {
   envVars += ("SPARK_REUSE_WORKER" -> "1")
 }
+if (!memoryLimit.isEmpty) {
+  envVars += ("PYSPARK_WORKER_MEMORY_LIMIT" -> memoryLimit)
--- End diff --

Why do we need both an environment variable and the config? Can the Python 
worker get the config value? Elsewhere in Spark we've had to worry about 
precedence order when people set both, and it would be good if we could avoid 
that here.





[GitHub] spark pull request: [SPARK-3970] Remove duplicate removal of local...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2826#issuecomment-59595864
  
LGTM, other comments @srowen?





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59595776
  
LGTM. I think @mateiz wrote the original code so maybe he can take a look.





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59595763
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21871/consoleFull) for PR 2838 at commit [`660875b`](https://github.com/apache/spark/commit/660875b5cafb86e231d9d0b3a3f44ba9a13790a3).
  * This patch **fails PySpark unit tests**.
  * This patch merges cleanly.
  * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59595770
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21871/
Test FAILed.





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19051180
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -238,8 +238,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   @deprecated("Use reduceByKeyLocally", "1.0.0")
   def reduceByKeyToDriver(func: (V, V) => V): Map[K, V] = 
reduceByKeyLocally(func)
 
-  /** Count the number of elements for each key, and return the result to 
the master as a Map. */
-  def countByKey(): Map[K, Long] = self.map(_._1).countByValue()
+  /** 
+   * Count the number of elements for each key, collecting the results to 
a local Map.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.mapValues(_ => 
1).reduceByKey(_ + _), which
+   * returns an RDD[T, Long] instead of a map.
+   */
+  def countByKey(): Map[K, Long] = self.mapValues(_ => 1L).reduceByKey(_ + 
_).collect.toMap
--- End diff --

think you need `collect()`
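
For context, here is a self-contained sketch (illustration only, not part of
the patch) of the counting pattern shown in the quoted diff, run against a
local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("countByKey-sketch").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Count per key in a distributed way; only the small per-key totals reach the driver.
    val counts: Map[String, Long] =
      rdd.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap

    println(counts)  // e.g. Map(a -> 2, b -> 1)
    sc.stop()
  }
}
```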





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19051176
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -911,32 +911,15 @@ abstract class RDD[T: ClassTag](
   }
 
   /**
-   * Return the count of each unique value in this RDD as a map of (value, 
count) pairs. The final
-   * combine step happens locally on the master, equivalent to running a 
single reduce task.
+   * Return the count of each unique value in this RDD as a local map of 
(value, count) pairs.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.map(x => (x, 
1)).reduceByKey(_ + _), which
--- End diff --

1L





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19051173
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -238,8 +238,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   @deprecated("Use reduceByKeyLocally", "1.0.0")
   def reduceByKeyToDriver(func: (V, V) => V): Map[K, V] = 
reduceByKeyLocally(func)
 
-  /** Count the number of elements for each key, and return the result to 
the master as a Map. */
-  def countByKey(): Map[K, Long] = self.map(_._1).countByValue()
+  /** 
+   * Count the number of elements for each key, collecting the results to 
a local Map.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.mapValues(_ => 
1).reduceByKey(_ + _), which
--- End diff --

1L





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59594861
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21869/
Test PASSed.





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59594858
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21869/consoleFull) for PR 2839 at commit [`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841).
  * This patch **passes all tests**.
  * This patch merges cleanly.
  * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3969][SQL] Optimizer should have a supe...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2825#issuecomment-59594302
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21870/consoleFull) for PR 2825 at commit [`abbc53c`](https://github.com/apache/spark/commit/abbc53cc9b1e02d19c2f2200947bcb86bf33511c).
  * This patch **passes all tests**.
  * This patch merges cleanly.
  * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3969][SQL] Optimizer should have a supe...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2825#issuecomment-59594304
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21870/
Test PASSed.





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59594214
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21871/consoleFull) for PR 2838 at commit [`660875b`](https://github.com/apache/spark/commit/660875b5cafb86e231d9d0b3a3f44ba9a13790a3).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59594220
  
Just for my understanding, is the solution here that take() will cause workers 
to die rather than be reused with bad data left in the socket?





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59594023
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-3935][Core] log the number of records t...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2791#issuecomment-59593868
  
Ok, I'll put you under wangfei





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59592259
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21869/consoleFull) for PR 2839 at commit [`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3969][SQL] Optimizer should have a supe...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2825#issuecomment-59592260
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21870/consoleFull) for PR 2825 at commit [`abbc53c`](https://github.com/apache/spark/commit/abbc53cc9b1e02d19c2f2200947bcb86bf33511c).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3935][Core] log the number of records t...

2014-10-17 Thread jackylk
Github user jackylk commented on the pull request:

https://github.com/apache/spark/pull/2791#issuecomment-59592210
  
For this PR, I used wangfei's account in JIRA. I will create my account 
next time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-3795] Heuristics for dynamically s...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2746#issuecomment-59592148
  
Hi all. I have discussed the design offline with @kayousterhout and 
@pwendell and we have come to the following high level consensus:

- We should treat adding executors as best-effort. This means there is no need 
to retry it, and we shouldn't wait for the new executors to register before 
asking for more. The latter point means an exponential-increase policy can 
become an add-to-max policy if the add interval is set to a small value.
- The approach of determining the number of executors to add based on the 
number of pending tasks will be under consideration in the future, but will not 
be a part of this release. This is mainly because this add policy is more 
opaque to the user and the number added may be unpredictable depending on when 
the timer is triggered.
- In a future release, we will make the scaling policies pluggable. Until then, 
for this release, we will expose `@DeveloperApi` methods `sc.addExecutors` and 
`sc.removeExecutors` in case an application wants to use this feature on its 
own (it won't have to enable `spark.dynamicAllocation.enabled` to use these).
- We should assume that removes will always succeed for simplicity. This 
means there is no need to retry them.
- To simplify the timer logic, we will make the variables hold the expiration 
time of the timer instead of a counter that is reset to 0 every time the timer 
triggers. This makes the semantics of the timer easier to understand (see the 
sketch after this message).
- Use the listener API to identify when tasks have built up, for testability.

I should emphasize that this design is only for the first-cut implementation 
of this feature. We will make an effort to generalize it and expose the ability 
for users to implement their own heuristics in 1.3 (tentative). Lastly, I will 
implement all of these shortly, and the new code will likely be quite 
different. Please kindly hold back your reviews until then.
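
A rough sketch of the timer simplification mentioned above (the names here are
placeholders, not the eventual code): instead of a counter that is reset every
time the timer fires, keep the absolute time at which the pending action
expires.

```scala
// Illustrative only: an "add executors" request is armed with an expiration time
// and fires once that time has passed.
class AddTimerSketch(addIntervalMs: Long) {
  private var addExpirationTime: Long = -1L  // -1 means no add is currently pending

  /** Arm the timer when the scheduler first becomes backlogged. */
  def onBacklogged(nowMs: Long): Unit = {
    if (addExpirationTime < 0) {
      addExpirationTime = nowMs + addIntervalMs
    }
  }

  /** True if the add should fire now (best-effort, no retry, per the notes above). */
  def shouldAddExecutors(nowMs: Long): Boolean = {
    if (addExpirationTime >= 0 && nowMs >= addExpirationTime) {
      addExpirationTime = -1L
      true
    } else {
      false
    }
  }
}
```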





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59591925
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-3969][SQL] Optimizer should have a supe...

2014-10-17 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2825#discussion_r19050137
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -28,7 +28,9 @@ import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
 import org.apache.spark.sql.catalyst.types._
 
-object Optimizer extends RuleExecutor[LogicalPlan] {
+abstract class Optimizer extends RuleExecutor[LogicalPlan]
+
+object SparkOptimizer extends Optimizer {
--- End diff --

Thank you for your suggestion.
I agree that Catalyst should not be tightly coupled with Spark.





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59591794
  
**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21867/consoleFull)** for PR 2839 at commit [`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841) after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59591796
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21867/
Test FAILed.





[GitHub] spark pull request: [SPARK-3969][SQL] Optimizer should have a supe...

2014-10-17 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2825#discussion_r19050063
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ExpressionOptimizationSuite.scala
 ---
@@ -30,7 +30,7 @@ class ExpressionOptimizationSuite extends 
ExpressionEvaluationSuite {
   expected: Any,
   inputRow: Row = EmptyRow): Unit = {
 val plan = Project(Alias(expression, s"Optimized($expression)")() :: 
Nil, NoRelation)
-val optimizedPlan = Optimizer(plan)
+val optimizedPlan = SparkOptimizer(plan)
 super.checkEvaluation(optimizedPlan.expressions.head, expected, 
inputRow)
   }
 }
--- End diff --

Of course not. I'll add a new line.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-17 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-59590801
  
@chouqin  Sorry for the slow response!

About the RandomForestSuite failure: The change to fix the failure 
(maxBins) is OK with me.  It is a somewhat brittle test.  Good point about the 
first threshold being wasted.

About the histogram method’s speed: I would guess that the extra 
computation will not be that bad.  Even if maxBins grows, I would expect the 
runtime of the whole algorithm to slow down as well, and the number of samples 
is capped at 1.  I will run some tests though to make sure.

About the histogram method’s references: The PLANET paper uses 
“equidepth” histograms, citing the paper below.  Looking at that paper, 
“equidepth” means the same method which @manishamde implemented previously. 
 I will look into this a little more to see if I find a match for the method 
you implemented.
* PLANET paper: “PLANET: Massively Parallel Learning of Tree Ensembles 
with MapReduce”
* Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G. 
Lindsay. Random sampling techniques for space efficient online computation of 
order statistics of large datasets. In International Conference on ACM Special 
Interest Group on Management of Data (SIGMOD), pages 251–262, 1999.

I’ll make a pass now and add comments.
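
As a tiny illustration of what "equidepth" thresholds mean (this sketch is not
from the PR): sort the sampled feature values and take split candidates at
evenly spaced ranks, so that each bin holds roughly the same number of samples.

```scala
object EquidepthSketch {
  def candidateSplits(sample: Array[Double], numBins: Int): Array[Double] = {
    val sorted = sample.sorted
    // Thresholds at ranks n/numBins, 2n/numBins, ... so bins have roughly equal counts.
    (1 until numBins).map(i => sorted((i * sorted.length) / numBins)).distinct.toArray
  }

  def main(args: Array[String]): Unit = {
    val sample = Array(5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0)
    println(candidateSplits(sample, numBins = 4).mkString(", "))  // 3.0, 6.0, 8.0
  }
}
```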






[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2753#issuecomment-59590626
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21868/
Test PASSed.





[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2753#issuecomment-59590621
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21868/consoleFull) for PR 2753 at commit [`ccd4959`](https://github.com/apache/spark/commit/ccd49595e8d0a730489e577b1152ad67027a5687).
  * This patch **passes all tests**.
  * This patch merges cleanly.
  * This patch adds no public classes.





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-59589703
  
@manishamde  Sorry for the delay; the code is looking good!  I made some 
small comments inline.  My main overall comment is about specifying parameters. 
 How would it be if we started mimicking the coming API update (as much as 
possible)?  Parameter specification would work as follows:

In DecisionTree, add a static “defaultParams” method so users can 
construct a tree.Strategy instance without having to worry about importing 
Strategy (and remembering its name).  Likewise for GradientBoosting.

Change GradientBoostingStrategy to store tree params in a field 
weakLearnerParams: tree.Strategy

Here’s the use pattern I envision:

val treeParams = DecisionTree.defaultParams()
treeParams.maxDepth = ...
val boostingParams = GradientBoosting.defaultParams()
boostingParams.weakLearnerParams = treeParams
val model = GradientBoosting.train(myData, boostingParams)

This API should work for Scala and Python right away.  (Though a Python API 
can be another PR.)

For Java, this API should almost work; I believe the only issue will be 
setting fields which take special types (e.g., quantileCalculationStrategy and 
categoricalFeaturesInfo).  For those, there is a nice annotation you can use 
which will automatically add getParamName and setParamName methods for Java 
users to call, and you can override them as needed.  For the special params 
like categoricalFeaturesInfo, you can overload them with versions which take 
Java-friendly types (such as a Java map for categoricalFeaturesInfo and a 
string for quantileCalculationStrategy).  Here’s the BeanProperty doc:
[http://www.scala-lang.org/api/current/scala/beans/BeanProperty.html]

Does that sound reasonable?

Let me know when it’s ready for another pass and for testing.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049181
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/BaggedPoint.scala ---
@@ -73,7 +115,8 @@ private[tree] object BaggedPoint {
 }
   }
 
-  def convertToBaggedRDDWithoutSampling[Datum](input: RDD[Datum]): 
RDD[BaggedPoint[Datum]] = {
+  private[tree] def convertToBaggedRDDWithoutSampling[Datum]
+  (input: RDD[Datum]): RDD[BaggedPoint[Datum]] = {
--- End diff --

Put the opening parenthesis on the first line:
```
private[tree] def convertToBaggedRDDWithoutSampling[Datum](
input: RDD[Datum]): RDD[BaggedPoint[Datum]] = {
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049189
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+
+/**
+ * Class for log loss calculation (used for binary classification).
+ */
+object LogLoss extends Loss {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation for binary
+   * classification
+   * @param model Model of the weak learner
+   * @param point Instance of the training dataset
+   * @param learningRate Learning rate parameter for regularization
+   * @return Loss gradient
+   */
+  @DeveloperApi
+  override def lossGradient(
+ model: DecisionTreeModel,
+ point: LabeledPoint,
+ learningRate: Double): Double = {
+val prediction = model.predict(point.features)
--- End diff --

predict() will return the class.  We'll need a predictRaw() method.  
Perhaps that can be implemented in a separate, small PR first.
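
Just to sketch the idea (hypothetical names, not an existing API):

```
import org.apache.spark.mllib.linalg.Vector

// Hypothetical: predictRaw() would expose the underlying score (e.g. a margin
// for binary classification) while predict() keeps returning the class label.
trait HasRawPrediction {
  def predict(features: Vector): Double
  def predictRaw(features: Vector): Double
}
```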


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049198
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/GradientBoostingModel.scala
 ---
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.rdd.RDD
+
+
+class GradientBoostingModel(trees: Array[DecisionTreeModel], algo: Algo) 
extends Serializable {
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  def predict(features: Vector): Double = {
+trees.map(tree => tree.predict(features)).sum
--- End diff --

Could this be changed to match the predict() method for RandomForest?  
Currently, this is more of a predictRaw() method.  (Also, should this be mean, 
not sum?)
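
For illustration, one way the two could be separated (a sketch only, assuming the trees/algo fields of this class and the +/-1 label remapping used during training):

```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Hypothetical variant of the model, only to illustrate the raw-vs-class split.
class GradientBoostingModelSketch(trees: Array[DecisionTreeModel], algo: Algo)
  extends Serializable {

  // Raw additive score of the ensemble (what the current predict() computes).
  def predictRaw(features: Vector): Double =
    trees.map(_.predict(features)).sum

  // A predict() closer in spirit to RandomForestModel: threshold the raw score
  // for classification, return it unchanged for regression.
  def predict(features: Vector): Double = algo match {
    case Classification => if (predictRaw(features) > 0.0) 1.0 else 0.0
    case Regression => predictRaw(features)
  }
}
```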


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049194
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+
+/**
+ * Trait for adding "pluggable" loss functions for the gradient boosting 
algorithm
+ */
+trait Loss extends Serializable {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation
+   * @param model Model of the weak learner
+   * @param point Instance of the training dataset
+   * @param learningRate Learning rate parameter for regularization
+   * @return Loss gradient
+   */
+  @DeveloperApi
+  def lossGradient(
--- End diff --

Can we name this "gradient"?  (Calling Loss.lossGradient seems redundant)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049186
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LeastSquaresError.scala 
---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+
+/**
+ * Class for least squares error loss calculation.
+ */
+object LeastSquaresError extends Loss {
--- End diff --

Can this be named SquaredError?  ("least" is not really needed.)
Also, can the doc include a mathematical statement of the form of the error?
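
For example, the doc could state the standard form (nothing patch-specific, just the usual definition):

```
L(y, F(x)) = \bigl(y - F(x)\bigr)^2,
\qquad
\frac{\partial L}{\partial F(x)} = -2\,\bigl(y - F(x)\bigr)
```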


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049179
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/BaggedPoint.scala ---
@@ -47,20 +48,61 @@ private[tree] object BaggedPoint {
* Convert an input dataset into its BaggedPoint representation,
* choosing subsample counts for each instance.
* Each subsample has the same number of instances as the original 
dataset,
-   * and is created by subsampling with replacement.
+   * and is created by subsampling without replacement.
* @param input Input dataset.
+   * @param subsample Fraction of the training data used for learning 
decision tree.
* @param numSubsamples  Number of subsamples of this RDD to take.
-   * @param seed   Random seed.
+   * @param withReplacement Sampling with/without replacement.
* @return  BaggedPoint dataset representation
*/
   def convertToBaggedRDD[Datum](
   input: RDD[Datum],
+  subsample: Double,
   numSubsamples: Int,
-  seed: Int = Utils.random.nextInt()): RDD[BaggedPoint[Datum]] = {
+  withReplacement: Boolean): RDD[BaggedPoint[Datum]] = {
+if (withReplacement) {
+  convertToBaggedRDDSamplingWithReplacement(input, subsample, 
numSubsamples)
+} else {
+  if (numSubsamples == 1 && subsample == 1.0) {
+convertToBaggedRDDWithoutSampling(input)
+  } else {
+convertToBaggedRDDSamplingWithoutReplacement(input, subsample, 
numSubsamples)
+  }
+}
+  }
+
+  private[tree] def convertToBaggedRDDSamplingWithoutReplacement[Datum](
--- End diff --

space before (
(same issue elsewhere too)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049183
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LeastAbsoluteError.scala 
---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+
+/**
+ * Class for least absolute error loss calculation.
+ */
+object LeastAbsoluteError extends Loss {
--- End diff --

Can this be named AbsoluteError?  ("least" is not really needed.)
Also, can the doc include a mathematical statement of the form of the error?
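
Similarly, the standard form here would be:

```
L(y, F(x)) = \bigl|y - F(x)\bigr|,
\qquad
\frac{\partial L}{\partial F(x)} = -\operatorname{sign}\bigl(y - F(x)\bigr)
```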


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049168
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,480 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.apache.spark.SparkContext._
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.{QuantileStrategy, 
BoostingStrategy}
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.impurity.Impurities
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{GradientBoostingModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): GradientBoostingModel = {
+val strategy = boostingStrategy.strategy
--- End diff --

"strategy" is not used


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049191
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+
+/**
+ * Trait for adding "pluggable" loss functions for the gradient boosting 
algorithm
+ */
+trait Loss extends Serializable {
--- End diff --

Can this also include a "loss" or "compute" method?  That would allow 
tracking the actual objective in boosting (instead of just MSE as is done now).
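
A rough sketch of what that could look like (method names hypothetical):

```
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Hypothetical extension: keep the gradient, and add a method that evaluates
// the loss itself over a dataset so boosting can report its true objective.
trait Loss extends Serializable {
  def gradient(model: DecisionTreeModel, point: LabeledPoint, learningRate: Double): Double
  def computeError(model: DecisionTreeModel, data: RDD[LabeledPoint]): Double
}
```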


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049196
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/GradientBoostingModel.scala
 ---
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.rdd.RDD
+
+
+class GradientBoostingModel(trees: Array[DecisionTreeModel], algo: Algo) 
extends Serializable {
--- End diff --

Mark as @Experimental?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049170
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,480 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.apache.spark.SparkContext._
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.{QuantileStrategy, 
BoostingStrategy}
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.impurity.Impurities
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{GradientBoostingModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): GradientBoostingModel = {
+val strategy = boostingStrategy.strategy
+val algo = boostingStrategy.algo
+algo match {
+  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Classification =>
+val remappedInput = input.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =>
+throw new IllegalArgumentException(s"$algo is not supported by the 
gradient boosting.")
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting 
algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): GradientBoostingModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In 
other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient 
boosting.
+   * @param impurity Criterion used for information gain calculation.
+   * Supported for Classificati

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049176
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/BaggedPoint.scala ---
@@ -47,20 +48,61 @@ private[tree] object BaggedPoint {
* Convert an input dataset into its BaggedPoint representation,
* choosing subsample counts for each instance.
* Each subsample has the same number of instances as the original 
dataset,
-   * and is created by subsampling with replacement.
+   * and is created by subsampling without replacement.
* @param input Input dataset.
+   * @param subsample Fraction of the training data used for learning 
decision tree.
* @param numSubsamples  Number of subsamples of this RDD to take.
-   * @param seed   Random seed.
--- End diff --

I think we should keep this parameter.  That will allow reproducible 
results.
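
As a tiny illustration of the point (self-contained demo, not MLlib code): the same seed reproduces the same subsample, while a fresh random seed generally does not.

```
import scala.util.Random

object SeedDemo {
  // Deterministic Bernoulli subsample of the indices 0 until n.
  def subsampleIndices(n: Int, fraction: Double, seed: Int): Seq[Int] = {
    val rng = new Random(seed)
    (0 until n).filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    // Same seed, same subsample:
    println(subsampleIndices(10, 0.5, seed = 42) == subsampleIndices(10, 0.5, seed = 42)) // true
  }
}
```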


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049187
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+
+/**
+ * Class for log loss calculation (used for binary classification).
+ */
+object LogLoss extends Loss {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation for binary
+   * classification
+   * @param model Model of the weak learner
+   * @param point Instance of the training dataset
+   * @param learningRate Learning rate parameter for regularization
+   * @return Loss gradient
+   */
+  @DeveloperApi
+  override def lossGradient(
+ model: DecisionTreeModel,
--- End diff --

spacing (use 4-space indentation)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049173
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,480 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.apache.spark.SparkContext._
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.{QuantileStrategy, 
BoostingStrategy}
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.impurity.Impurities
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{GradientBoostingModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): GradientBoostingModel = {
+val strategy = boostingStrategy.strategy
+val algo = boostingStrategy.algo
+algo match {
+  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Classification =>
+val remappedInput = input.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =>
+throw new IllegalArgumentException(s"$algo is not supported by the 
gradient boosting.")
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting 
algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): GradientBoostingModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In 
other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient 
boosting.
+   * @param impurity Criterion used for information gain calculation.
+   * Supported for Classificati

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1547: Adding Gradient Boos...

2014-10-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19049169
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,480 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.apache.spark.SparkContext._
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.{QuantileStrategy, 
BoostingStrategy}
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.impurity.Impurities
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{GradientBoostingModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): GradientBoostingModel = {
+val strategy = boostingStrategy.strategy
+val algo = boostingStrategy.algo
+algo match {
+  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Classification =>
+val remappedInput = input.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =>
+throw new IllegalArgumentException(s"$algo is not supported by the 
gradient boosting.")
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting 
algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): GradientBoostingModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In 
other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient 
boosting.
+   * @param impurity Criterion used for information gain calculation.
+   * Supported for Classificati

[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19047629
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
+  }.mean
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at
+   * position n will be used. See the following paper for detail:
--- End diff --

Btw, there are several variants of the NDCG definition. We need to say in the 
doc which version we implement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2546] Clone JobConf for each task (bran...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2684#issuecomment-59585743
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21866/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2546] Clone JobConf for each task (bran...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2684#issuecomment-59585737
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21866/consoleFull)
 for   PR 2684 at commit 
[`f14f259`](https://github.com/apache/spark/commit/f14f25981f1b922f1a8d07dfd80774a78daec368).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2455#discussion_r19047174
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala ---
@@ -53,56 +89,238 @@ trait RandomSampler[T, U] extends Pseudorandom with 
Cloneable with Serializable
  * @tparam T item type
  */
 @DeveloperApi
-class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = 
false)
+class BernoulliPartitionSampler[T](lb: Double, ub: Double, complement: 
Boolean = false)
--- End diff --

Sounds good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2753#issuecomment-59584787
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21868/consoleFull)
 for   PR 2753 at commit 
[`ccd4959`](https://github.com/apache/spark/commit/ccd49595e8d0a730489e577b1152ad67027a5687).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046954
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
+  }.mean
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at
+   * position n will be used. See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated ndcg
+   * @return the average ndcg at the first k ranking positions
+   */
+  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) 
=>
+val labSet = lab.toSet
+val labSetSize = labSet.size
+val n = math.min(math.max(pred.length, labSetSize), k)
+var maxDcg = 0.0
+var dcg = 0.0
+var i = 0
+
+while (i < n) {
+  // Calculate 1/log2(i + 2)
--- End diff --

the comment doesn't provide any extra information


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046956
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
+  }.mean
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at
+   * position n will be used. See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated ndcg
+   * @return the average ndcg at the first k ranking positions
+   */
+  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) 
=>
+val labSet = lab.toSet
+val labSetSize = labSet.size
+val n = math.min(math.max(pred.length, labSetSize), k)
+var maxDcg = 0.0
+var dcg = 0.0
+var i = 0
+
+while (i < n) {
+  // Calculate 1/log2(i + 2)
+  val gain = math.log(2) / math.log(i + 2)
+  if (labSet.contains(pred(i))) {
+dcg += gain
+  }
+  if (i < labSetSize) {
+maxDcg += gain
+  }
+  i += 1
+}
+dcg / maxDcg
--- End diff --

`maxDcg` could be zero. Please add a test for this corner case.
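
A rough sketch of such a test (assuming a ScalaTest suite with a SparkContext `sc` available, as in the other MLlib suites; the expected behavior for this case — e.g. returning 0.0 rather than NaN — is a choice to make in the fix):

```
test("ndcgAt handles an empty ground-truth set (maxDcg == 0)") {
  val metrics = new RankingMetrics(sc.parallelize(
    Seq((Array(1, 2, 3), Array.empty[Int])), 2))
  // With an empty label set, maxDcg stays 0; the metric should be well defined
  // (for example 0.0) rather than NaN from a 0/0 division.
  assert(!metrics.ndcgAt(3).isNaN)
}
```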


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046949
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
+  }.mean
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at
+   * position n will be used. See the following paper for detail:
--- End diff --

ditto: what if label set contains less than k items? Need doc.





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046952
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
+  }.mean
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at
+   * position n will be used. See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated ndcg
+   * @return the average ndcg at the first k ranking positions
+   */
+  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) 
=>
+val labSet = lab.toSet
--- End diff --

check `k > 0`
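
A tiny sketch of the requested guard (the placement at the top of `ndcgAt`/`precisionAt` and the message wording are only illustrative):

```scala
// Illustrative only: fail fast on the driver when k is not positive,
// before any distributed computation is kicked off.
object RankingPositionCheck {
  def checkK(k: Int): Unit =
    require(k > 0, s"ranking position k should be positive but got $k")

  def main(args: Array[String]): Unit = {
    checkK(5)    // fine
    // checkK(0) // would throw IllegalArgumentException
  }
}
```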





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046955
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
+  }.mean
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at
+   * position n will be used. See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated ndcg
+   * @return the average ndcg at the first k ranking positions
+   */
+  def ndcgAt(k: Int): Double = predictionAndLabels.map { case (pred, lab) 
=>
+val labSet = lab.toSet
+val labSetSize = labSet.size
+val n = math.min(math.max(pred.length, labSetSize), k)
+var maxDcg = 0.0
+var dcg = 0.0
+var i = 0
+
+while (i < n) {
+  // Calculate 1/log2(i + 2)
+  val gain = math.log(2) / math.log(i + 2)
--- End diff --

`math.log(2)` -> `1.0`
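
The simplification works because `ndcgAt` returns the ratio `dcg / maxDcg`, so a constant factor of `math.log(2)` multiplies both terms and cancels. A standalone per-query sketch with the change applied (not the actual patch; it also guards the array access, since `n` can exceed `pred.length` when the ground-truth set is larger than the prediction list):

```scala
// Per-query NDCG with the suggested change: 1.0 / math.log(i + 2) is
// proportional to 1 / log2(i + 2), and the constant cancels in dcg / maxDcg.
def ndcgOfOneQuery[T](pred: Array[T], lab: Array[T], k: Int): Double = {
  val labSet = lab.toSet
  val labSetSize = labSet.size
  val n = math.min(math.max(pred.length, labSetSize), k)
  var maxDcg = 0.0
  var dcg = 0.0
  var i = 0
  while (i < n) {
    val gain = 1.0 / math.log(i + 2)
    // Guard the index: n may exceed pred.length if there are more relevant
    // documents than returned results.
    if (i < pred.length && labSet.contains(pred(i))) {
      dcg += gain
    }
    if (i < labSetSize) {
      maxDcg += gain
    }
    i += 1
  }
  // Still undefined when the ground-truth set is empty (maxDcg == 0), as
  // noted elsewhere in this review.
  dcg / maxDcg
}
```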





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046944
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
--- End diff --

Please check `k > 0`.





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046948
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
+   * the precision value will be computed as #(relevant items retrieved) / 
k.
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = predictionAndLabels.map { case (pred, 
lab) =>
+val labSet = lab.toSet
+val n = math.min(pred.length, k)
+var i = 0
+var cnt = 0
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+  }
+  i += 1
+}
+cnt.toDouble / k
+  }.mean
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries
+   */
+  lazy val meanAveragePrecision: Double = predictionAndLabels.map { case 
(pred, lab) =>
+val labSet = lab.toSet
+var i = 0
+var cnt = 0
+var precSum = 0.0
+val n = pred.length
+
+while (i < n) {
+  if (labSet.contains(pred(i))) {
+cnt += 1
+precSum += cnt.toDouble / (i + 1)
+  }
+  i += 1
+}
+precSum / labSet.size
--- End diff --

`labSet` could be empty.
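
A per-query sketch of one possible handling (returning 0.0 for an empty ground-truth set is an assumption; the PR might instead log a warning or throw):

```scala
// Average precision for a single query, guarded against an empty label set
// so that precSum / labSet.size never divides by zero.
def averagePrecisionOfOneQuery[T](pred: Array[T], lab: Array[T]): Double = {
  val labSet = lab.toSet
  if (labSet.isEmpty) {
    0.0 // empty ground-truth set: average precision is not well defined
  } else {
    var i = 0
    var cnt = 0
    var precSum = 0.0
    while (i < pred.length) {
      if (labSet.contains(pred(i))) {
        cnt += 1
        precSum += cnt.toDouble / (i + 1)
      }
      i += 1
    }
    precSum / labSet.size
  }
}
```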





[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19046940
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])]) {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   * If for a query, the ranking algorithm returns n (n < k) results,
--- End diff --

What if the label set contains less than k items? It is worth documenting.
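
One consequence worth spelling out: because the denominator is always `k`, a query whose ground-truth set has fewer than k items can never reach precision 1.0 at k. A small illustration (the local master and app name are arbitrary):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.RankingMetrics

object PrecisionAtExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "PrecisionAtExample")
    // Perfect ranking, but only two relevant items exist for this query.
    val data = sc.parallelize(Seq(
      (Array(1, 2, 3, 4, 5), Array(1, 2))))
    val metrics = new RankingMetrics(data)
    println(metrics.precisionAt(5)) // 0.4, not 1.0, because the denominator is k = 5
    sc.stop()
  }
}
```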





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59584036
  
@pwendell If you get a chance, PTAL





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19046865
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -911,32 +911,15 @@ abstract class RDD[T: ClassTag](
   }
 
   /**
-   * Return the count of each unique value in this RDD as a map of (value, 
count) pairs. The final
-   * combine step happens locally on the master, equivalent to running a 
single reduce task.
+   * Return the count of each unique value in this RDD as a local map of 
(value, count) pairs.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.map(x => (x, 
1)).reduceByKey(_ + _), which
+   * returns an RDD[(T, Long)] instead of a map.
*/
   def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = {
-if (elementClassTag.runtimeClass.isArray) {
-  throw new SparkException("countByValue() does not support arrays")
--- End diff --

Note that we still don't support arrays, but this is caught by combineByKey:
`org.apache.spark.SparkException: Cannot use map-side combining with array 
keys.`
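
For reference, a small sketch of the alternative that the new scaladoc points to (spark-shell style; assumes an existing `SparkContext` named `sc`):

```scala
import org.apache.spark.SparkContext._ // implicit conversions for reduceByKey

val words = sc.parallelize(Seq("a", "b", "a", "c", "a"))

// Fine when the result set is small: materializes a Map on the driver.
val local: scala.collection.Map[String, Long] = words.countByValue()

// Preferable for many distinct values: the counts stay distributed.
val distributed = words.map(x => (x, 1L)).reduceByKey(_ + _) // RDD[(String, Long)]

// Array keys remain unsupported; with array elements this now fails inside
// combineByKey with "Cannot use map-side combining with array keys."
```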





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59583927
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21867/consoleFull)
 for   PR 2839 at commit 
[`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/2839

[SPARK-3994] Use standard Aggregator code path for countByKey and 
countByValue

See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more 
information. Also adds
a note which warns against using these methods.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark countByKey

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2839.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2839


commit e1f06d3c550e5419bb3b20745da7cbfdedbb2841
Author: Aaron Davidson 
Date:   2014-10-17T22:11:25Z

[SPARK-3994] Use standard Aggregator code path for countByKey and 
countByValue

See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more 
information. Also adds
a note which warns against using these methods.







[GitHub] spark pull request: [SPARK-3934] [SPARK-3918] [mllib] Bug fixes fo...

2014-10-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2785





[GitHub] spark pull request: [SPARK-3934] [SPARK-3918] [mllib] Bug fixes fo...

2014-10-17 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2785#issuecomment-59582528
  
LGTM. Merged into master. Thanks!





[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

2014-10-17 Thread ziky90
Github user ziky90 commented on the pull request:

https://github.com/apache/spark/pull/2836#issuecomment-59582401
  
Ok thank you. Now I can see it.

Based on this, I also think it would take much more effort than I 
previously thought to execute the bootstrap script in a robust way (it 
would probably require implementing another ssh method). 





[GitHub] spark pull request: [SPARK-2546] Clone JobConf for each task (bran...

2014-10-17 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/2684#issuecomment-59581783
  
More flavor on the perf numbers: we ran 6 jobs in a row before and after 
(starting up a new driver on each job), discarded the first run, and took the 
average of the remaining five.

Pre-patch the times were ~1m50s, post-patch they were ~2m1s.





[GitHub] spark pull request: [SPARK-3985] [Examples] fix file path using os...

2014-10-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2834





[GitHub] spark pull request: [SPARK-3985] [Examples] fix file path using os...

2014-10-17 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2834#issuecomment-59581146
  
LGTM.  Thanks for catching this!





[GitHub] spark pull request: [SPARK-3952] [Streaming] [PySpark] add Python ...

2014-10-17 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2808#issuecomment-59581079
  
Found one more issue (sorry, hopefully this is the last one):


![image](https://cloud.githubusercontent.com/assets/50748/4686555/40c48ce0-5647-11e4-99b5-45f033218396.png)





[GitHub] spark pull request: [SPARK-3985] [Examples] fix file path using os...

2014-10-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2834#discussion_r19045505
  
--- Diff: examples/src/main/python/sql.py ---
@@ -48,7 +48,7 @@
 
 # A JSON dataset is pointed to by path.
 # The path can be either a single text file or a directory storing 
text files.
-path = os.environ['SPARK_HOME'] + 
"examples/src/main/resources/people.json"
+path = os.path.join(os.environ['SPARK_HOME'], 
"examples/src/main/resources/people.json")
--- End diff --

I think the `os.path.join` is fine; I think it's more clear / less brittle. 
 If it turns out that we do need to make changes to the path separator in order 
to support Windows, then using `os.path.join` makes it easier to spot where 
path construction is taking place in order to find the code that needs to be 
changed.





[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

2014-10-17 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2836#issuecomment-59580682
  
> Could you please give me an example if you see some option where possibly 
might be the problem.

If you look at the documentation of the `ssh` function, it says "Run a 
command on a host through SSH, retrying up to five times _and then throwing an 
exception if ssh continues to fail_".  If a library can't be installed, I think 
that the current code will cause `spark-ec2` to exit rather than continuing in 
a best-effort manner.





[GitHub] spark pull request: [SPARK-3985] [Examples] fix file path using os...

2014-10-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2834#discussion_r19045058
  
--- Diff: examples/src/main/python/sql.py ---
@@ -48,7 +48,7 @@
 
 # A JSON dataset is pointed to by path.
 # The path can be either a single text file or a directory storing 
text files.
-path = os.environ['SPARK_HOME'] + 
"examples/src/main/resources/people.json"
+path = os.path.join(os.environ['SPARK_HOME'], 
"examples/src/main/resources/people.json")
--- End diff --

Consistency is good too. Leave it unless someone else thinks it should 
change. 





[GitHub] spark pull request: [SPARK-3985] [Examples] fix file path using os...

2014-10-17 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/2834#discussion_r19045017
  
--- Diff: examples/src/main/python/sql.py ---
@@ -48,7 +48,7 @@
 
 # A JSON dataset is pointed to by path.
 # The path can be either a single text file or a directory storing 
text files.
-path = os.environ['SPARK_HOME'] + 
"examples/src/main/resources/people.json"
+path = os.path.join(os.environ['SPARK_HOME'], 
"examples/src/main/resources/people.json")
--- End diff --

I was just trying to make this consistent with other code in 
python/pyspark/**.py. It is OK if you believe adding a '/' is better.





[GitHub] spark pull request: [SPARK-2546] Clone JobConf for each task (bran...

2014-10-17 Thread frydawg524
Github user frydawg524 commented on the pull request:

https://github.com/apache/spark/pull/2684#issuecomment-59578918
  
@JoshRosen, 
Awesome! Thanks for helping out with this. I'll make sure that this gets 
broadcasted to my team. 

Zach





[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2838#issuecomment-59578891
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21865/
Test FAILed.





[GitHub] spark pull request: [SPARK-3916] [Streaming] discover new appended...

2014-10-17 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2806#issuecomment-59578404
  
@tdas Could you help review this? The failed tests run stably locally; 
I'm investigating it.





[GitHub] spark pull request: [SPARK-2546] Clone JobConf for each task (bran...

2014-10-17 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2684#issuecomment-59578207
  
@frydawg524 Thanks for testing this out!  I'm glad to hear that it solves 
the bug.

I just pushed a new commit which adds a configuration option 
(`spark.hadoop.cloneConf`) for controlling whether to clone the configuration 
(as in the patch you tested) or share a single configuration object across all 
tasks (the old code).  The reasoning for this is that releasing 1.1.1 and 1.0.3 
patches that cause measurable performance regressions will upset users who 
weren't affected by this issue.  In 1.2, we may revisit this by seeing if we 
can find ways to make the cloning process faster.

I also plan to open an upstream ticket with Hadoop.  That won't solve the 
problem for Spark users who might be stuck using older Hadoop versions (so we 
still need our own workaround), but it would be nice to see this eventually get 
fixed upstream.
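
For anyone following along, a sketch of how the option would be turned on once this lands (the key comes from the comment above; assuming `true` selects the per-task cloning behavior, and the master/app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Opt back into cloning the Hadoop configuration for each task.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("clone-jobconf-example")
  .set("spark.hadoop.cloneConf", "true")

val sc = new SparkContext(conf)
// ... run the Hadoop-input jobs that previously hit the shared-JobConf issue ...
sc.stop()
```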





[GitHub] spark pull request: [SPARK-2546] Clone JobConf for each task (bran...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2684#issuecomment-59578047
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21866/consoleFull)
 for   PR 2684 at commit 
[`f14f259`](https://github.com/apache/spark/commit/f14f25981f1b922f1a8d07dfd80774a78daec368).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3855][SQL] Preserve the result attribut...

2014-10-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2717





[GitHub] spark pull request: [SPARK-3855][SQL] Preserve the result attribut...

2014-10-17 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2717#issuecomment-59576965
  
Yes, please do.
On Oct 17, 2014 5:10 PM, "Patrick Wendell"  wrote:

> I'd like to pull this in - is that alright @marmbrus
> ?
>




