[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...

2014-09-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1903





[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...

2014-09-18 Thread ankurdave
Github user ankurdave commented on the pull request:

https://github.com/apache/spark/pull/1903#issuecomment-56140430
  
Thanks! Merged into master and branch-1.1.





[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...

2014-09-18 Thread ravipesala
GitHub user ravipesala opened a pull request:

https://github.com/apache/spark/pull/2456

[SPARK-3536][SQL] SELECT on empty parquet table throws exception

Parquet returns null metadata when querying an empty parquet file while calculating splits, so a null check was added to return empty splits instead.
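A minimal sketch of the described guard, using placeholder names rather than the actual Parquet input-format types:

```scala
// Minimal sketch of the guard described above. `M` stands in for Parquet's
// global metadata type; a null value (empty table) yields empty splits.
def splitsOrEmpty[M <: AnyRef, S](metadata: M, computeSplits: M => Seq[S]): Seq[S] =
  Option(metadata).map(computeSplits).getOrElse(Seq.empty)
```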

Author : ravipesala ravindra.pes...@huawei.com

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/spark SPARK-3536

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2456


commit 1e81a50631b1f44ad7de65b83408a40218234745
Author: ravipesala 
Date:   2014-09-18T18:02:46Z

Fixed the issue when querying on empty parquet file.







[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-18 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2444#discussion_r17769855
  
--- Diff: .gitignore ---
@@ -19,6 +19,7 @@ conf/*.sh
 conf/*.properties
 conf/*.conf
 conf/*.xml
+conf/slaves
--- End diff --

So, this file will not be edited? The user should point to another slave list file via the SPARK_SLAVES variable, right?





[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-18 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/2444#discussion_r17769822
  
--- Diff: sbin/slaves.sh ---
@@ -67,20 +69,26 @@ fi
 
 if [ "$HOSTLIST" = "" ]; then
   if [ "$SPARK_SLAVES" = "" ]; then
-export HOSTLIST="${SPARK_CONF_DIR}/slaves"
+if [ -f "${SPARK_CONF_DIR}/slaves" ]; then
+  HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"`
+else
+  HOSTLIST=localhost
+fi
   else
-export HOSTLIST="${SPARK_SLAVES}"
+HOSTLIST=`cat "${SPARK_SLAVES}"`
--- End diff --

This is so HOSTLIST holds a list of hosts rather than a file path, and so localhost can be used as the default host list entry.





[GitHub] spark pull request: [SPARK-3547]Using a special exit code instead ...

2014-09-18 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2421#issuecomment-56137301
  
@liancheng Ah, I've seen Jenkins ignore us sometimes... but recently he's been friendly :D





[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...

2014-09-18 Thread ankurdave
Github user ankurdave commented on the pull request:

https://github.com/apache/spark/pull/2446#issuecomment-56137274
  
Thanks! ok to test





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-18 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-56136714
  
@derrickburns I cannot see the Jenkins log. Let's call Jenkins again.

test this please





[GitHub] spark pull request: [MLLIB] fix a unresolved reference variable 'n...

2014-09-18 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2423#issuecomment-56136584
  
@OdinLin Thanks for catching the bug! As @davies mentioned, #2378 will 
completely replace the current SerDe. Could you close this PR?





[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56136476
  
test this please





[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2294





[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-18 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-56136224
  
LGTM. I'm merging this into master. (We might need to make slight changes 
to some methods before the 1.2 release, but let's not block the multi-model 
training PR for now.)





[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56135962
  
Philosophically, I agree with @erikerlandson about it being OK for random generators to be, well, random.  If problems are caused by the output of a randomized process not being reproducible, then the output probably isn't being used/tested correctly.

Practically, I second @mengxr in saying we should encourage reproducibility 
by requiring numpy in MLlib.  But avoiding it where possible sounds good, 
assuming the performance hit is not too bad.





[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2455#discussion_r17769391
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala ---
@@ -43,66 +46,218 @@ trait RandomSampler[T, U] extends Pseudorandom with 
Cloneable with Serializable
 throw new NotImplementedError("clone() is not implemented.")
 }
 
+
+object RandomSampler {
+  // Default random number generator used by random samplers
+  def rngDefault: Random = new XORShiftRandom
+
+  // Default gap sampling maximum
+  // For sampling fractions <= this value, the gap sampling optimization 
will be applied.
+  // Above this value, it is assumed that "traditional" bernoulli sampling 
is faster.  The 
+  // optimal value for this will depend on the RNG.  More expensive RNGs 
will tend to make
+  // the optimal value higher.  The most reliable way to determine this 
value for a given RNG
+  // is to experiment.  I would expect a value of 0.5 to be close in most 
cases.
+  def gsmDefault: Double = 0.4
+
+  // Default gap sampling epsilon
+  // When sampling random floating point values the gap sampling logic 
requires value > 0.  An
+  // optimal value for this parameter is at or near the minimum positive 
floating point value 
+  // returned by nextDouble() for the RNG being used.
+  def epsDefault: Double = 5e-11
+}
+
+
 /**
  * :: DeveloperApi ::
  * A sampler based on Bernoulli trials.
  *
- * @param lb lower bound of the acceptance range
- * @param ub upper bound of the acceptance range
- * @param complement whether to use the complement of the range specified, 
default to false
+ * @param fraction the sampling fraction, aka Bernoulli sampling 
probability
  * @tparam T item type
  */
 @DeveloperApi
-class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = 
false)
-  extends RandomSampler[T, T] {
+class BernoulliSampler[T: ClassTag](fraction: Double) extends 
RandomSampler[T, T] {
 
-  private[random] var rng: Random = new XORShiftRandom
+  require(fraction >= 0.0  &&  fraction <= 1.0, "Sampling fraction must be 
on interval [0, 1]")
 
-  def this(ratio: Double) = this(0.0d, ratio)
+  def this(lb: Double, ub: Double, complement: Boolean = false) =
+this(if (complement) (1.0 - (ub - lb)) else (ub - lb))
+
+  private val rng: Random = RandomSampler.rngDefault
 
   override def setSeed(seed: Long) = rng.setSeed(seed)
 
   override def sample(items: Iterator[T]): Iterator[T] = {
-items.filter { item =>
-  val x = rng.nextDouble()
-  (x >= lb && x < ub) ^ complement
+fraction match {
+  case f if (f <= 0.0) => Iterator.empty
+  case f if (f >= 1.0) => items
+  case f if (f <= RandomSampler.gsmDefault) =>
+new GapSamplingIterator(items, f, rng, RandomSampler.epsDefault)
+  case _ => items.filter(_ => (rng.nextDouble() <= fraction))
--- End diff --

Did you test whether `rdd.randomSplit()` will produce non-overlapping 
subsets with this change?
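A minimal way to exercise the property being asked about, assuming an existing SparkContext `sc` (an illustrative check, not part of the patch):

```scala
// Illustrative check: with a fixed seed, randomSplit's pieces should
// partition the RDD with no overlap.
val rdd = sc.parallelize(1 to 100000)
val Array(a, b) = rdd.randomSplit(Array(0.3, 0.7), 42L)
assert(a.intersection(b).count() == 0)         // subsets do not overlap
assert(a.count() + b.count() == rdd.count())   // every element lands in exactly one split
```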





[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56135525
  
@JoshRosen PySpark/MLlib requires NumPy to run, and I don't think we 
claimed that we support different versions of NumPy.

`sample()` in core is different. Maybe we can test the overhead of using 
Python's own random. The overhead may not be huge, due to 
serialization/deserialization.






[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17769270
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r
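For reference, the gemm contract documented in the quoted Javadoc, C := alpha * A * B + beta * C, can be illustrated against plain column-major arrays (an illustrative helper, not MLlib's private BLAS API):

```scala
// Reference implementation of C := alpha * A * B + beta * C over plain
// column-major arrays: A is m x k, B is k x n, C is m x n (updated in place).
def gemmReference(m: Int, n: Int, k: Int, alpha: Double, a: Array[Double],
                  b: Array[Double], beta: Double, c: Array[Double]): Unit = {
  for (j <- 0 until n; i <- 0 until m) {
    var sum = 0.0
    for (p <- 0 until k) sum += a(p * m + i) * b(j * k + p)
    c(j * m + i) = alpha * sum + beta * c(j * m + i)
  }
}
```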

[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1486#issuecomment-56135114
  
I only had a few minor comments about documentation while trying to do a 
quick read-through of this patch. No substantive comments.





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17769069
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: 
Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + 
executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: 
String,
+val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: 
String)
+extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) 
extends TaskLocation(host) {
+  override def toString = host
 }
 
 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, 
Some(executorId))
+  // We identify hosts on which the block is cached with this prefix.  
Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there 
should be no potential for
+  // confusion.  See  RFC 952 and RFC 1123 for information about the 
format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new 
ExecutorCacheTaskLocation(host, executorId)
 
-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
--- End diff --

The contract of this method is kinda sketchy -- taking in a "str" which is 
either a host name or a tag. Would you mind adding a bit of Javadoc to explain 
that this is what is happening?
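For context, a hedged sketch of the contract being asked about, with illustrative names only (the string is either a bare hostname or the cache-tagged form used elsewhere in this diff):

```scala
// Illustrative sketch only: the argument is either a bare hostname
// ("host1.example.com") or the tagged form ("_hdfs_cache_host1.example.com")
// marking an HDFS-cached location. Returns (host, isHdfsCached).
def parseTaskLocation(str: String): (String, Boolean) = {
  val tag = "_hdfs_cache_"  // mirrors in_memory_location_tag in the diff
  if (str.startsWith(tag)) (str.stripPrefix(tag), true) else (str, false)
}
```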





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17769055
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: 
Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + 
executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: 
String,
+val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: 
String)
+extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) 
extends TaskLocation(host) {
+  override def toString = host
 }
 
 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, 
Some(executorId))
+  // We identify hosts on which the block is cached with this prefix.  
Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there 
should be no potential for
+  // confusion.  See  RFC 952 and RFC 1123 for information about the 
format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
+
+  def apply(host: String, executorId: String) = new 
ExecutorCacheTaskLocation(host, executorId)
 
-  def apply(host: String) = new TaskLocation(host, None)
+  def apply(str: String) = {
+if (str.startsWith(in_memory_location_tag)) {
+  new 
HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length))
--- End diff --

nit: `str.stripPrefix(in_memory_location_tag)`
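As a quick illustration of the nit (inside the `startsWith` guard the behavior is identical; `stripPrefix` just reads more directly):

```scala
"_hdfs_cache_host1".stripPrefix("_hdfs_cache_")  // "host1"
"host2".stripPrefix("_hdfs_cache_")              // "host2" (no prefix, left unchanged)
```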





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17769053
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: 
Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + 
executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: 
String,
+val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: 
String)
+extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) 
extends TaskLocation(host) {
+  override def toString = host
 }
 
 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, 
Some(executorId))
+  // We identify hosts on which the block is cached with this prefix.  
Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there 
should be no potential for
+  // confusion.  See  RFC 952 and RFC 1123 for information about the 
format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

Also, nit: could you use camel case: `inMemoryLocationTag`, or all caps 
with underscores if you prefer.





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17768989
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
   f(inputSplit, firstParent[T].iterator(split, context))
 }
   }
+
+  private[spark] class SplitInfoReflections {
+val inputSplitWithLocationInfo =
+  Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+val getLocationInfo = 
inputSplitWithLocationInfo.getMethod("getLocationInfo")
+val newInputSplit = 
Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+val splitLocationInfo = 
Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+val isInMemory = splitLocationInfo.getMethod("isInMemory")
+val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
+Some(new SplitInfoReflections)
+  } catch {
+case e: Exception =>
+  logDebug("SplitLocationInfo and other new Hadoop classes are " +
+  "unavailable. Using the older Hadoop location info code.", e)
+  None
+  }
+
+  private[spark] def convertSplitLocationInfo(infos: Array[AnyRef]) 
:Seq[String] = {
--- End diff --

nit: '): Seq[String]`





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17768962
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
--- End diff --

Would you mind beefing up the documentation here a bit? I am having trouble 
reading through and quickly finding out the difference between HostTaskLocation 
and ExecutorCacheTaskLocation. I guess the latter is exclusively used for the 
BlockManager cache, but it would be good to be explicit.





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17768928
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: 
Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + 
executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: 
String,
+val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: 
String)
+extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) 
extends TaskLocation(host) {
--- End diff --

Minor, but `override val` on something that exports the same parameter is kinda weird; I think this could be cleaned up just slightly by making TaskLocation a trait with a `def host: String` instead. Then this still works and is the sole implementation.
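For reference, a quick sketch of the trait-based shape being suggested (illustrative only, not the merged code):

```scala
// Illustrative shape only: TaskLocation as a trait with `def host`, so each
// case class supplies host directly without `override val` on a parameter.
sealed trait TaskLocation {
  def host: String
}

case class ExecutorCacheTaskLocation(host: String, executorId: String) extends TaskLocation

case class HostTaskLocation(host: String) extends TaskLocation {
  override def toString: String = host
}
```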





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1486#issuecomment-5619
  
Yes, this appears to be an issue with our checker and adding an exclusion 
is fine for now. The class is private.

Just had really minor comments and I can address them on merge if you want. 
This is looking good to me. Any other changes or is this good from your side?





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17768506
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: 
Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + 
executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: 
String,
+val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: 
String)
+extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) 
extends TaskLocation(host) {
+  override def toString = host
 }
 
 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, 
Some(executorId))
+  // We identify hosts on which the block is cached with this prefix.  
Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there 
should be no potential for
+  // confusion.  See  RFC 952 and RFC 1123 for information about the 
format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

Could you drop the leading `_` here to make it consistent with BlockId? Having a trailing underscore seems sufficient to distinguish it from a real hostname.





[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-18 Thread erikerlandson
GitHub user erikerlandson opened a pull request:

https://github.com/apache/spark/pull/2455

[SPARK-3250] Implement Gap Sampling optimization for random sampling

More efficient sampling, based on Gap Sampling optimization:

http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
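For readers following along, a hedged standalone sketch of the gap-sampling idea from the linked post (illustrative only, not the code in this PR): instead of drawing one uniform variate per element, draw the size of the gap to the next accepted element from a geometric distribution with success probability `f`.

```scala
import scala.util.Random

// Sketch of gap sampling for a Bernoulli fraction f in (0, 1): skip a
// geometrically distributed number of elements, then accept the next one.
def gapSample[T](items: Iterator[T], f: Double, rng: Random = new Random): Iterator[T] = {
  require(f > 0.0 && f < 1.0, "fraction must be in (0, 1)")
  new Iterator[T] {
    private var current: Option[T] = None
    private def advance(): Unit = {
      // gap = floor(log(u) / log(1 - f)) has P(gap = k) = (1 - f)^k * f;
      // the small floor on u guards against log(0).
      var toSkip = (math.log(rng.nextDouble().max(1e-12)) / math.log(1.0 - f)).toInt
      while (toSkip > 0 && items.hasNext) { items.next(); toSkip -= 1 }
      current = if (items.hasNext) Some(items.next()) else None
    }
    advance()
    override def hasNext: Boolean = current.isDefined
    override def next(): T = { val out = current.get; advance(); out }
  }
}

// Example: roughly 1% of the input survives.
// gapSample((1 to 1000000).iterator, 0.01).size
```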


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/erikerlandson/spark spark-3250-pr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2455.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2455


commit 82f461f95dc18ff44e0cd6f2e4784d3557ad3d0f
Author: Erik Erlandson 
Date:   2014-09-18T22:26:49Z

[SPARK-3250] Implement Gap Sampling optimization for random sampling







[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17768479
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala 
---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that 
executorID, but our next level
  * of preference will be executors on the same host if this is not 
possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: 
Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + 
executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: 
String,
+val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: 
String)
--- End diff --

should this be `HDFSCacheTaskLocation` to be consistent with 
`ExecutorCacheTaskLocation`?





[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1486#discussion_r17768467
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
   f(inputSplit, firstParent[T].iterator(split, context))
 }
   }
+
+  private[spark] class SplitInfoReflections {
+val inputSplitWithLocationInfo =
+  Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+val getLocationInfo = 
inputSplitWithLocationInfo.getMethod("getLocationInfo")
+val newInputSplit = 
Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+val splitLocationInfo = 
Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+val isInMemory = splitLocationInfo.getMethod("isInMemory")
+val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
--- End diff --

did you decide you'd prefer not to do this?





[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...

2014-09-18 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2440#issuecomment-56132627
  
Removing this sounds good to me too.  Will upload a patch.  I think a 
measure of how long a task spends in shuffle would be useful though, as it 
helps users understand whether task slowness is caused by their code or the 
framework.  (I've found a similar metric useful when trying to debug MR 
performance issues.)





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-18 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-56132497
  
These changes look good to me.  This addresses what continues to be the #1 
issue that we see in Cloudera customer YARN deployments.  It's worth 
considering boosting this when using PySpark, but that's probably work for 
another JIRA.





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-18 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-56132524
  
@nishkamravi2 mind resolving the merge conflicts?





[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2305#issuecomment-56130009
  
> This patch fails unit tests.

i'm getting HTTP 503 from jenkins, but i'm gonna go out on a limb and say 
this doc change didn't break the unit tests.





[GitHub] spark pull request: [YARN] SPARK-2668: Add variable of yarn log di...

2014-09-18 Thread renozhang
Github user renozhang commented on the pull request:

https://github.com/apache/spark/pull/1573#issuecomment-56129984
  
Sorry @tgravescs, I've been very busy these days; I'll address them this weekend.





[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56129891
  
that's a very good point, especially about how it's an unsolved problem in 
general, at least on our existing operating systems. iirc, systems like plan9 
tried to address complete reproducibility, but i may be misremembering the 
specifics.

the four stated cases are:
 a) driver w/ numpy, worker w/ numpy - numpy used, no message emitted
 b) driver w/ numpy, worker w/o numpy - numpy used on driver, not used on 
workers, warning emitted
 c) driver w/o numpy, worker w/ numpy - numpy not used on driver nor 
worker, no message emitted
 d) driver w/o numpy, worker w/o numpy - numpy not used on driver nor 
worker, no message emitted

case (a) is not a concern because numpy is used consistently throughout
case (b) is not a concern because python random is used consistently throughout the workers
cases (c) and (d) are not a concern because python's random module is used throughout

however, there's a fifth case:
 e) driver w/ numpy, some workers w/ numpy, some workers w/o numpy

there's actually a sixth case, but it's intractable for spark and shouldn't 
be considered: different implementations of python random or numpy's random 
across workers. this is something that should be managed outside of spark.

in (e), some workers will use numpy and others will use random. previously, 
all workers w/o numpy would error out, potentially terminating the computation. 
now, a warning will be emitted (though it'll be emitted to /dev/null) and 
execution will complete.

i'd solve this with a systems approach: remove the python random code and 
require numpy to be present, or remove the numpy code. and, i'd lean toward 
using the faster code (numpy). however, that might not be palatable for the 
project. if it is, i'm more than happy to scrap this ticket and create another to simplify the RDDSampler.

as i see it, to proceed we evaluate -

```
if acceptable to require numpy:
    matt.dothat()
else:
    if acceptable to potentially compromise re-computability w/ warning:
        commit this
    else:
        scrap this
```

(i've left out the case where we decide to simplify by always using the slower python code, because i'd rather not trade off performance to avoid an error message, and i think adding a numpy dep is straightforward)





[GitHub] spark pull request: [SPARK-3599]Avoid loading properties file frequ...

2014-09-18 Thread WangTaoTheTonic
GitHub user WangTaoTheTonic opened a pull request:

https://github.com/apache/spark/pull/2454

[SPARK-3599]Avoid loading properties file frequently

https://issues.apache.org/jira/browse/SPARK-3599

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WangTaoTheTonic/spark avoidLoadingFrequently

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2454


commit 2a79f26497f9232465aa2e9b496b0d54b9ccda75
Author: WangTaoTheTonic 
Date:   2014-09-19T01:52:37Z

Avoid loading properties file frequently







[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs

2014-09-18 Thread viper-kun
Github user viper-kun commented on the pull request:

https://github.com/apache/spark/pull/2391#issuecomment-56129613
  
@andrewor14 I have checked Hadoop's JobHistoryServer; it is the JobHistoryServer's responsibility to delete the application logs.





[GitHub] spark pull request: [SPARK-3589][Minor]remove redundant code

2014-09-18 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on a diff in the pull request:

https://github.com/apache/spark/pull/2445#discussion_r17766944
  
--- Diff: bin/spark-class ---
@@ -169,7 +169,6 @@ if [ -n "$SPARK_SUBMIT_BOOTSTRAP_DRIVER" ]; then
   # This is used only if the properties file actually contains these 
special configs
   # Export the environment variables needed by 
SparkSubmitDriverBootstrapper
   export RUNNER
-  export CLASSPATH
--- End diff --

ok, it does make sense.





[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs

2014-09-18 Thread viper-kun
Github user viper-kun commented on a diff in the pull request:

https://github.com/apache/spark/pull/2391#discussion_r17766911
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -100,6 +125,12 @@ private[history] class FsHistoryProvider(conf: 
SparkConf) extends ApplicationHis
 checkForLogs()
 logCheckingThread.setDaemon(true)
 logCheckingThread.start()
+
+if(conf.getBoolean("spark.history.cleaner.enable", false)) {
+  cleanLogs()
+  logCleaningThread.setDaemon(true)
+  logCleaningThread.start()
+}
--- End diff --

yes, you are right. there is no need to clean it on initialization.
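For context, a minimal standalone sketch of the pattern under discussion (placeholder names, not FsHistoryProvider's actual fields): a daemon thread that cleans on a fixed interval, with no cleanup call at initialization.

```scala
// Standalone sketch (placeholder names): the daemon thread waits first and
// then cleans periodically, so nothing runs at startup.
object PeriodicCleanerSketch {
  def cleanLogs(): Unit = println("cleaning expired event logs...")  // placeholder

  def start(intervalMillis: Long): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          Thread.sleep(intervalMillis)  // wait first: no cleanup at initialization
          cleanLogs()
        }
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }
}
```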





[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2305#issuecomment-56128662
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20567/consoleFull)
 for   PR 2305 at commit 
[`c0af05d`](https://github.com/apache/spark/commit/c0af05d387c85d8dce79d1b4839c5950da443dbc).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-18 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-56127880
  
I don't understand the test failure. Can someone help me? 






[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support

2014-09-18 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/2344#discussion_r17766373
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -170,6 +170,18 @@ case object TimestampType extends NativeType {
   def simpleString: String = "timestamp"
 }
 
+case object DateType extends NativeType {
+  private[sql] type JvmType = Date
+
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized 
{ typeTag[JvmType] }
+
+  private[sql] val ordering = new Ordering[JvmType] {
+def compare(x: Date, y: Date) = x.compareTo(y)
--- End diff --

Do we also need to modify `compareTo` from `TimestampType`?





[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56127611
  
This is a tricky issue.

Exact reproducibility / determinism crops up in two different senses here: 
re-running an entire job and re-computing a lost partition.  Spark's 
lineage-based fault-tolerance is built on the idea that partitions can be 
deterministically recomputed.  Tasks that have dependencies on the external 
environment may violate this determinism property (e.g. by reading the current 
system time to set a random seed).  Workers using different versions of 
libraries which give different results is one way that the environment can leak 
into tasks and make them non-deterministic based on where they're run.

There are some scenarios where exact reproducibility might be desirable.  
Imagine that I train a ML model on some data, make predictions with it, and 
want to go back and understand the lineage that led to that model being 
created.  To do this, I may want to deterministically re-run the job with 
additional internal logging.  This use-case is tricky in general, though: 
details of the execution environment might creep in via other means.  We might 
see different results due to rounding errors / numerical instability if we run 
on environments with different BLAS libraries, etc. (I guess we could say 
"deterministic within some rounding error / to _k_ bits of precision").  Exact 
long-term reproducibility of computational results is a hard, unsolved problem 
in general.

/cc @mengxr @jkbradley: since you work on MLlib, what do you think we 
should do here? Is there any precedent in MLlib and its use of native 
libraries?
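
A minimal Scala sketch of the determinism property described above, assuming `data` 
is some existing RDD (a hypothetical name): deriving each partition's RNG seed from 
a fixed job-level seed plus the partition index keeps a recomputed partition 
identical to the original run, whereas seeding from wall-clock time would not.

```scala
import scala.util.Random

val jobSeed = 42L  // fixed, job-level seed
val sampled = data.mapPartitionsWithIndex { (partitionIndex, iter) =>
  // Same partition index => same seed => same sample when the partition is recomputed.
  val rng = new Random(jobSeed + partitionIndex)
  iter.filter(_ => rng.nextDouble() < 0.1)
}
```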


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-18 Thread brndnmtthws
GitHub user brndnmtthws opened a pull request:

https://github.com/apache/spark/pull/2453

[SPARK-3597][Mesos] Implement `killTask`.

The MesosSchedulerBackend did not previously implement `killTask`,
resulting in an exception.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/brndnmtthws/spark implement-killtask

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2453


commit 40c447861de272fde227a165cb6b1d05bd668447
Author: Brenden Matthews 
Date:   2014-09-19T02:02:50Z

[SPARK-3597][Mesos] Implement `killTask`.

The MesosSchedulerBackend did not previously implement `killTask`,
resulting in an exception.
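
A rough sketch of what such an override could look like, assuming the backend holds 
a `driver` reference to its Mesos `SchedulerDriver` (an illustration of the idea, 
not necessarily the exact code in this PR):

```scala
import org.apache.mesos.Protos.TaskID

// Forward Spark's kill request to Mesos instead of falling through to the
// default SchedulerBackend.killTask, which throws an exception.
override def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {
  driver.killTask(TaskID.newBuilder().setValue(taskId.toString).build())
}
```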




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3529] [SQL] Delete the temp files after...

2014-09-18 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2393#issuecomment-56127248
  
+1 lgtm

fyi, i checked, deleteOnExit isn't an option because it cannot recursively 
delete directories
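
For reference, a sketch of the kind of recursive helper needed here (illustrative 
names and error handling, not Spark's actual `Utils.deleteRecursively`): 
`File.deleteOnExit` only removes a single file or an empty directory at JVM exit, 
so a temp directory tree has to be walked explicitly.

```scala
import java.io.{File, IOException}

def deleteRecursively(file: File): Unit = {
  if (file.isDirectory) {
    // listFiles returns null on I/O errors, so guard it.
    Option(file.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  if (!file.delete() && file.exists()) {
    throw new IOException(s"Failed to delete: ${file.getAbsolutePath}")
  }
}
```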


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2098] All Spark processes should suppor...

2014-09-18 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/2379#discussion_r17766151
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala
 ---
@@ -44,30 +50,19 @@ private[spark] class HistoryServerArguments(conf: 
SparkConf, args: Array[String]
 }
   }
 
+  // Use common defaults file, if not specified by user
+  propertiesFile = 
Option(propertiesFile).getOrElse(Utils.getDefaultConfigFile)
+
   private def printUsageAndExit(exitCode: Int) {
 System.err.println(
   """
-  |Usage: HistoryServer
-  |
-  |Configuration options can be set by setting the corresponding JVM 
system property.
-  |History Server options are always available; additional options 
depend on the provider.
-  |
-  |History Server options:
-  |
-  |  spark.history.ui.port  Port where server will listen 
for connections
-  | (default 18080)
-  |  spark.history.acls.enable  Whether to enable view acls 
for all applications
-  | (default false)
-  |  spark.history.provider Name of history provider class 
(defaults to
-  | file system-based provider)
-  |  spark.history.retainedApplications Max number of application UIs 
to keep loaded in memory
-  | (default 50)
-  |FsHistoryProvider options:
-  |
-  |  spark.history.fs.logDirectory  Directory where app logs are 
stored (required)
-  |  spark.history.fs.updateIntervalHow often to reload log data 
from storage (in seconds,
-  | default 10)
-  |""".stripMargin)
--- End diff --

Showing only the `spark.*` options here is confusing. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread brndnmtthws
Github user brndnmtthws commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56126568
  
I've cleaned up the patch again.

I spent about an hour trying to apply this to the YARN code, but it was 
pretty difficult to follow so I gave up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support

2014-09-18 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/2344#discussion_r17765726
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -170,6 +170,18 @@ case object TimestampType extends NativeType {
   def simpleString: String = "timestamp"
 }
 
+case object DateType extends NativeType {
+  private[sql] type JvmType = Date
+
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized 
{ typeTag[JvmType] }
+
+  private[sql] val ordering = new Ordering[JvmType] {
+def compare(x: Date, y: Date) = x.compareTo(y)
--- End diff --

I've checked the logic of `java.sql.Date.compareTo`, and it is not the same as 
`DateWritable.compareTo`, which is the internal representation in Hive. The 
former compares the milliseconds, but the latter only compares the days since 
the epoch, so we probably need to follow the same semantics as Hive here.
BTW: if we change the logic here, does that also mean we no longer need to cast 
`date` to `string` for `BinaryPredicate` in `HiveTypeCoercion`?
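
A hedged sketch of the Hive-like semantics described above, comparing only whole 
days since the epoch rather than milliseconds (time-zone handling is glossed over 
here; Hive's `DateWritable` does its own day arithmetic):

```scala
import java.sql.Date
import java.util.concurrent.TimeUnit

val dayOrdering = new Ordering[Date] {
  private def toDays(d: Date): Long = TimeUnit.MILLISECONDS.toDays(d.getTime)
  // Two dates on the same calendar day compare as equal, regardless of time-of-day.
  def compare(x: Date, y: Date): Int = toDays(x).compare(toDays(y))
}
```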


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support

2014-09-18 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/2344#discussion_r17765733
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -170,6 +170,18 @@ case object TimestampType extends NativeType {
   def simpleString: String = "timestamp"
 }
 
+case object DateType extends NativeType {
+  private[sql] type JvmType = Date
+
+  @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized 
{ typeTag[JvmType] }
+
+  private[sql] val ordering = new Ordering[JvmType] {
+def compare(x: Date, y: Date) = x.compareTo(y)
--- End diff --

I've checked the logic of `java.sql.Date.compareTo`, and it is not the same as 
`DateWritable.compareTo`, which is the internal representation in Hive. The 
former compares the milliseconds, but the latter only compares the days since 
the epoch, so we probably need to follow the same semantics as Hive here.
BTW: if we change the logic here, does that also mean we no longer need to cast 
`date` to `string` for `BinaryPredicate` in `HiveTypeCoercion`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2305#issuecomment-56126150
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20567/consoleFull)
 for   PR 2305 at commit 
[`c0af05d`](https://github.com/apache/spark/commit/c0af05d387c85d8dce79d1b4839c5950da443dbc).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-09-18 Thread li-zhihui
Github user li-zhihui commented on the pull request:

https://github.com/apache/spark/pull/1616#issuecomment-56126098
  
@andrewor14 any more comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3589][Minor]remove redundant code

2014-09-18 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2445#discussion_r17765764
  
--- Diff: bin/spark-class ---
@@ -169,7 +169,6 @@ if [ -n "$SPARK_SUBMIT_BOOTSTRAP_DRIVER" ]; then
   # This is used only if the properties file actually contains these 
special configs
   # Export the environment variables needed by 
SparkSubmitDriverBootstrapper
   export RUNNER
-  export CLASSPATH
--- End diff --

Oh I see, we already export it earlier in the script. I put this here to make 
it more explicit that these are the variables needed by 
`SparkSubmitDriverBootstrapper`, so even though this line is somewhat 
redundant, it's kept for code readability.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support

2014-09-18 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/2344#discussion_r17765755
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/api/java/DateType.java ---
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.api.java;
+
+/**
+ * The data type representing java.sql.Timestamp values.
+ *
+ * {@code TimestampType} is represented by the singleton object {@link 
DataType#TimestampType}.
--- End diff --

Update the comment, please. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2305#issuecomment-56125812
  
thanks for the feedback. i've changed the language to be more in line with 
your suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56125354
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20564/consoleFull)
 for   PR 2401 at commit 
[`d51b74f`](https://github.com/apache/spark/commit/d51b74f26b67ff3719b7ba3182d0693e45301a81).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1486#issuecomment-56125277
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20563/consoleFull)
 for   PR 1486 at commit 
[`d1f9fe3`](https://github.com/apache/spark/commit/d1f9fe36392ab18e36e8491cae4598e0063e59fa).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765442
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/BLASSuite.scala ---
@@ -126,4 +126,142 @@ class BLASSuite extends FunSuite {
   }
 }
   }
+
+  test("gemm") {
--- End diff --

Shouldn't this test all 4 options for transA, transB?
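
One way to cover all four combinations in a hypothetical test body (assuming the 
suite's existing access to `BLAS.gemm` and `DenseMatrix`): square inputs keep every 
(transA, transB) pairing dimension-compatible.

```scala
val A = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val B = new DenseMatrix(2, 2, Array(5.0, 6.0, 7.0, 8.0))
for (transA <- Seq(false, true); transB <- Seq(false, true)) {
  val C = new DenseMatrix(2, 2, Array.fill(4)(0.0))
  gemm(transA, transB, 1.0, A, B, 0.0, C)
  // compare C against a hand-computed expected matrix for this combination
}
```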


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-18 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17765427
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * 
nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
--- End diff --

In `DTStatsAggregatorFixedFeatures` and `DTStatsAggregatorSubsampledFeatures`, 
`impurityAggregator.update` is called with `_allStats`, while in 
`DTStatsAggregator` it is called with `allStats`. I think we can make this 
consistent by adding a function `updateAggregator` to `DTStatsAggregator`, like 
this:

```scala
  def updateAggregator(offset: Int, label: Double, instanceWeight: Double): Unit = {
    impurityAggregator.update(allStats, offset, label, instanceWeight)
  }
```

All other functions that need to call `impurityAggregator.update` would then 
call this function.
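
For instance, under that suggestion the `update` method quoted in the diff above 
would reduce to (illustration only):

```scala
  def update(
      nodeIndex: Int,
      featureIndex: Int,
      binIndex: Int,
      label: Double,
      instanceWeight: Double): Unit = {
    val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
    updateAggregator(i, label, instanceWeight)
  }
```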





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2305#discussion_r17765356
  
--- Diff: docs/programming-guide.md ---
@@ -286,7 +286,7 @@ We describe operations on distributed datasets later on.
 
 
 
-One important parameter for parallel collections is the number of *slices* 
to cut the dataset into. Spark will run one task for each slice of the cluster. 
Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark 
tries to set the number of slices automatically based on your cluster. However, 
you can also set it manually by passing it as a second parameter to 
`parallelize` (e.g. `sc.parallelize(data, 10)`).
+One important parameter for parallel collections is the number of 
*partitions* to cut the dataset into. Spark will run one task for each 
partition of the cluster. Typically you want 2-4 partitions for each CPU in 
your cluster. Normally, Spark tries to set the number of partitions 
automatically based on your cluster. However, you can also set it manually by 
passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 
10)`). Note: the parameter is called numSlices (not numPartitions) to maintain 
backward compatibility.
--- End diff --

Maybe the "Note:" should mention that in _some_ places we still say 
numSlices (for backwards compatibility with earlier versions of Spark) and that 
"slices" should be considered as a synonym for "partitions"; there are a lot of 
places that use `numPartitions`, etc, so we may want to emphasize that this 
discrepancy only occurs in a few places.
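
A tiny spark-shell illustration of the naming discrepancy (assuming an existing 
SparkContext `sc`) could make this concrete:

```scala
// The parameter is still named numSlices for backward compatibility...
val rdd = sc.parallelize(1 to 100, numSlices = 10)
// ...but it simply sets the number of partitions of the resulting RDD.
assert(rdd.partitions.length == 10)
```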


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2305#issuecomment-56124864
  
Sorry for not reviewing this until now; it sort of fell off my radar.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3589][Minor]remove redundant code

2014-09-18 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on a diff in the pull request:

https://github.com/apache/spark/pull/2445#discussion_r17765294
  
--- Diff: bin/spark-class ---
@@ -169,7 +169,6 @@ if [ -n "$SPARK_SUBMIT_BOOTSTRAP_DRIVER" ]; then
   # This is used only if the properties file actually contains these 
special configs
   # Export the environment variables needed by 
SparkSubmitDriverBootstrapper
   export RUNNER
-  export CLASSPATH
--- End diff --

@pwendell I mean we already export CLASSPATH before entering the if 
conditional (line 161). Are the two different?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3529] [SQL] Delete the temp files after...

2014-09-18 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/2393#issuecomment-56124852
  
Thank you all, I've removed the `Signal` handling and used 
`Utils.deleteRecursively` instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3554] [PySpark] use broadcast automatic...

2014-09-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2417


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3554] [PySpark] use broadcast automatic...

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2417#issuecomment-56124617
  
LGTM.  Surprising that the broadcast variable removal code was never 
triggered in the test suite before; thanks for fixing that!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765188
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-18 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17765183
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * 
nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   * @param nodeOffset  Pre-computed node offset from [[getNodeOffset]].
+   */
+  def nodeUpdate(
+  nodeOffset: Int,
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeOffset + featureOffsets(featureIndex) + binIndex * 
statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
+   * For ordered features only.
+   */
+  def getNodeFeatureOffset(nodeIndex: Int, featureIndex: Int): Int = {
+require(!isUnordered(featureIndex),
+  s"DTStatsAggregator.getNodeFeatureOffset is for ordered features 
only, but was called" +
+s" for unordered feature $featureIndex.")
+nodeIndex * nodeStride + featureOffsets(featureIndex)
+  }
+
+  /**
+   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
+   * For unordered features only.
+   */
+  def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): 
(Int, Int) = {
+require(isUnordered(featureIndex),
--- End diff --

Can we promote this function to the base class `DTStatsAggregator`, like 
this:

```scala
  def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): (Int, Int) = {
    val baseOffset = getNodeFeatureOffset(nodeIndex, featureIndex)
    (baseOffset, baseOffset + (metadata.numBins(featureIndex) >> 1) * statsSize)
  }
```

This would avoid duplicating the code in the derived classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubs

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765187
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+requ

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765178
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765175
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765173
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765167
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r
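
For readers skimming the diff context above (it is repeated verbatim in several
of the review comments that follow), the dispatch in the public gemm can be
summarized with a small sketch. The trait and case objects below are simplified
stand-ins rather than the MLlib matrix classes; only the branching behaviour is
taken from the quoted code.

    // Simplified stand-ins for DenseMatrix / SparseMatrix.
    sealed trait MatrixKind
    case object Dense extends MatrixKind
    case object Sparse extends MatrixKind

    def gemmDispatch(alpha: Double, a: MatrixKind, b: MatrixKind): String = {
      if (alpha == 0.0) {
        "alpha == 0: C is returned unchanged (only a debug log)"
      } else (a, b) match {
        case (Sparse, Dense)  => "sparse A * dense B kernel"
        case (Dense, Dense)   => "dense A * dense B kernel (native dgemm)"
        case (Dense, Sparse)  => "dense A * sparse B kernel"
        case (Sparse, Sparse) =>
          throw new IllegalArgumentException(
            "gemm doesn't support sparse-sparse matrix multiplication")
      }
    }

    // e.g. gemmDispatch(1.0, Dense, Dense) returns "dense A * dense B kernel (native dgemm)"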

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765077
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-18 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17765048
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * 
nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
--- End diff --

Just curious: is this function really faster than `update`? I think it only
saves one multiplication, and in `DTStatsAggregatorSubsampledFeatures` it saves
nothing.
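
For concreteness, here is a minimal, self-contained sketch of the two index
computations being compared. The field names are borrowed from the quoted diff;
the values are made up for illustration, so this is not the actual Spark code.

    object IndexingSketch {
      // Stand-ins for the aggregator fields of the same names in the diff.
      val statsSize = 3
      val featureOffsets = Array(0, 9, 21)   // cumulative statsSize * numBins
      val nodeStride = featureOffsets.last

      // update-style: recomputes the node offset on every call (one extra multiply).
      def updateIndex(nodeIndex: Int, featureIndex: Int, binIndex: Int): Int =
        nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize

      // nodeUpdate-style: reuses a node offset precomputed once per node.
      def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
      def nodeUpdateIndex(nodeOffset: Int, featureIndex: Int, binIndex: Int): Int =
        nodeOffset + featureOffsets(featureIndex) + binIndex * statsSize
    }

Whether the saved nodeIndex * nodeStride multiply is measurable in a per-node
loop over (feature, bin) pairs is exactly the question here.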


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17765001
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...

2014-09-18 Thread larryxiao
Github user larryxiao commented on the pull request:

https://github.com/apache/spark/pull/2446#issuecomment-56123977
  
Thanks Ankur! You are really efficient!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17764905
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+requ

[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-18 Thread brkyvz
Github user brkyvz commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-56123894
  
@ScrapCodes THANKS A LOT! That fixed it! I didn't realize I hadn't updated my
local repo in such a long time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17764836
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17764833
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
 throw new IllegalArgumentException(s"scal doesn't support vector 
type ${x.getClass}.")
 }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+if (_nativeBLAS == null) {
+  _nativeBLAS = NativeBLAS
+}
+_nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+if (alpha == 0.0) {
+  logDebug("gemm: alpha is equal to 0. Returning C.")
+} else {
+  A match {
+case sparse: SparseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, 
dB, beta, C)
+case sB: SparseMatrix =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
sparse-sparse matrix " +
+s"multiplication")
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case dense: DenseMatrix =>
+  B match {
+case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, 
beta, C)
+case sB: SparseMatrix => gemm(transA, transB, alpha, dense, 
sB, beta, C)
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support 
matrix type ${B.getClass}.")
+  }
+case _ =>
+  throw new IllegalArgumentException(s"gemm doesn't support matrix 
type ${A.getClass}.")
+  }
+}
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x 
k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x 
n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+  alpha: Double,
+  A: Matrix,
+  B: Matrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: DenseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+val tAstr = if (!transA) "N" else "T"
+val tBstr = if (!transB) "N" else "T"
+
+require(kA == kB, s"The columns of A don't match the rows of B. A: 
$kA, B: $kB")
+require(mA == C.numRows, s"The rows of C don't match the rows of A. C: 
${C.numRows}, A: $mA")
+require(nB == C.numCols,
+  s"The columns of C don't match the columns of B. C: ${C.numCols}, A: 
$nB")
+
+nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, 
B.values, B.numRows,
+  beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+  transA: Boolean,
+  transB: Boolean,
+  alpha: Double,
+  A: SparseMatrix,
+  B: DenseMatrix,
+  beta: Double,
+  C: DenseMatrix): Unit = {
+val mA: Int = if (!transA) A.numRows else A.numCols
+val nB: Int = if (!transB) B.numCols else B.numRows
+val kA: Int = if (!transA) A.numCols else A.numRows
+val kB: Int = if (!transB) B.numRows else B.numCols
+
+r

[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread brndnmtthws
Github user brndnmtthws commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56123833
  
I can emulate the YARN behaviour, but it seems better to just do the same 
thing with both Mesos and YARN.  Thoughts?  I can refactor this (including the 
YARN code) to make it consistent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2294#issuecomment-56123782
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20562/consoleFull)
 for   PR 2294 at commit 
[`88814ed`](https://github.com/apache/spark/commit/88814ed215e81bb6b945c8fff60bdb822ac2cb96).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56123690
  
Yes @willb, I have the same concern. I think in other contexts "overhead"
refers to the additional space, not the total space, so an "overhead fraction"
of 0.15 means we use extra space equal to 15% of the original memory. It would
be confusing to expose a "fraction" config whose value is 1.15.

However, even with the latest changes this is still inconsistent with what
we do for YARN. I think there is value in having an equivalent config for Mesos
rather than one that is semantically different.
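
To make the two readings concrete, a small sketch (the 1.15 default comes from
the patch under discussion; the additive "fraction" variant is only
illustrative):

    object OverheadSketch {
      val executorMemoryMB = 1024

      // Multiplier semantics (this patch's default of 1.15): total = mem * 1.15
      def totalWithMultiplier(multiplier: Double): Int =
        (executorMemoryMB * multiplier).toInt

      // Additive overhead-fraction semantics: total = mem + mem * 0.15
      def totalWithOverheadFraction(fraction: Double): Int =
        executorMemoryMB + (executorMemoryMB * fraction).toInt

      // Both yield 1177 MB for a 1024 MB executor:
      //   totalWithMultiplier(1.15) == totalWithOverheadFraction(0.15)
    }

The resulting numbers are the same; the disagreement is about what the config
name and its value should communicate.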


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...

2014-09-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2452


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...

2014-09-18 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2452#issuecomment-56123334
  
Thanks @andrewor14, I'm merging this into master and 1.1!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56123348
  
@JoshRosen it looks like @davies and I are on the same page. How would you
like to proceed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2452#issuecomment-56122936
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20561/consoleFull)
 for   PR 2452 at commit 
[`d5190ca`](https://github.com/apache/spark/commit/d5190ca661991b11a6dc6682c92600beab032175).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-18 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2451#issuecomment-56122908
  
@brkyvz  Just wondering: Which reference library are you using to determine 
the order of arguments for BLAS routines?  E.g., it's different from [Netlib 
LAPACK](http://www.netlib.org/lapack/explore-html/d7/d2b/dgemm_8f.html).
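
For reference, a minimal sketch of the Fortran-style argument order that the
dgemm call in this diff follows. It assumes the com.github.fommil.netlib
artifact MLlib already depends on; the example matrices are arbitrary.

    import com.github.fommil.netlib.BLAS

    object DgemmOrderSketch {
      def main(args: Array[String]): Unit = {
        val m = 2; val n = 2; val k = 2
        val a = Array(1.0, 2.0, 3.0, 4.0)   // 2x2 A, column-major
        val b = Array(1.0, 0.0, 0.0, 1.0)   // 2x2 identity
        val c = Array.fill(4)(0.0)
        // Netlib order: (transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
        BLAS.getInstance.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
        println(c.mkString(", "))           // prints 1.0, 2.0, 3.0, 4.0
      }
    }

The wrapper in this PR then exposes gemm(transA, transB, alpha, A, B, beta, C)
and derives m, n, k and the leading dimensions from the Matrix arguments.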


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56122608
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20560/consoleFull)
 for   PR 2378 at commit 
[`810f97f`](https://github.com/apache/spark/commit/810f97f53befdb262e55e736500d909b5f869f1a).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-18 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17764228
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * 
nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   * @param nodeOffset  Pre-computed node offset from [[getNodeOffset]].
+   */
+  def nodeUpdate(
+  nodeOffset: Int,
+  nodeIndex: Int,
--- End diff --

Sorry, `nodeIndex` is needed in `DTStatsAggregatorSubsampledFeatures`. Closing
this comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56122505
  
@willb in the case of Yarn, the parameter is called "overhead" because it 
actually sets the amount of overhead you want to add to the requested heap 
memory. The PR being referenced just uses a multiplier to set the default 
overhead value; the user-visible parameter doesn't set the multiplier.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17764224
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * 
nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   * @param nodeOffset  Pre-computed node offset from [[getNodeOffset]].
+   */
+  def nodeUpdate(
+  nodeOffset: Int,
+  nodeIndex: Int,
--- End diff --

This is to stay consistent with the abstract class API.  The other instance 
of this class (for subsampling features) does require nodeIndex.  I agree this 
is awkward, but I wanted to keep the same API (so that the RF aggregation code 
does not need to know the difference).
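
A stripped-down sketch of that trade-off, with invented class names and an
abbreviated parameter list (so it is not the actual Spark API): both subclasses
implement the same nodeUpdate signature, and only the subsampled-features one
actually reads nodeIndex.

    abstract class AggregatorSketch {
      def nodeUpdate(nodeOffset: Int, nodeIndex: Int, featureIndex: Int): Int
    }

    class FixedFeaturesSketch(featureOffsets: Array[Int]) extends AggregatorSketch {
      // nodeIndex is ignored: the precomputed nodeOffset already encodes it.
      def nodeUpdate(nodeOffset: Int, nodeIndex: Int, featureIndex: Int): Int =
        nodeOffset + featureOffsets(featureIndex)
    }

    class SubsampledFeaturesSketch(
        featuresForNode: Array[Array[Int]],
        featureOffsets: Array[Int]) extends AggregatorSketch {
      // nodeIndex is needed to translate the global featureIndex into this
      // node's subsampled feature slot before indexing.
      def nodeUpdate(nodeOffset: Int, nodeIndex: Int, featureIndex: Int): Int = {
        val localFeature = featuresForNode(nodeIndex).indexOf(featureIndex)
        nodeOffset + featureOffsets(localFeature)
      }
    }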


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests

2014-09-18 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2435#discussion_r17764131
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala 
---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
 }
 this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, 
bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling 
features.
+ *
+ * @param numNodes  Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+metadata: DecisionTreeMetadata,
+numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the 
[[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * 
nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between 
nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right 
child stats
+   *   in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   */
+  def update(
+  nodeIndex: Int,
+  featureIndex: Int,
+  binIndex: Int,
+  label: Double,
+  instanceWeight: Double): Unit = {
+val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + 
binIndex * statsSize
+impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered 
features, using the given label.
+   * @param nodeOffset  Pre-computed node offset from [[getNodeOffset]].
+   */
+  def nodeUpdate(
+  nodeOffset: Int,
+  nodeIndex: Int,
--- End diff --

`nodeIndex` is never used here, because we already have `nodeOffset`, which is
computed from `nodeIndex`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...

2014-09-18 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2452#issuecomment-56122218
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1759#issuecomment-56121439
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20558/consoleFull)
 for   PR 1759 at commit 
[`677fa3d`](https://github.com/apache/spark/commit/677fa3d1cfffa7129cd5c8ab01b8918234cf4ce5).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait ScalaReflection `
  * `  case class Schema(dataType: DataType, nullable: Boolean)`
  * `  class Macros[C <: Context](val c: C) extends ScalaReflection `
  * `trait InterpolatedItem `
  * `case class InterpolatedUDF(index: Int, expr: c.Expr[Any], 
returnType: DataType)`
  * `case class InterpolatedTable(index: Int, expr: c.Expr[Any], 
schema: StructType)`
  * `case class RecSchema(name: String, index: Int, cType: DataType, 
tpe: Type)`
  * `  case class ImplSchema(name: String, tpe: Type, impl: Tree)`
  * `trait TypedSQL `
  * `  implicit class SQLInterpolation(val strCtx: StringContext) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56121260
  
@andrewor14 This bothers me too, but in a slightly different way:  calling 
the parameter “overhead” when it really refers to how to scale requested 
memory to accommodate anticipated overhead seems wrong.  (115% overhead would 
be appalling indeed!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread brndnmtthws
Github user brndnmtthws commented on a diff in the pull request:

https://github.com/apache/spark/pull/2401#discussion_r17763516
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CommonProps.scala 
---
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import org.apache.spark.SparkContext
+
+object CommonProps {
+  def memoryOverheadFraction(sc: SparkContext) =
+sc.conf.getOption("spark.executor.memory.overhead.fraction")
+  .getOrElse("1.15")
--- End diff --

I was trying to keep the code looking clean, but I'll change `fraction` to 
`multiplier` in the method, if that seems more sensible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-18 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-56120931
  
Updated as per @andrewor14's comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2401#discussion_r17763285
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CommonProps.scala 
---
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import org.apache.spark.SparkContext
+
+object CommonProps {
+  def memoryOverheadFraction(sc: SparkContext) =
+sc.conf.getOption("spark.executor.memory.overhead.fraction")
+  .getOrElse("1.15")
--- End diff --

In my mind a fraction is a value between 0 and 1, but 1.15 is not in this 
range.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56120575
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20564/consoleFull)
 for   PR 2401 at commit 
[`d51b74f`](https://github.com/apache/spark/commit/d51b74f26b67ff3719b7ba3182d0693e45301a81).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread brndnmtthws
Github user brndnmtthws commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56120347
  
Forgot to mention: I also set the executor CPUs correctly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread brndnmtthws
Github user brndnmtthws commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56120329
  
Updated as per @andrewor14's suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


