[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1903 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/1903#issuecomment-56140430 Thanks! Merged into master and branch-1.1.
[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...
GitHub user ravipesala opened a pull request: https://github.com/apache/spark/pull/2456

[SPARK-3536][SQL] SELECT on empty parquet table throws exception

Parquet returns null metadata when an empty Parquet file is queried while calculating splits, so this patch adds a null check and returns empty splits.

Author: ravipesala ravindra.pes...@huawei.com

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/spark SPARK-3536

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2456.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2456

commit 1e81a50631b1f44ad7de65b83408a40218234745
Author: ravipesala
Date: 2014-09-18T18:02:46Z

    Fixed the issue when querying on empty parquet file.
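The fix described above is a null guard placed before the split calculation. A minimal sketch of that pattern follows; `FileMetadata`, `readMetadata`, and `computeSplits` are hypothetical names used for illustration only and are not the Parquet or Spark API.

```scala
object EmptyTableSplits {
  // Hypothetical stand-in for the metadata object a reader hands back.
  final case class FileMetadata(rowGroupSizes: Seq[Long])

  // Hypothetical reader: returns null for an empty file, mimicking the
  // behaviour the pull request describes.
  def readMetadata(rowGroupSizes: Seq[Long]): FileMetadata =
    if (rowGroupSizes.isEmpty) null else FileMetadata(rowGroupSizes)

  // Option(null) is None, so a null result becomes an empty list of splits
  // instead of a NullPointerException further down.
  def computeSplits(metadata: FileMetadata): Seq[Long] =
    Option(metadata).map(_.rowGroupSizes).getOrElse(Seq.empty)
}
```

Wrapping the possibly-null value in `Option` keeps the guard in one place; an explicit `if (metadata == null)` check would behave the same way.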
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user sarutak commented on a diff in the pull request: https://github.com/apache/spark/pull/2444#discussion_r17769855

--- Diff: .gitignore ---
@@ -19,6 +19,7 @@ conf/*.sh
 conf/*.properties
 conf/*.conf
 conf/*.xml
+conf/slaves
--- End diff --

So this file is not meant to be edited? Users should point to another slave list file with the SPARK_SLAVES variable, right?
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user sarutak commented on a diff in the pull request: https://github.com/apache/spark/pull/2444#discussion_r17769822

--- Diff: sbin/slaves.sh ---
@@ -67,20 +69,26 @@ fi
 if [ "$HOSTLIST" = "" ]; then
   if [ "$SPARK_SLAVES" = "" ]; then
-    export HOSTLIST="${SPARK_CONF_DIR}/slaves"
+    if [ -f "${SPARK_CONF_DIR}/slaves" ]; then
+      HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"`
+    else
+      HOSTLIST=localhost
+    fi
   else
-    export HOSTLIST="${SPARK_SLAVES}"
+    HOSTLIST=`cat "${SPARK_SLAVES}"`
--- End diff --

This change makes HOSTLIST hold the list of hosts itself rather than a file path, and uses localhost as the default host list entry.
[GitHub] spark pull request: [SPARK-3547]Using a special exit code instead ...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/2421#issuecomment-56137301 @liancheng Ah, I saw sometimes Jenkins ignores us... but recently he is friendly :D
[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/2446#issuecomment-56137274 Thanks! ok to test
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-56136714 @derrickburns I cannot see the Jenkins log. Let's call Jenkins again. test this please
[GitHub] spark pull request: [MLLIB] fix a unresolved reference variable 'n...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2423#issuecomment-56136584 @OdinLin Thanks for catching the bug! As @davies mentioned, #2378 will completely replace the current SerDe. Could you close this PR?
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56136476 test this please
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2294
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-56136224 LGTM. I'm merging this into master. (We might need to make slight changes to some methods before the 1.2 release, but let's not block the multi-model training PR for now.)
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56135962 Philosophically, I agree with @erikerlandson about it being OK for random generators to be, well, random. If problems are caused by the output of a randomized process not being reproducible, then the output probably isn't being used/tested correctly. Practically, I second @mengxr in saying we should encourage reproducibility by requiring numpy in MLlib. But avoiding it where possible sounds good, assuming the performance hit is not too bad.
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2455#discussion_r17769391 --- Diff: core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala --- @@ -43,66 +46,218 @@ trait RandomSampler[T, U] extends Pseudorandom with Cloneable with Serializable throw new NotImplementedError("clone() is not implemented.") } + +object RandomSampler { + // Default random number generator used by random samplers + def rngDefault: Random = new XORShiftRandom + + // Default gap sampling maximum + // For sampling fractions <= this value, the gap sampling optimization will be applied. + // Above this value, it is assumed that "tradtional" bernoulli sampling is faster. The + // optimal value for this will depend on the RNG. More expensive RNGs will tend to make + // the optimal value higher. The most reliable way to determine this value for a given RNG + // is to experiment. I would expect a value of 0.5 to be close in most cases. + def gsmDefault: Double = 0.4 + + // Default gap sampling epsilon + // When sampling random floating point values the gap sampling logic requires value > 0. An + // optimal value for this parameter is at or near the minimum positive floating point value + // returned by nextDouble() for the RNG being used. + def epsDefault: Double = 5e-11 +} + + /** * :: DeveloperApi :: * A sampler based on Bernoulli trials. 
* - * @param lb lower bound of the acceptance range - * @param ub upper bound of the acceptance range - * @param complement whether to use the complement of the range specified, default to false + * @param fraction the sampling fraction, aka Bernoulli sampling probability * @tparam T item type */ @DeveloperApi -class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = false) - extends RandomSampler[T, T] { +class BernoulliSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T] { - private[random] var rng: Random = new XORShiftRandom + require(fraction >= 0.0 && fraction <= 1.0, "Sampling fraction must be on interval [0, 1]") - def this(ratio: Double) = this(0.0d, ratio) + def this(lb: Double, ub: Double, complement: Boolean = false) = +this(if (complement) (1.0 - (ub - lb)) else (ub - lb)) + + private val rng: Random = RandomSampler.rngDefault override def setSeed(seed: Long) = rng.setSeed(seed) override def sample(items: Iterator[T]): Iterator[T] = { -items.filter { item => - val x = rng.nextDouble() - (x >= lb && x < ub) ^ complement +fraction match { + case f if (f <= 0.0) => Iterator.empty + case f if (f >= 1.0) => items + case f if (f <= RandomSampler.gsmDefault) => +new GapSamplingIterator(items, f, rng, RandomSampler.epsDefault) + case _ => items.filter(_ => (rng.nextDouble() <= fraction)) --- End diff -- Did you test whether `rdd.randomSplit()` will produce non-overlapping subsets with this change?
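The question concerns the range-based acceptance test that the diff removes: when every split draws from the same seeded sequence and keeps only the items whose draw lands in its own [lb, ub) range, the splits cannot overlap. A minimal sketch of that idea, assuming `java.util.Random` in place of Spark's `XORShiftRandom`; `rangeSample` is a hypothetical helper and not part of the patch.

```scala
import java.util.Random

object RangeSplitSketch {
  // Keep an item when its random draw falls in [lb, ub).
  // Every call with the same seed sees the same sequence of draws,
  // so item i always gets the same x regardless of the range used.
  def rangeSample[T](items: Seq[T], lb: Double, ub: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter { _ =>
      val x = rng.nextDouble()
      x >= lb && x < ub
    }
  }
}
```

With the same seed, the ranges [0.0, 0.7) and [0.7, 1.0) select complementary items. A sampler that only compares each draw against a fraction, as in the `case _` branch of the diff, would accept draws from 0 upward for both splits, which seems to be the overlap the reviewer is asking about.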
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56135525 @JoshRosen PySpark/MLlib requires NumPy to run, and I don't think we claimed that we support different versions of NumPy. `sample()` in core is different. Maybe we can test the overhead of using Python's own random. The overhead may not be huge, due to serialization/deserialization.
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17769270 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(s"scal doesn't support vector type ${x.getClass}.") } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA whether to use the transpose of matrix A (true), or A itself (false). + * @param transB whether to use the transpose of matrix B (true), or B itself (false). + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logDebug("gemm: alpha is equal to 0. 
Returning C.") +} else { + A match { +case sparse: SparseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, dB, beta, C) +case sB: SparseMatrix => + throw new IllegalArgumentException(s"gemm doesn't support sparse-sparse matrix " + +s"multiplication") +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case dense: DenseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, beta, C) +case sB: SparseMatrix => gemm(transA, transB, alpha, dense, sB, beta, C) +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${A.getClass}.") + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) "N" else "T" +val tBstr = if (!transB) "N" else "T" + +require(kA == kB, s"The columns of A don't match the rows of B. A: $kA, B: $kB") +require(mA == C.numRows, s"The rows of C don't match the rows of A. 
C: ${C.numRows}, A: $mA") +require(nB == C.numCols, + s"The columns of C don't match the columns of B. C: ${C.numCols}, A: $nB") + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +r
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56135114 I only had a few minor comments about documentation while trying to do a quick read-through of this patch. No substantive comments.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769069 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala --- @@ -22,13 +22,35 @@ package org.apache.spark.scheduler * In the latter case, we will prefer to launch the task on that executorID, but our next level * of preference will be executors on the same host if this is not possible. */ -private[spark] -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable { - override def toString: String = "TaskLocation(" + host + ", " + executorId + ")" +private[spark] sealed abstract class TaskLocation(val host: String) { +} + +private [spark] case class ExecutorCacheTaskLocation(override val host: String, +val executorId: String) extends TaskLocation(host) { +} + +private [spark] case class HDFSCachedTaskLocation(override val host: String) +extends TaskLocation(host) { + override def toString = TaskLocation.in_memory_location_tag + host +} + +private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) { + override def toString = host } private[spark] object TaskLocation { - def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId)) + // We identify hosts on which the block is cached with this prefix. Because this prefix contains + // underscores, which are not legal characters in hostnames, there should be no potential for + // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames. + val in_memory_location_tag = "_hdfs_cache_" + + def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId) - def apply(host: String) = new TaskLocation(host, None) + def apply(str: String) = { --- End diff -- The contract of this method is kinda sketchy -- taking in a "str" which is either a host name or a tag. 
Would you mind adding a bit of Javadoc to explain that this is what is happening?
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769055 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala --- @@ -22,13 +22,35 @@ package org.apache.spark.scheduler * In the latter case, we will prefer to launch the task on that executorID, but our next level * of preference will be executors on the same host if this is not possible. */ -private[spark] -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable { - override def toString: String = "TaskLocation(" + host + ", " + executorId + ")" +private[spark] sealed abstract class TaskLocation(val host: String) { +} + +private [spark] case class ExecutorCacheTaskLocation(override val host: String, +val executorId: String) extends TaskLocation(host) { +} + +private [spark] case class HDFSCachedTaskLocation(override val host: String) +extends TaskLocation(host) { + override def toString = TaskLocation.in_memory_location_tag + host +} + +private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) { + override def toString = host } private[spark] object TaskLocation { - def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId)) + // We identify hosts on which the block is cached with this prefix. Because this prefix contains + // underscores, which are not legal characters in hostnames, there should be no potential for + // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames. 
+ val in_memory_location_tag = "_hdfs_cache_" + + def apply(host: String, executorId: String) = new ExecutorCacheTaskLocation(host, executorId) - def apply(host: String) = new TaskLocation(host, None) + def apply(str: String) = { +if (str.startsWith(in_memory_location_tag)) { + new HDFSCachedTaskLocation(str.substring(in_memory_location_tag.length)) --- End diff -- nit: `str.stripPrefix(in_memory_location_tag)`
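The nit above swaps manual `substring` arithmetic for `stripPrefix`. A small sketch of how that reads, assuming the tag value from the diff; `PrefixSketch`, `inMemoryLocationTag`, and `cachedHost` are illustrative names rather than code from the patch.

```scala
object PrefixSketch {
  // Tag value copied from the diff under review; the camel-case name is an assumption.
  val inMemoryLocationTag = "_hdfs_cache_"

  // Returns Some(host) when str carries the tag, None for a plain hostname.
  def cachedHost(str: String): Option[String] =
    if (str.startsWith(inMemoryLocationTag)) Some(str.stripPrefix(inMemoryLocationTag))
    else None
}
```

Note that `stripPrefix` returns the string unchanged when the prefix is absent, so the `startsWith` check is still needed to tell a tagged string from a bare hostname.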
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17769053

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
+  override def toString = host
 }
 
 private[spark] object TaskLocation {
-  def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId))
+  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
+  // underscores, which are not legal characters in hostnames, there should be no potential for
+  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
+  val in_memory_location_tag = "_hdfs_cache_"
--- End diff --

Also, nit: could you use camel case: `inMemoryLocationTag`, or all caps with underscores if you prefer.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768989

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -309,4 +323,42 @@ private[spark] object HadoopRDD {
       f(inputSplit, firstParent[T].iterator(split, context))
     }
   }
+
+  private[spark] class SplitInfoReflections {
+    val inputSplitWithLocationInfo =
+      Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
+    val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo")
+    val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit")
+    val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo")
+    val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo")
+    val isInMemory = splitLocationInfo.getMethod("isInMemory")
+    val getLocation = splitLocationInfo.getMethod("getLocation")
+  }
+
+  private[spark] val SPLIT_INFO_REFLECTIONS = try {
+    Some(new SplitInfoReflections)
+  } catch {
+    case e: Exception =>
+      logDebug("SplitLocationInfo and other new Hadoop classes are " +
+          "unavailable. Using the older Hadoop location info code.", e)
+      None
+  }
+
+  private[spark] def convertSplitLocationInfo(infos: Array[AnyRef]) :Seq[String] = {
--- End diff --

nit: `): Seq[String]`
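The diff above probes for optional Hadoop classes by reflection and falls back to `None` when they are missing. A self-contained sketch of the same pattern using `scala.util.Try`; `ReflectionProbe`, `Handles`, and `probe` are hypothetical names, and the classes looked up in the usage note are JDK classes chosen only so the example runs without Hadoop on the classpath.

```scala
import java.lang.reflect.Method
import scala.util.Try

object ReflectionProbe {
  // Holds the reflective handles. Constructing it throws
  // ClassNotFoundException or NoSuchMethodException if anything is missing.
  final class Handles(className: String, methodName: String) {
    val cls: Class[_] = Class.forName(className)
    val method: Method = cls.getMethod(methodName)
  }

  // Some(handles) when the class and its zero-argument method exist, None otherwise.
  def probe(className: String, methodName: String): Option[Handles] =
    Try(new Handles(className, methodName)).toOption
}
```

For example, `ReflectionProbe.probe("java.lang.String", "length")` is defined, while a lookup of a class that is not on the classpath yields `None`. Unlike the diff, this sketch discards the exception instead of logging it.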
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768962

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
--- End diff --

Would you mind beefing up the documentation here a bit? I am having trouble reading through and quickly finding out the difference between HostTaskLocation and ExecutorCacheTaskLocation. I guess the latter is exclusively used for the BlockManager cache, but it would be good to be explicit.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768928

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala ---
@@ -22,13 +22,35 @@ package org.apache.spark.scheduler
  * In the latter case, we will prefer to launch the task on that executorID, but our next level
  * of preference will be executors on the same host if this is not possible.
  */
-private[spark]
-class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable {
-  override def toString: String = "TaskLocation(" + host + ", " + executorId + ")"
+private[spark] sealed abstract class TaskLocation(val host: String) {
+}
+
+private [spark] case class ExecutorCacheTaskLocation(override val host: String,
+    val executorId: String) extends TaskLocation(host) {
+}
+
+private [spark] case class HDFSCachedTaskLocation(override val host: String)
+    extends TaskLocation(host) {
+  override def toString = TaskLocation.in_memory_location_tag + host
+}
+
+private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) {
--- End diff --

Minor, but `override val` on something that exports the same parameter is kinda weird, I think this could be cleaned up just slightly by making TaskLocation a trait instead with a `def host: String`. Then this still works and is the sole implementation.
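The suggestion above can be sketched as follows: the trait declares `host` abstractly and each case class parameter implements it, so no `override val` is needed. This is one reading of the review comment, not the code that was merged; the wrapper object `TaskLocationSketch` exists only to keep the example self-contained.

```scala
object TaskLocationSketch {
  // The trait only declares host; it carries no constructor parameter.
  sealed trait TaskLocation {
    def host: String
  }

  // A case class parameter is a val, which satisfies the abstract def directly.
  final case class ExecutorCacheTaskLocation(host: String, executorId: String)
    extends TaskLocation

  final case class HDFSCachedTaskLocation(host: String) extends TaskLocation

  final case class HostTaskLocation(host: String) extends TaskLocation {
    override def toString: String = host
  }
}
```

Because the trait is sealed, a `match` over `TaskLocation` is still checked for exhaustiveness by the compiler, as with the sealed abstract class in the diff.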
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-5619 Yes, this appears to be an issue with our checker and adding an exclusion is fine for now. The class is private. Just had really minor comments and I can address them on merge if you want. This is looking good to me. Any other changes or is this good from your side?
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768506 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala --- @@ -22,13 +22,35 @@ package org.apache.spark.scheduler * In the latter case, we will prefer to launch the task on that executorID, but our next level * of preference will be executors on the same host if this is not possible. */ -private[spark] -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable { - override def toString: String = "TaskLocation(" + host + ", " + executorId + ")" +private[spark] sealed abstract class TaskLocation(val host: String) { +} + +private [spark] case class ExecutorCacheTaskLocation(override val host: String, +val executorId: String) extends TaskLocation(host) { +} + +private [spark] case class HDFSCachedTaskLocation(override val host: String) +extends TaskLocation(host) { + override def toString = TaskLocation.in_memory_location_tag + host +} + +private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation(host) { + override def toString = host } private[spark] object TaskLocation { - def apply(host: String, executorId: String) = new TaskLocation(host, Some(executorId)) + // We identify hosts on which the block is cached with this prefix. Because this prefix contains + // underscores, which are not legal characters in hostnames, there should be no potential for + // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames. + val in_memory_location_tag = "_hdfs_cache_" --- End diff -- could you drop the prefixing `_` here to make it consistent with blockid? Having a trailing underscore seems sufficient to distinguish it from a real hostname.
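To illustrate why a tag containing an underscore is enough to tell a cached-replica location from a plain hostname, here is a small sketch. The names `Location`, `HdfsCacheLocation`, `HostLocation`, `InMemoryLocationTag` and `parse` are hypothetical and chosen for this example; the tag value follows the reviewer's suggestion of dropping the leading underscore and is likewise an assumption about the final form.

```scala
// Sketch only: hypothetical names, not Spark's API.
sealed trait Location { def host: String }
final case class HdfsCacheLocation(host: String) extends Location
final case class HostLocation(host: String) extends Location

object Location {
  // Assumed tag: no leading underscore, trailing underscore kept.
  // Underscores are not legal in hostnames (RFC 952, RFC 1123), so a
  // real hostname cannot start with this prefix.
  val InMemoryLocationTag = "hdfs_cache_"

  // Classify a location string by checking for the tag prefix.
  def parse(s: String): Location =
    if (s.startsWith(InMemoryLocationTag)) HdfsCacheLocation(s.stripPrefix(InMemoryLocationTag))
    else HostLocation(s)
}
```

A string such as `hdfs_cache_node1` is read as a cached replica on `node1`, while `node1` on its own is read as an ordinary host preference.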
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/2455 [SPARK-3250] Implement Gap Sampling optimization for random sampling More efficient sampling, based on Gap Sampling optimization: http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/ You can merge this pull request into a Git repository by running: $ git pull https://github.com/erikerlandson/spark spark-3250-pr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2455.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2455 commit 82f461f95dc18ff44e0cd6f2e4784d3557ad3d0f Author: Erik Erlandson Date: 2014-09-18T22:26:49Z [SPARK-3250] Implement Gap Sampling optimization for random sampling
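The PR description does not spell out the algorithm, so the following is only a sketch of the general gap-sampling idea for Bernoulli sampling, not the code in the PR. Instead of drawing one random number per element, it draws the number of elements to skip before the next accepted one from a geometric distribution. The object name `GapSampling` and the function `gapSample` are made up for this example.

```scala
import scala.util.Random

// Sketch only: Bernoulli sampling with probability p, implemented by
// skipping geometrically distributed gaps between accepted elements.
object GapSampling {
  def gapSample[T](data: Iterator[T], p: Double, rng: Random): Iterator[T] = {
    require(p > 0.0 && p <= 1.0, "p must be in (0, 1]")
    val logQ = math.log(1.0 - p) // -Infinity when p == 1.0, which yields gaps of 0

    new Iterator[T] {
      // Skip a gap k with P(k) = (1 - p)^k * p, stopping early if data runs out.
      private def skipGap(): Unit = {
        val u = 1.0 - rng.nextDouble() // uniform in (0, 1], avoids log(0)
        val gap = math.floor(math.log(u) / logQ).toInt
        var i = 0
        while (i < gap && data.hasNext) { data.next(); i += 1 }
      }

      skipGap() // position on the first accepted element
      def hasNext: Boolean = data.hasNext
      def next(): T = { val x = data.next(); skipGap(); x }
    }
  }
}
```

For small `p` this makes roughly one random draw per accepted element rather than one per input element, which is where the speed-up comes from; for `p` close to 1 the per-element coin flip is likely just as cheap.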
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768479 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskLocation.scala --- @@ -22,13 +22,35 @@ package org.apache.spark.scheduler * In the latter case, we will prefer to launch the task on that executorID, but our next level * of preference will be executors on the same host if this is not possible. */ -private[spark] -class TaskLocation private (val host: String, val executorId: Option[String]) extends Serializable { - override def toString: String = "TaskLocation(" + host + ", " + executorId + ")" +private[spark] sealed abstract class TaskLocation(val host: String) { +} + +private [spark] case class ExecutorCacheTaskLocation(override val host: String, +val executorId: String) extends TaskLocation(host) { +} + +private [spark] case class HDFSCachedTaskLocation(override val host: String) --- End diff -- should this be `HDFSCacheTaskLocation` to be consistent with `ExecutorCacheTaskLocation`?
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1486#discussion_r17768467 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -309,4 +323,42 @@ private[spark] object HadoopRDD { f(inputSplit, firstParent[T].iterator(split, context)) } } + + private[spark] class SplitInfoReflections { +val inputSplitWithLocationInfo = + Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo") +val getLocationInfo = inputSplitWithLocationInfo.getMethod("getLocationInfo") +val newInputSplit = Class.forName("org.apache.hadoop.mapreduce.InputSplit") +val newGetLocationInfo = newInputSplit.getMethod("getLocationInfo") +val splitLocationInfo = Class.forName("org.apache.hadoop.mapred.SplitLocationInfo") +val isInMemory = splitLocationInfo.getMethod("isInMemory") +val getLocation = splitLocationInfo.getMethod("getLocation") + } + + private[spark] val SPLIT_INFO_REFLECTIONS = try { --- End diff -- did you decide you'd prefer not to do this?
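The diff above looks up Hadoop classes by name so that the code still loads on Hadoop versions that lack them. A minimal sketch of that pattern follows, using `java.lang.String` as a stand-in for the Hadoop class so the example runs without Hadoop on the classpath. The object name `OptionalReflection` and the helper `lookup` are hypothetical.

```scala
import java.lang.reflect.Method
import scala.util.Try

// Sketch only: resolve a no-argument method reflectively, and fall back to
// None when the class or method is not present at runtime.
object OptionalReflection {
  def lookup(className: String, methodName: String): Option[Method] =
    Try(Class.forName(className).getMethod(methodName)).toOption
}
```

A caller would resolve the method once, keep the `Option[Method]`, and take the reflective path only when it is defined, for example `OptionalReflection.lookup("java.lang.String", "length").map(_.invoke("abc"))`.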
[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/2440#issuecomment-56132627 Removing this sounds good to me too. Will upload a patch. I think a measure of how long a task spends in shuffle would be useful though, as it helps users understand whether task slowness is caused by their code or the framework. (I've found a similar metric useful when trying to debug MR performance issues.)
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-56132497 These changes look good to me. This addresses what continues to be the #1 issue that we see in Cloudera customer YARN deployments. It's worth considering boosting this when using PySpark, but that's probably work for another JIRA.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-56132524 @nishkamravi2 mind resolving the merge conflicts?
[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56130009 > This patch fails unit tests. i'm getting HTTP 503 from jenkins, but i'm gonna go out on a limb and say this doc change didn't break the unit tests.
[GitHub] spark pull request: [YARN] SPARK-2668: Add variable of yarn log di...
Github user renozhang commented on the pull request: https://github.com/apache/spark/pull/1573#issuecomment-56129984 Sorry @tgravescs, I've been very busy these days. I'll address them this weekend.
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56129891

that's a very good point, especially about how it's an unsolved problem in general, at least on our existing operating systems. iirc, systems like plan9 tried to address complete reproducibility, but i may be misremembering the specifics.

the four stated cases are:
a) driver w/ numpy, worker w/ numpy - numpy used, no message emitted
b) driver w/ numpy, worker w/o numpy - numpy used on driver, not used on workers, warning emitted
c) driver w/o numpy, worker w/ numpy - numpy not used on driver nor worker, no message emitted
d) driver w/o numpy, worker w/o numpy - numpy not used on driver nor worker, no message emitted

case (a) is not a concern because numpy is used consistently throughout
case (b) is not a concern because python random is used consistently throughout the workers
cases (c) and (d) are not a concern because python's random module is used throughout

however, there's a fifth case:
e) driver w/ numpy, some workers w/ numpy, some workers w/o numpy

there's actually a sixth case, but it's intractable for spark and shouldn't be considered: different implementations of python random or numpy's random across workers. this is something that should be managed outside of spark.

in (e), some workers will use numpy and others will use random. previously, all workers w/o numpy would error out, potentially terminating the computation. now, a warning will be emitted (though it'll be emitted to /dev/null) and execution will complete.

i'd solve this with a systems approach: remove the python random code and require numpy to be present, or remove the numpy code. and, i'd lean toward using the faster code (numpy). however, that might not be palatable for the project. if it is, i'm more than happy to scrap this ticket and create another to simplify the RDDSampler.

as i see it, to proceed we evaluate -
```
if acceptable to require numpy:
    matt.dothat()
else:
    if acceptable to potentially compromise re-computability w/ warning:
        commit this
    else:
        scrap this
```
(i've left out the case where we decide to simplify by always using the slower python code, because i'd rather not trade off performance to avoid an error message and i think adding a numpy dep is straightforward)
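The case analysis in the comment above can be written down as a small decision table. The sketch below models it in Scala purely as an illustration: the real code is PySpark's RDDSampler, and every name here (`NumpyCases`, `workerChoice`, `workersConsistent`) is hypothetical. It encodes the stated behaviour, namely that a worker uses numpy only when both driver and worker have it and warns only when the driver has numpy but the worker does not, and then checks whether all workers end up on the same generator.

```scala
// Sketch only: a model of the cases described in the comment, not PySpark code.
object NumpyCases {
  sealed trait Rng
  case object NumpyRng extends Rng
  case object PythonRandom extends Rng

  // Returns the generator a worker ends up using and whether it warns.
  def workerChoice(driverHasNumpy: Boolean, workerHasNumpy: Boolean): (Rng, Boolean) =
    (driverHasNumpy, workerHasNumpy) match {
      case (true, true)  => (NumpyRng, false)     // all-numpy case
      case (true, false) => (PythonRandom, true)  // fallback with a warning
      case (false, _)    => (PythonRandom, false) // driver without numpy
    }

  // True when every worker ends up using the same generator.
  def workersConsistent(driverHasNumpy: Boolean, workersHaveNumpy: Seq[Boolean]): Boolean =
    workersHaveNumpy.map(w => workerChoice(driverHasNumpy, w)._1).distinct.size <= 1
}
```

Only the mixed cluster, where the driver has numpy and the workers disagree, makes `workersConsistent` return false, which matches the one case the comment flags as a re-computability risk.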
[GitHub] spark pull request: [SPARK-3599]Avoid loaing properties file frequ...
GitHub user WangTaoTheTonic opened a pull request: https://github.com/apache/spark/pull/2454 [SPARK-3599]Avoid loaing properties file frequently https://issues.apache.org/jira/browse/SPARK-3599 You can merge this pull request into a Git repository by running: $ git pull https://github.com/WangTaoTheTonic/spark avoidLoadingFrequently Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2454.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2454 commit 2a79f26497f9232465aa2e9b496b0d54b9ccda75 Author: WangTaoTheTonic Date: 2014-09-19T01:52:37Z Avoid loaing properties file frequently
[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs
Github user viper-kun commented on the pull request: https://github.com/apache/spark/pull/2391#issuecomment-56129613 @andrewor14 i have checked Hadoop's JobHistoryServer. it is JobHistoryServer's responsibility to delete the application logs.
[GitHub] spark pull request: [SPARK-3589][Minor]remove redundant code
Github user WangTaoTheTonic commented on a diff in the pull request: https://github.com/apache/spark/pull/2445#discussion_r17766944 --- Diff: bin/spark-class --- @@ -169,7 +169,6 @@ if [ -n "$SPARK_SUBMIT_BOOTSTRAP_DRIVER" ]; then # This is used only if the properties file actually contains these special configs # Export the environment variables needed by SparkSubmitDriverBootstrapper export RUNNER - export CLASSPATH --- End diff -- ok, it does make sense.
[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs
Github user viper-kun commented on a diff in the pull request: https://github.com/apache/spark/pull/2391#discussion_r17766911 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -100,6 +125,12 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis checkForLogs() logCheckingThread.setDaemon(true) logCheckingThread.start() + +if(conf.getBoolean("spark.history.cleaner.enable", false)) { + cleanLogs() + logCleaningThread.setDaemon(true) + logCleaningThread.start() +} --- End diff -- yes, you are right. there is no need to clean it on initialization.
[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56128662 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20567/consoleFull) for PR 2305 at commit [`c0af05d`](https://github.com/apache/spark/commit/c0af05d387c85d8dce79d1b4839c5950da443dbc). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-56127880 I don't understand the test failure. Can someone help me? > On Sep 16, 2014, at 6:59 PM, Nicholas Chammas wrote: > > cc @mengxr
[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support
Github user adrian-wang commented on a diff in the pull request: https://github.com/apache/spark/pull/2344#discussion_r17766373 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala --- @@ -170,6 +170,18 @@ case object TimestampType extends NativeType { def simpleString: String = "timestamp" } +case object DateType extends NativeType { + private[sql] type JvmType = Date + + @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] } + + private[sql] val ordering = new Ordering[JvmType] { +def compare(x: Date, y: Date) = x.compareTo(y) --- End diff -- Do we also need to modify `compareTo` from `TimestampType`?
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56127611 This is a tricky issue. Exact reproducibility / determinism crops up in two different senses here: re-running an entire job and re-computing a lost partition. Spark's lineage-based fault-tolerance is built on the idea that partitions can be deterministically recomputed. Tasks that have dependencies on the external environment may violate this determinism property (e.g. by reading the current system time to set a random seed). Workers using different versions of libraries which give different results is one way that the environment can leak into tasks and make them non-deterministic based on where they're run. There are some scenarios where exact reproducibility might be desirable. Imagine that I train a ML model on some data, make predictions with it, and want to go back and understand the lineage that led to that model being created. To do this, I may want to deterministically re-run the job with additional internal logging. This use-case is tricky in general, though: details of the execution environment might creep in via other means. We might see different results due to rounding errors / numerical instability if we run on environments with different BLAS libraries, etc (I guess we could say "deterministic within some rounding error / to _k_ bits of precision"). Exact long-term reproducibility of computational results is a hard, unsolved problem in general. /cc @mengxr @jkbradley; since you work on MLlib; what do you think we should do here? Is there any precedent in MLlib and its use of native libraries?
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
GitHub user brndnmtthws opened a pull request: https://github.com/apache/spark/pull/2453 [SPARK-3597][Mesos] Implement `killTask`. The MesosSchedulerBackend did not previously implement `killTask`, resulting in an exception. You can merge this pull request into a Git repository by running: $ git pull https://github.com/brndnmtthws/spark implement-killtask Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2453.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2453 commit 40c447861de272fde227a165cb6b1d05bd668447 Author: Brenden Matthews Date: 2014-09-19T02:02:50Z [SPARK-3597][Mesos] Implement `killTask`. The MesosSchedulerBackend did not previously implement `killTask`, resulting in an exception.
[GitHub] spark pull request: [SPARK-3529] [SQL] Delete the temp files after...
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2393#issuecomment-56127248 +1 lgtm fyi, i checked, deleteOnExit isn't an option because it cannot recursively delete
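As the comment notes, `File.deleteOnExit` cannot remove a directory that still has contents, so temporary directories need an explicit recursive delete. The following is a minimal sketch of such a helper; the object name `TempDirs` and the function `deleteRecursively` are hypothetical, and this is not claimed to be what the PR does.

```scala
import java.io.File

// Sketch only: delete children first, then the entry itself.
object TempDirs {
  // Returns true if the top-level entry was deleted.
  // Note: isDirectory follows symbolic links, so a real implementation
  // would want to treat symlinked directories with care.
  def deleteRecursively(f: File): Boolean = {
    if (f.isDirectory) {
      // listFiles() returns null on I/O error, hence the Option wrapper.
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
  }
}
```

One way to get behaviour close to `deleteOnExit` is to register the helper in a JVM shutdown hook, for example `sys.addShutdownHook { TempDirs.deleteRecursively(dir) }`, where `dir` is the temporary directory to clean up.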
[GitHub] spark pull request: [SPARK-2098] All Spark processes should suppor...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/2379#discussion_r17766151 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/HistoryServerArguments.scala --- @@ -44,30 +50,19 @@ private[spark] class HistoryServerArguments(conf: SparkConf, args: Array[String] } } + // Use common defaults file, if not specified by user + propertiesFile = Option(propertiesFile).getOrElse(Utils.getDefaultConfigFile) + private def printUsageAndExit(exitCode: Int) { System.err.println( """ - |Usage: HistoryServer - | - |Configuration options can be set by setting the corresponding JVM system property. - |History Server options are always available; additional options depend on the provider. - | - |History Server options: - | - | spark.history.ui.port Port where server will listen for connections - | (default 18080) - | spark.history.acls.enable Whether to enable view acls for all applications - | (default false) - | spark.history.provider Name of history provider class (defaults to - | file system-based provider) - | spark.history.retainedApplications Max number of application UIs to keep loaded in memory - | (default 50) - |FsHistoryProvider options: - | - | spark.history.fs.logDirectory Directory where app logs are stored (required) - | spark.history.fs.updateInterval How often to reload log data from storage (in seconds, - | default 10) - |""".stripMargin) --- End diff -- Showing the `spark.*` options only here is confusing.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user brndnmtthws commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56126568 I've cleaned up the patch again. I spent about an hour trying to apply this to the YARN code, but it was pretty difficult to follow so I gave up.
[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/2344#discussion_r17765726 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala --- @@ -170,6 +170,18 @@ case object TimestampType extends NativeType { def simpleString: String = "timestamp" } +case object DateType extends NativeType { + private[sql] type JvmType = Date + + @transient private[sql] lazy val tag = ScalaReflectionLock.synchronized { typeTag[JvmType] } + + private[sql] val ordering = new Ordering[JvmType] { +def compare(x: Date, y: Date) = x.compareTo(y) --- End diff -- I've checked the logic of `java.sql.Date`.`compareTo`, and it is not the same as `DateWritable`.`compareTo`, which is the internal representation in Hive. The former compares milliseconds, but the latter only compares the `days since Epoch`, so we probably need to follow the same semantics as Hive here. BTW: If we change the logic here, does that also mean we needn't cast the `date` to `string` for `BinaryPredicate` in `HiveTypeCoercion`?
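To make the difference between the two comparisons concrete, here is a small sketch of an `Ordering[Date]` that compares only whole days since the epoch. It computes days in UTC from the millisecond value, which is an assumption made for this example; Hive's `DateWritable` may account for the local time zone, so this is not a drop-in equivalent. The object name `DayOrdering` and its members are hypothetical.

```scala
import java.sql.Date

// Sketch only: order dates by whole days since the epoch (UTC),
// ignoring the time-of-day part that Date.compareTo would compare.
object DayOrdering {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // floorDiv keeps pre-1970 instants on the correct (earlier) day.
  def daysSinceEpoch(d: Date): Long = Math.floorDiv(d.getTime, MillisPerDay)

  val byDay: Ordering[Date] = Ordering.by[Date, Long](d => daysSinceEpoch(d))
}
```

Two `Date` values that fall on the same UTC day but differ by a few milliseconds are unequal under `compareTo` and equal under `DayOrdering.byDay`, which is the discrepancy the comment points out.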
[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56126150 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20567/consoleFull) for PR 2305 at commit [`c0af05d`](https://github.com/apache/spark/commit/c0af05d387c85d8dce79d1b4839c5950da443dbc). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...
Github user li-zhihui commented on the pull request: https://github.com/apache/spark/pull/1616#issuecomment-56126098 @andrewor14 any more comments?
[GitHub] spark pull request: [SPARK-3589][Minor]remove redundant code
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2445#discussion_r17765764 --- Diff: bin/spark-class --- @@ -169,7 +169,6 @@ if [ -n "$SPARK_SUBMIT_BOOTSTRAP_DRIVER" ]; then # This is used only if the properties file actually contains these special configs # Export the environment variables needed by SparkSubmitDriverBootstrapper export RUNNER - export CLASSPATH --- End diff -- Oh I see, we already export it outside the conditional. I added this line to make it explicit that these are the variables needed by `SparkSubmitDriverBootstrapper`, so even though the line is somewhat redundant, it is there for readability.
[GitHub] spark pull request: [SPARK-3407][SQL]Add Date type support
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/2344#discussion_r17765755 --- Diff: sql/core/src/main/java/org/apache/spark/sql/api/java/DateType.java --- @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.api.java; + +/** + * The data type representing java.sql.Timestamp values. + * + * {@code TimestampType} is represented by the singleton object {@link DataType#TimestampType}. --- End diff -- Update the comment, please. :)
[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56125812 Thanks for the feedback. I've changed the language to be more in line with your suggestion.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56125354 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20564/consoleFull) for PR 2401 at commit [`d51b74f`](https://github.com/apache/spark/commit/d51b74f26b67ff3719b7ba3182d0693e45301a81). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-1767: Prefer HDFS-cached replicas when s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1486#issuecomment-56125277 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20563/consoleFull) for PR 1486 at commit [`d1f9fe3`](https://github.com/apache/spark/commit/d1f9fe36392ab18e36e8491cae4598e0063e59fa). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765442 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/BLASSuite.scala --- @@ -126,4 +126,142 @@ class BLASSuite extends FunSuite { } } } + + test("gemm") { --- End diff -- Shouldn't this test all 4 options for transA,transB?
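As a rough sketch of what covering all four `transA`/`transB` combinations could look like, the snippet below checks a naive reference `gemm` against a hand-computed product. Everything here is hypothetical and self-contained: `Mat` is a small column-major stand-in, not MLlib's `DenseMatrix`, and the reference `gemm` is a plain triple loop rather than the native BLAS call in the PR.

```scala
object GemmTransposeCheck {
  // Hypothetical column-major dense matrix (not MLlib's DenseMatrix).
  final case class Mat(rows: Int, cols: Int, values: Array[Double]) {
    def apply(i: Int, j: Int): Double = values(i + j * rows)
    def transpose: Mat = {
      val out = new Array[Double](rows * cols)
      for (i <- 0 until rows; j <- 0 until cols) out(j + i * cols) = this(i, j)
      Mat(cols, rows, out)
    }
  }

  // Reference implementation of C := alpha * op(A) * op(B) + beta * C,
  // where op(X) is X or its transpose depending on the flag.
  def gemm(transA: Boolean, transB: Boolean, alpha: Double,
           a: Mat, b: Mat, beta: Double, c: Mat): Unit = {
    val opA = if (transA) a.transpose else a
    val opB = if (transB) b.transpose else b
    require(opA.cols == opB.rows, "inner dimensions must match")
    require(opA.rows == c.rows && opB.cols == c.cols, "C has the wrong shape")
    for (i <- 0 until c.rows; j <- 0 until c.cols) {
      var sum = 0.0
      var k = 0
      while (k < opA.cols) { sum += opA(i, k) * opB(k, j); k += 1 }
      c.values(i + j * c.rows) = alpha * sum + beta * c.values(i + j * c.rows)
    }
  }

  // Runs all four flag combinations. Each input is pre-transposed when its
  // flag is set, so every combination should yield the same product A * B.
  def allCombinationsAgree(): Boolean = {
    val a = Mat(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0)) // rows (1,2,3), (4,5,6)
    val b = Mat(3, 2, Array(1.0, 0.0, 2.0, 3.0, 1.0, 0.0)) // rows (1,3), (0,1), (2,0)
    val expected = Seq(7.0, 16.0, 5.0, 17.0)               // A * B, column-major
    val flags = Seq(false, true)
    val results = for (tA <- flags; tB <- flags) yield {
      val c = Mat(2, 2, new Array[Double](4))
      gemm(tA, tB, 1.0, if (tA) a.transpose else a, if (tB) b.transpose else b, 0.0, c)
      c.values.toSeq == expected
    }
    results.forall(identity)
  }
}
```

Pre-transposing the inputs keeps a single expected result for all four cases, which makes it harder for one combination to slip through untested.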
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17765427 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala --- @@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator( } this } +} + +/** + * DecisionTree statistics aggregator. + * This holds a flat array of statistics for a set of (nodes, features, bins) + * and helps with indexing. + * + * This instance of [[DTStatsAggregator]] is used when not subsampling features. + * + * @param numNodes Number of nodes to collect statistics for. + */ +private[tree] class DTStatsAggregatorFixedFeatures( +metadata: DecisionTreeMetadata, +numNodes: Int) extends DTStatsAggregator(metadata) { + + /** + * Offset for each feature for calculating indices into the [[_allStats]] array. + * Mapping: featureIndex --> offset + */ + private val featureOffsets: Array[Int] = { +metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins) + } + + /** + * Number of elements for each node, corresponding to stride between nodes in [[_allStats]]. + */ + private val nodeStride: Int = featureOffsets.last + + /** + * Total number of elements stored in this aggregator. + */ + def allStatsSize: Int = numNodes * nodeStride + + /** + * Flat array of elements. + * Index for start of stats for a (node, feature, bin) is: + * index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize + * Note: For unordered features, the left child stats precede the right child stats + * in the binIndex order. + */ + protected val _allStats: Array[Double] = new Array[Double](allStatsSize) + + /** + * Get flat array of elements stored in this aggregator. + */ + protected def allStats: Array[Double] = _allStats + + /** + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. 
+ */ + def update( + nodeIndex: Int, + featureIndex: Int, + binIndex: Int, + label: Double, + instanceWeight: Double): Unit = { +val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize +impurityAggregator.update(_allStats, i, label, instanceWeight) --- End diff -- In `DTStatsAggregatorFixedFeatures` and `DTStatsAggregatorSubsampledFeatures`, `impurityAggregator.update` is called with `_allStats`, and in `DTStatsAggregator` it is called with `allStats`. I think we can make this consistent by adding a function `updateAggregator` to `DTStatsAggregator` like this:
```
def updateAggregator(offset: Int, label: Double, instanceWeight: Double): Unit = { impurityAggregator.update(allStats, offset, label, instanceWeight) }
```
All other functions that need to call `impurityAggregator.update` would then call this function instead.
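The flat-array indexing in the quoted diff, combined with the suggested `updateAggregator` helper, can be sketched roughly as follows. This is a simplified stand-in under stated assumptions: `BaseAggregator` and `FixedFeatures` are hypothetical names, and the update rule (add the instance weight to the slot for the label's class) replaces the PR's `impurityAggregator`, which is not shown in this thread.

```scala
object StatsIndexing {
  // Hypothetical base class: the only place that writes to the stats array.
  abstract class BaseAggregator(val statsSize: Int) {
    protected def allStats: Array[Double]

    // Assumed update rule for illustration: one slot per class label.
    protected def updateAggregator(offset: Int, label: Double, instanceWeight: Double): Unit = {
      allStats(offset + label.toInt) += instanceWeight
    }
  }

  class FixedFeatures(numBins: Array[Int], numNodes: Int, numClasses: Int)
      extends BaseAggregator(numClasses) {
    // featureIndex --> offset within one node's block of statistics.
    private val featureOffsets: Array[Int] =
      numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)
    // Number of elements per node, i.e. the stride between nodes.
    private val nodeStride: Int = featureOffsets.last
    private val _allStats: Array[Double] = new Array[Double](numNodes * nodeStride)
    protected def allStats: Array[Double] = _allStats

    // Start index of the stats for a (node, feature, bin) triple.
    def index(nodeIndex: Int, featureIndex: Int, binIndex: Int): Int =
      nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize

    def update(nodeIndex: Int, featureIndex: Int, binIndex: Int,
               label: Double, instanceWeight: Double): Unit =
      updateAggregator(index(nodeIndex, featureIndex, binIndex), label, instanceWeight)

    // Read access, only so the example can be inspected.
    def statAt(i: Int): Double = _allStats(i)
  }
}
```

For two features with 3 and 2 bins and a stats size of 2, the feature offsets are (0, 6, 10) and the node stride is 10, so `index(1, 1, 1)` is 10 + 6 + 2 = 18. Subclasses never touch the array directly, which is the consistency the comment asks for.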
[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2305#discussion_r17765356 --- Diff: docs/programming-guide.md --- @@ -286,7 +286,7 @@ We describe operations on distributed datasets later on. -One important parameter for parallel collections is the number of *slices* to cut the dataset into. Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). +One important parameter for parallel collections is the number of *partitions* to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: the parameter is called numSlices (not numPartitions) to maintain backward compatibility. --- End diff -- Maybe the "Note:" should mention that in _some_ places we still say numSlices (for backwards compatibility with earlier versions of Spark) and that "slices" should be considered as a synonym for "partitions"; there are a lot of places that use `numPartitions`, etc, so we may want to emphasize that this discrepancy only occurs in a few places.
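To make the "slices as a synonym for partitions" point concrete, here is a small sketch of cutting a local collection into `numSlices` contiguous pieces. `SliceSketch.slice` is a hypothetical helper written for this illustration; it is not Spark's `parallelize` and is not claimed to match how Spark actually splits data.

```scala
object SliceSketch {
  // Hypothetical helper: split data into numSlices contiguous chunks whose
  // sizes differ by at most one element. Each chunk plays the role of one
  // partition (one task) in the sense of the documentation text.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      // Long arithmetic avoids Int overflow for large collections.
      val start = ((i * data.length.toLong) / numSlices).toInt
      val end = (((i + 1) * data.length.toLong) / numSlices).toInt
      data.slice(start, end)
    }
  }
}
```

Calling `SliceSketch.slice(1 to 10, 3)` yields three chunks of sizes 3, 3, and 4 that together cover every element exactly once.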
[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56124864 Sorry for not reviewing this until now; it sort of fell off my radar.
[GitHub] spark pull request: [SPARK-3589][Minor]remove redundant code
Github user WangTaoTheTonic commented on a diff in the pull request: https://github.com/apache/spark/pull/2445#discussion_r17765294 --- Diff: bin/spark-class --- @@ -169,7 +169,6 @@ if [ -n "$SPARK_SUBMIT_BOOTSTRAP_DRIVER" ]; then # This is used only if the properties file actually contains these special configs # Export the environment variables needed by SparkSubmitDriverBootstrapper export RUNNER - export CLASSPATH --- End diff -- @pwendell I mean that we already export CLASSPATH before entering the if conditional (line 161). Are the two different?
[GitHub] spark pull request: [SPARK-3529] [SQL] Delete the temp files after...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/2393#issuecomment-56124852 Thank you all. I've removed the `Signal` handling and now use `Utils.deleteRecursively` instead.
[GitHub] spark pull request: [SPARK-3554] [PySpark] use broadcast automatic...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2417
[GitHub] spark pull request: [SPARK-3554] [PySpark] use broadcast automatic...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2417#issuecomment-56124617 LGTM. Surprising that the broadcast variable removal code was never triggered in the test suite before; thanks for fixing that!
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765188 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(s"scal doesn't support vector type ${x.getClass}.") } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA whether to use the transpose of matrix A (true), or A itself (false). + * @param transB whether to use the transpose of matrix B (true), or B itself (false). + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logDebug("gemm: alpha is equal to 0. 
Returning C.") +} else { + A match { +case sparse: SparseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, dB, beta, C) +case sB: SparseMatrix => + throw new IllegalArgumentException(s"gemm doesn't support sparse-sparse matrix " + +s"multiplication") +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case dense: DenseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, beta, C) +case sB: SparseMatrix => gemm(transA, transB, alpha, dense, sB, beta, C) +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${A.getClass}.") + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) "N" else "T" +val tBstr = if (!transB) "N" else "T" + +require(kA == kB, s"The columns of A don't match the rows of B. A: $kA, B: $kB") +require(mA == C.numRows, s"The rows of C don't match the rows of A. 
C: ${C.numRows}, A: $mA") +require(nB == C.numCols, + s"The columns of C don't match the columns of B. C: ${C.numCols}, A: $nB") + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +r
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17765183 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala --- @@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator( } this } +} + +/** + * DecisionTree statistics aggregator. + * This holds a flat array of statistics for a set of (nodes, features, bins) + * and helps with indexing. + * + * This instance of [[DTStatsAggregator]] is used when not subsampling features. + * + * @param numNodes Number of nodes to collect statistics for. + */ +private[tree] class DTStatsAggregatorFixedFeatures( +metadata: DecisionTreeMetadata, +numNodes: Int) extends DTStatsAggregator(metadata) { + + /** + * Offset for each feature for calculating indices into the [[_allStats]] array. + * Mapping: featureIndex --> offset + */ + private val featureOffsets: Array[Int] = { +metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins) + } + + /** + * Number of elements for each node, corresponding to stride between nodes in [[_allStats]]. + */ + private val nodeStride: Int = featureOffsets.last + + /** + * Total number of elements stored in this aggregator. + */ + def allStatsSize: Int = numNodes * nodeStride + + /** + * Flat array of elements. + * Index for start of stats for a (node, feature, bin) is: + * index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize + * Note: For unordered features, the left child stats precede the right child stats + * in the binIndex order. + */ + protected val _allStats: Array[Double] = new Array[Double](allStatsSize) + + /** + * Get flat array of elements stored in this aggregator. + */ + protected def allStats: Array[Double] = _allStats + + /** + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. 
+ */ + def update( + nodeIndex: Int, + featureIndex: Int, + binIndex: Int, + label: Double, + instanceWeight: Double): Unit = { +val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize +impurityAggregator.update(_allStats, i, label, instanceWeight) + } + + /** + * Pre-compute node offset for use with [[nodeUpdate]]. + */ + def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride + + /** + * Faster version of [[update]]. + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. + * @param nodeOffset Pre-computed node offset from [[getNodeOffset]]. + */ + def nodeUpdate( + nodeOffset: Int, + nodeIndex: Int, + featureIndex: Int, + binIndex: Int, + label: Double, + instanceWeight: Double): Unit = { +val i = nodeOffset + featureOffsets(featureIndex) + binIndex * statsSize +impurityAggregator.update(_allStats, i, label, instanceWeight) + } + + /** + * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]]. + * For ordered features only. + */ + def getNodeFeatureOffset(nodeIndex: Int, featureIndex: Int): Int = { +require(!isUnordered(featureIndex), + s"DTStatsAggregator.getNodeFeatureOffset is for ordered features only, but was called" + +s" for unordered feature $featureIndex.") +nodeIndex * nodeStride + featureOffsets(featureIndex) + } + + /** + * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]]. + * For unordered features only. + */ + def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): (Int, Int) = { +require(isUnordered(featureIndex), --- End diff -- Can we promote this function to the base class `DTStatsAggregator` like this: ```scala def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): (Int, Int) = { val baseOffset = getNodeFeatureOffset(nodeIndex, featureIndex) (baseOffset, baseOffset + (metadata.numBins(featureIndex) >> 1) * statsSize) } ``` This can avoid duplicate code in derived class. 
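The offset arithmetic in the suggested base-class method can be tried in isolation. The sketch below assumes, as the suggested snippet implies, that the right-child stats for an unordered feature start halfway through that feature's bins; `LeftRightOffsets.leftRightOffsets` is a hypothetical free-standing function, not part of the PR.

```scala
object LeftRightOffsets {
  // Given the (node, feature) base offset, return the start offsets of the
  // left-child and right-child statistics. The first half of the bins is
  // assumed to hold left-child stats and the second half right-child stats.
  def leftRightOffsets(baseOffset: Int, numBins: Int, statsSize: Int): (Int, Int) =
    (baseOffset, baseOffset + (numBins >> 1) * statsSize)
}
```

With a base offset of 12, 8 bins, and a stats size of 3, the right-child block starts 4 bins (12 elements) after the left one, so the result is `(12, 24)`.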
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765187 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(s"scal doesn't support vector type ${x.getClass}.") } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA whether to use the transpose of matrix A (true), or A itself (false). + * @param transB whether to use the transpose of matrix B (true), or B itself (false). + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logDebug("gemm: alpha is equal to 0. 
Returning C.") +} else { + A match { +case sparse: SparseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, dB, beta, C) +case sB: SparseMatrix => + throw new IllegalArgumentException(s"gemm doesn't support sparse-sparse matrix " + +s"multiplication") +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case dense: DenseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, beta, C) +case sB: SparseMatrix => gemm(transA, transB, alpha, dense, sB, beta, C) +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${A.getClass}.") + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) "N" else "T" +val tBstr = if (!transB) "N" else "T" + +require(kA == kB, s"The columns of A don't match the rows of B. A: $kA, B: $kB") +require(mA == C.numRows, s"The rows of C don't match the rows of A. 
C: ${C.numRows}, A: $mA") +require(nB == C.numCols, + s"The columns of C don't match the columns of B. C: ${C.numCols}, A: $nB") + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +requ
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2451#discussion_r17765178
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable {
         throw new IllegalArgumentException(s"scal doesn't support vector type ${x.getClass}.")
     }
   }
+
+  // For level-3 routines, we use the native BLAS.
+  private def nativeBLAS: NetlibBLAS = {
+    if (_nativeBLAS == null) {
+      _nativeBLAS = NativeBLAS
+    }
+    _nativeBLAS
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * @param transA whether to use the transpose of matrix A (true), or A itself (false).
+   * @param transB whether to use the transpose of matrix B (true), or B itself (false).
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+      transA: Boolean,
+      transB: Boolean,
+      alpha: Double,
+      A: Matrix,
+      B: Matrix,
+      beta: Double,
+      C: DenseMatrix): Unit = {
+    if (alpha == 0.0) {
+      logDebug("gemm: alpha is equal to 0. Returning C.")
+    } else {
+      A match {
+        case sparse: SparseMatrix =>
+          B match {
+            case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, dB, beta, C)
+            case sB: SparseMatrix =>
+              throw new IllegalArgumentException(s"gemm doesn't support sparse-sparse matrix " +
+                s"multiplication")
+            case _ =>
+              throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.")
+          }
+        case dense: DenseMatrix =>
+          B match {
+            case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, beta, C)
+            case sB: SparseMatrix => gemm(transA, transB, alpha, dense, sB, beta, C)
+            case _ =>
+              throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.")
+          }
+        case _ =>
+          throw new IllegalArgumentException(s"gemm doesn't support matrix type ${A.getClass}.")
+      }
+    }
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   *
+   * @param alpha a scalar to scale the multiplication A * B.
+   * @param A the matrix A that will be left multiplied to B. Size of m x k.
+   * @param B the matrix B that will be left multiplied by A. Size of k x n.
+   * @param beta a scalar that can be used to scale matrix C.
+   * @param C the resulting matrix C. Size of m x n.
+   */
+  def gemm(
+      alpha: Double,
+      A: Matrix,
+      B: Matrix,
+      beta: Double,
+      C: DenseMatrix): Unit = {
+    gemm(false, false, alpha, A, B, beta, C)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `DenseMatrix` A.
+   */
+  private def gemm(
+      transA: Boolean,
+      transB: Boolean,
+      alpha: Double,
+      A: DenseMatrix,
+      B: DenseMatrix,
+      beta: Double,
+      C: DenseMatrix): Unit = {
+    val mA: Int = if (!transA) A.numRows else A.numCols
+    val nB: Int = if (!transB) B.numCols else B.numRows
+    val kA: Int = if (!transA) A.numCols else A.numRows
+    val kB: Int = if (!transB) B.numRows else B.numCols
+    val tAstr = if (!transA) "N" else "T"
+    val tBstr = if (!transB) "N" else "T"
+
+    require(kA == kB, s"The columns of A don't match the rows of B. A: $kA, B: $kB")
+    require(mA == C.numRows, s"The rows of C don't match the rows of A. C: ${C.numRows}, A: $mA")
+    require(nB == C.numCols,
+      s"The columns of C don't match the columns of B. C: ${C.numCols}, A: $nB")
+
+    nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows,
+      beta, C.values, C.numRows)
+  }
+
+  /**
+   * C := alpha * A * B + beta * C
+   * For `SparseMatrix` A.
+   */
+  private def gemm(
+      transA: Boolean,
+      transB: Boolean,
+      alpha: Double,
+      A: SparseMatrix,
+      B: DenseMatrix,
+      beta: Double,
+      C: DenseMatrix): Unit = {
+    val mA: Int = if (!transA) A.numRows else A.numCols
+    val nB: Int = if (!transB) B.numCols else B.numRows
+    val kA: Int = if (!transA) A.numCols else A.numRows
+    val kB: Int = if (!transB) B.numRows else B.numCols
+
+    r
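For readers following the quoted diff, the update rule `C := alpha * A * B + beta * C` with optional transposes can be sketched as a naive triple loop. This is only an illustration under stated assumptions: `GemmSketch` and `Dense` are hypothetical stand-ins (not MLlib's `DenseMatrix` and not the native `dgemm` call used in the PR), and values are assumed to be stored column-major.

```scala
// Hypothetical reference sketch, not code from the PR.
object GemmSketch {
  /** Minimal column-major dense matrix; a stand-in for MLlib's DenseMatrix. */
  final case class Dense(numRows: Int, numCols: Int, values: Array[Double]) {
    require(values.length == numRows * numCols, "values has the wrong length")
    def apply(i: Int, j: Int): Double = values(i + j * numRows)
  }

  /** C := alpha * op(A) * op(B) + beta * C, where op(X) is X or its transpose. */
  def gemm(
      transA: Boolean,
      transB: Boolean,
      alpha: Double,
      a: Dense,
      b: Dense,
      beta: Double,
      c: Dense): Unit = {
    // Effective shapes after the optional transposes.
    val mA = if (!transA) a.numRows else a.numCols
    val kA = if (!transA) a.numCols else a.numRows
    val kB = if (!transB) b.numRows else b.numCols
    val nB = if (!transB) b.numCols else b.numRows

    require(kA == kB, s"Inner dimensions differ. A: $kA, B: $kB")
    require(mA == c.numRows, s"Row counts differ. C: ${c.numRows}, A: $mA")
    require(nB == c.numCols, s"Column counts differ. C: ${c.numCols}, B: $nB")

    var j = 0
    while (j < nB) {
      var i = 0
      while (i < mA) {
        // Dot product of row i of op(A) with column j of op(B).
        var sum = 0.0
        var k = 0
        while (k < kA) {
          val aik = if (!transA) a(i, k) else a(k, i)
          val bkj = if (!transB) b(k, j) else b(j, k)
          sum += aik * bkj
          k += 1
        }
        val idx = i + j * c.numRows
        c.values(idx) = alpha * sum + beta * c.values(idx)
        i += 1
      }
      j += 1
    }
  }
}
```

Under the stated formula, `alpha == 0` reduces to `C := beta * C`; the sketch applies that scaling through the same loop rather than returning early.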
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765175 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765173 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765167 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765077 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user chouqin commented on a diff in the pull request:
https://github.com/apache/spark/pull/2435#discussion_r17765048
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
@@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator(
     }
     this
   }
+}
+
+/**
+ * DecisionTree statistics aggregator.
+ * This holds a flat array of statistics for a set of (nodes, features, bins)
+ * and helps with indexing.
+ *
+ * This instance of [[DTStatsAggregator]] is used when not subsampling features.
+ *
+ * @param numNodes Number of nodes to collect statistics for.
+ */
+private[tree] class DTStatsAggregatorFixedFeatures(
+    metadata: DecisionTreeMetadata,
+    numNodes: Int) extends DTStatsAggregator(metadata) {
+
+  /**
+   * Offset for each feature for calculating indices into the [[_allStats]] array.
+   * Mapping: featureIndex --> offset
+   */
+  private val featureOffsets: Array[Int] = {
+    metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)
+  }
+
+  /**
+   * Number of elements for each node, corresponding to stride between nodes in [[_allStats]].
+   */
+  private val nodeStride: Int = featureOffsets.last
+
+  /**
+   * Total number of elements stored in this aggregator.
+   */
+  def allStatsSize: Int = numNodes * nodeStride
+
+  /**
+   * Flat array of elements.
+   * Index for start of stats for a (node, feature, bin) is:
+   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
+   * Note: For unordered features, the left child stats precede the right child stats
+   *       in the binIndex order.
+   */
+  protected val _allStats: Array[Double] = new Array[Double](allStatsSize)
+
+  /**
+   * Get flat array of elements stored in this aggregator.
+   */
+  protected def allStats: Array[Double] = _allStats
+
+  /**
+   * Update the stats for a given (node, feature, bin) for ordered features, using the given label.
+   */
+  def update(
+      nodeIndex: Int,
+      featureIndex: Int,
+      binIndex: Int,
+      label: Double,
+      instanceWeight: Double): Unit = {
+    val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
+    impurityAggregator.update(_allStats, i, label, instanceWeight)
+  }
+
+  /**
+   * Pre-compute node offset for use with [[nodeUpdate]].
+   */
+  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
+
+  /**
+   * Faster version of [[update]].
+   * Update the stats for a given (node, feature, bin) for ordered features, using the given label.
--- End diff --
Just curious: is this function really faster than `update`? I think it just saves one multiplication, and in `DTStatsAggregatorSubsampledFeatures` it saves nothing.
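The question about the saved multiplication can be made concrete with a small sketch of the flat indexing described in the quoted doc comments. The class below is a hypothetical stand-in, not the PR's aggregator: `FlatStats`, `index`, and `indexFromOffset` are illustrative names, and the body of the real `nodeUpdate` is not visible in the quoted diff, so the offset-based variant here is an assumption about what it computes.

```scala
// Hypothetical sketch of the flat (node, feature, bin) layout; not code from the PR.
object FlatStatsSketch {
  final class FlatStats(numBins: Array[Int], statsSize: Int, numNodes: Int) {
    // featureIndex --> offset of that feature's block within one node.
    private val featureOffsets: Array[Int] =
      numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)

    // Number of elements per node, i.e. the stride between nodes.
    val nodeStride: Int = featureOffsets.last

    // Flat storage for all nodes.
    val allStats: Array[Double] = new Array[Double](numNodes * nodeStride)

    /** Full index computation, as in `update`: includes nodeIndex * nodeStride. */
    def index(nodeIndex: Int, featureIndex: Int, binIndex: Int): Int =
      nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize

    /** Pre-computed node offset, as in `getNodeOffset`. */
    def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride

    /** Index from a pre-computed offset: skips the nodeIndex * nodeStride product. */
    def indexFromOffset(nodeOffset: Int, featureIndex: Int, binIndex: Int): Int =
      nodeOffset + featureOffsets(featureIndex) + binIndex * statsSize
  }
}
```

In this sketch the two index paths differ by exactly one integer multiplication per call, which matches the observation in the comment; whether that is measurable would depend on how often the offset is reused per instance.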
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17765001 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...
Github user larryxiao commented on the pull request: https://github.com/apache/spark/pull/2446#issuecomment-56123977 Thanks Ankur! You are really efficient!
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17764905 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user brkyvz commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-56123894 @ScrapCodes THANKS A LOT! That fixed it! I didn't realize I didn't update my local repo for such a long time.
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17764836 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17764833 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -197,4 +201,452 @@ private[mllib] object BLAS extends Serializable { throw new IllegalArgumentException(s"scal doesn't support vector type ${x.getClass}.") } } + + // For level-3 routines, we use the native BLAS. + private def nativeBLAS: NetlibBLAS = { +if (_nativeBLAS == null) { + _nativeBLAS = NativeBLAS +} +_nativeBLAS + } + + /** + * C := alpha * A * B + beta * C + * @param transA whether to use the transpose of matrix A (true), or A itself (false). + * @param transB whether to use the transpose of matrix B (true), or B itself (false). + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +if (alpha == 0.0) { + logDebug("gemm: alpha is equal to 0. 
Returning C.") +} else { + A match { +case sparse: SparseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, sparse, dB, beta, C) +case sB: SparseMatrix => + throw new IllegalArgumentException(s"gemm doesn't support sparse-sparse matrix " + +s"multiplication") +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case dense: DenseMatrix => + B match { +case dB: DenseMatrix => gemm(transA, transB, alpha, dense, dB, beta, C) +case sB: SparseMatrix => gemm(transA, transB, alpha, dense, sB, beta, C) +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${B.getClass}.") + } +case _ => + throw new IllegalArgumentException(s"gemm doesn't support matrix type ${A.getClass}.") + } +} + } + + /** + * C := alpha * A * B + beta * C + * + * @param alpha a scalar to scale the multiplication A * B. + * @param A the matrix A that will be left multiplied to B. Size of m x k. + * @param B the matrix B that will be left multiplied by A. Size of k x n. + * @param beta a scalar that can be used to scale matrix C. + * @param C the resulting matrix C. Size of m x n. + */ + def gemm( + alpha: Double, + A: Matrix, + B: Matrix, + beta: Double, + C: DenseMatrix): Unit = { +gemm(false, false, alpha, A, B, beta, C) + } + + /** + * C := alpha * A * B + beta * C + * For `DenseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: DenseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols +val tAstr = if (!transA) "N" else "T" +val tBstr = if (!transB) "N" else "T" + +require(kA == kB, s"The columns of A don't match the rows of B. A: $kA, B: $kB") +require(mA == C.numRows, s"The rows of C don't match the rows of A. 
C: ${C.numRows}, A: $mA") +require(nB == C.numCols, + s"The columns of C don't match the columns of B. C: ${C.numCols}, A: $nB") + +nativeBLAS.dgemm(tAstr, tBstr, mA, nB, kA, alpha, A.values, A.numRows, B.values, B.numRows, + beta, C.values, C.numRows) + } + + /** + * C := alpha * A * B + beta * C + * For `SparseMatrix` A. + */ + private def gemm( + transA: Boolean, + transB: Boolean, + alpha: Double, + A: SparseMatrix, + B: DenseMatrix, + beta: Double, + C: DenseMatrix): Unit = { +val mA: Int = if (!transA) A.numRows else A.numCols +val nB: Int = if (!transB) B.numCols else B.numRows +val kA: Int = if (!transA) A.numCols else A.numRows +val kB: Int = if (!transB) B.numRows else B.numCols + +r
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user brndnmtthws commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56123833 I can emulate the YARN behaviour, but it seems better to just do the same thing with both Mesos and YARN. Thoughts? I can refactor this (including the YARN code) to make it consistent.
[GitHub] spark pull request: [SPARK-3418] Sparse Matrix support (CCS) and a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2294#issuecomment-56123782 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20562/consoleFull) for PR 2294 at commit [`88814ed`](https://github.com/apache/spark/commit/88814ed215e81bb6b945c8fff60bdb822ac2cb96). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56123690 Yes @willb I have the same concern. I think in other contexts "overhead" refers to the additional space, not the total space, so an "overhead fraction" of 0.15 means we use extra space that is equal to 15% of the original memory. It will be confusing if we expose a "fraction" config that has a value of 1.15. However, even with the latest changes this is still inconsistent with what we do for yarn. I think there is value in having an equivalent config for Mesos rather than one that is semantically different.
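The distinction drawn in this comment can be made concrete with a small sketch. This is not code from the PR: the object and method names are hypothetical, and only the values 0.15 and 1.15 come from the discussion. It shows that a "fraction" of 0.15 and a "multiplier" of 1.15 describe the same total, which is why giving a multiplier the name "fraction" would likely confuse users.

```scala
// Hypothetical helper names; only the 0.15 / 1.15 values come from the thread.
object OverheadSemantics {
  /** "Fraction" convention: the setting is the extra share added on top of the base. */
  def totalFromFraction(baseMb: Int, fraction: Double): Int =
    baseMb + math.round(baseMb * fraction).toInt

  /** "Multiplier" convention: the setting scales the base up to the total. */
  def totalFromMultiplier(baseMb: Int, multiplier: Double): Int =
    math.round(baseMb * multiplier).toInt
}
```

Under these assumptions, `OverheadSemantics.totalFromFraction(1000, 0.15)` and `OverheadSemantics.totalFromMultiplier(1000, 1.15)` request the same total amount of memory.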
[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2452
[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2452#issuecomment-56123334 Thanks @andrewor14, I'm merging this into master and 1.1!
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56123348 @JoshRosen It looks like @davies and I are on the same page. How would you like to proceed?
[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2452#issuecomment-56122936 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20561/consoleFull) for PR 2452 at commit [`d5190ca`](https://github.com/apache/spark/commit/d5190ca661991b11a6dc6682c92600beab032175). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2451#issuecomment-56122908 @brkyvz Just wondering: Which reference library are you using to determine the order of arguments for BLAS routines? E.g., it's different from [Netlib LAPACK](http://www.netlib.org/lapack/explore-html/d7/d2b/dgemm_8f.html).
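For reference when comparing argument orders: the Netlib routine linked above is declared as `DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)` and computes `C := alpha * op(A) * op(B) + beta * C`. The following is a naive pure-Scala sketch that takes its arguments in that same order on column-major arrays. It is an illustration only, not the MLlib code under review; the object name `NaiveGemm` is hypothetical, and unlike a real BLAS it does not special-case `beta == 0`.

```scala
object NaiveGemm {
  /**
   * C := alpha * op(A) * op(B) + beta * C on column-major arrays.
   * op(X) is X when the flag is "N" and the transpose of X when it is "T".
   * lda, ldb, ldc are the leading dimensions (stored row counts) of A, B, C.
   */
  def dgemm(
      transA: String, transB: String,
      m: Int, n: Int, k: Int,
      alpha: Double, a: Array[Double], lda: Int,
      b: Array[Double], ldb: Int,
      beta: Double, c: Array[Double], ldc: Int): Unit = {
    // Element (i, j) of op(X), read from column-major storage.
    def at(x: Array[Double], ld: Int, trans: String, i: Int, j: Int): Double =
      if (trans == "N") x(i + j * ld) else x(j + i * ld)

    for (j <- 0 until n; i <- 0 until m) {
      var sum = 0.0
      for (p <- 0 until k) {
        sum += at(a, lda, transA, i, p) * at(b, ldb, transB, p, j)
      }
      // Scale the product and accumulate into the existing entry of C.
      c(i + j * ldc) = alpha * sum + beta * c(i + j * ldc)
    }
  }
}
```

Note that the leading dimension describes how the matrix is stored, so it does not change when a transpose flag is set.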
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56122608 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20560/consoleFull) for PR 2378 at commit [`810f97f`](https://github.com/apache/spark/commit/810f97f53befdb262e55e736500d909b5f869f1a). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17764228 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala --- @@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator( } this } +} + +/** + * DecisionTree statistics aggregator. + * This holds a flat array of statistics for a set of (nodes, features, bins) + * and helps with indexing. + * + * This instance of [[DTStatsAggregator]] is used when not subsampling features. + * + * @param numNodes Number of nodes to collect statistics for. + */ +private[tree] class DTStatsAggregatorFixedFeatures( +metadata: DecisionTreeMetadata, +numNodes: Int) extends DTStatsAggregator(metadata) { + + /** + * Offset for each feature for calculating indices into the [[_allStats]] array. + * Mapping: featureIndex --> offset + */ + private val featureOffsets: Array[Int] = { +metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins) + } + + /** + * Number of elements for each node, corresponding to stride between nodes in [[_allStats]]. + */ + private val nodeStride: Int = featureOffsets.last + + /** + * Total number of elements stored in this aggregator. + */ + def allStatsSize: Int = numNodes * nodeStride + + /** + * Flat array of elements. + * Index for start of stats for a (node, feature, bin) is: + * index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize + * Note: For unordered features, the left child stats precede the right child stats + * in the binIndex order. + */ + protected val _allStats: Array[Double] = new Array[Double](allStatsSize) + + /** + * Get flat array of elements stored in this aggregator. + */ + protected def allStats: Array[Double] = _allStats + + /** + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. 
+ */ + def update( + nodeIndex: Int, + featureIndex: Int, + binIndex: Int, + label: Double, + instanceWeight: Double): Unit = { +val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize +impurityAggregator.update(_allStats, i, label, instanceWeight) + } + + /** + * Pre-compute node offset for use with [[nodeUpdate]]. + */ + def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride + + /** + * Faster version of [[update]]. + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. + * @param nodeOffset Pre-computed node offset from [[getNodeOffset]]. + */ + def nodeUpdate( + nodeOffset: Int, + nodeIndex: Int, --- End diff -- sorry, `nodeIndex` is needed in `DTStatsAggregatorSubsampledFeatures`. Close this comment.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56122505 @willb in the case of Yarn, the parameter is called "overhead" because it actually sets the amount of overhead you want to add to the requested heap memory. The PR being referenced just uses a multiplier to set the default overhead value; the user-visible parameter doesn't set the multiplier.
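The behaviour described here, where the user-visible setting is an absolute amount and a multiplier only supplies the default, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the object name, the method names, and the default multiplier value are hypothetical and are not taken from the YARN code or from the referenced PR.

```scala
object OverheadDefault {
  // Hypothetical default multiplier, consulted only when the user sets nothing.
  val DefaultOverheadMultiplier = 0.15

  /** Overhead in MB: the user's explicit amount if given, else a computed default. */
  def overheadMb(executorMemoryMb: Int, userOverheadMb: Option[Int]): Int =
    userOverheadMb.getOrElse(math.round(executorMemoryMb * DefaultOverheadMultiplier).toInt)

  /** Total memory to request: heap plus overhead. */
  def containerMb(executorMemoryMb: Int, userOverheadMb: Option[Int]): Int =
    executorMemoryMb + overheadMb(executorMemoryMb, userOverheadMb)
}
```

With this shape the multiplier never appears in user configuration, so the name "overhead" stays accurate for the value users actually set.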
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17764224 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala --- @@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator( } this } +} + +/** + * DecisionTree statistics aggregator. + * This holds a flat array of statistics for a set of (nodes, features, bins) + * and helps with indexing. + * + * This instance of [[DTStatsAggregator]] is used when not subsampling features. + * + * @param numNodes Number of nodes to collect statistics for. + */ +private[tree] class DTStatsAggregatorFixedFeatures( +metadata: DecisionTreeMetadata, +numNodes: Int) extends DTStatsAggregator(metadata) { + + /** + * Offset for each feature for calculating indices into the [[_allStats]] array. + * Mapping: featureIndex --> offset + */ + private val featureOffsets: Array[Int] = { +metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins) + } + + /** + * Number of elements for each node, corresponding to stride between nodes in [[_allStats]]. + */ + private val nodeStride: Int = featureOffsets.last + + /** + * Total number of elements stored in this aggregator. + */ + def allStatsSize: Int = numNodes * nodeStride + + /** + * Flat array of elements. + * Index for start of stats for a (node, feature, bin) is: + * index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize + * Note: For unordered features, the left child stats precede the right child stats + * in the binIndex order. + */ + protected val _allStats: Array[Double] = new Array[Double](allStatsSize) + + /** + * Get flat array of elements stored in this aggregator. + */ + protected def allStats: Array[Double] = _allStats + + /** + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. 
+ */ + def update( + nodeIndex: Int, + featureIndex: Int, + binIndex: Int, + label: Double, + instanceWeight: Double): Unit = { +val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize +impurityAggregator.update(_allStats, i, label, instanceWeight) + } + + /** + * Pre-compute node offset for use with [[nodeUpdate]]. + */ + def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride + + /** + * Faster version of [[update]]. + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. + * @param nodeOffset Pre-computed node offset from [[getNodeOffset]]. + */ + def nodeUpdate( + nodeOffset: Int, + nodeIndex: Int, --- End diff -- This is to stay consistent with the abstract class API. The other instance of this class (for subsampling features) does require nodeIndex. I agree this is awkward, but I wanted to keep the same API (so that the RF aggregation code does not need to know the difference).
[GitHub] spark pull request: [SPARK-1545] [mllib] Add Random Forests
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/2435#discussion_r17764131 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala --- @@ -189,6 +160,230 @@ private[tree] class DTStatsAggregator( } this } +} + +/** + * DecisionTree statistics aggregator. + * This holds a flat array of statistics for a set of (nodes, features, bins) + * and helps with indexing. + * + * This instance of [[DTStatsAggregator]] is used when not subsampling features. + * + * @param numNodes Number of nodes to collect statistics for. + */ +private[tree] class DTStatsAggregatorFixedFeatures( +metadata: DecisionTreeMetadata, +numNodes: Int) extends DTStatsAggregator(metadata) { + + /** + * Offset for each feature for calculating indices into the [[_allStats]] array. + * Mapping: featureIndex --> offset + */ + private val featureOffsets: Array[Int] = { +metadata.numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins) + } + + /** + * Number of elements for each node, corresponding to stride between nodes in [[_allStats]]. + */ + private val nodeStride: Int = featureOffsets.last + + /** + * Total number of elements stored in this aggregator. + */ + def allStatsSize: Int = numNodes * nodeStride + + /** + * Flat array of elements. + * Index for start of stats for a (node, feature, bin) is: + * index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize + * Note: For unordered features, the left child stats precede the right child stats + * in the binIndex order. + */ + protected val _allStats: Array[Double] = new Array[Double](allStatsSize) + + /** + * Get flat array of elements stored in this aggregator. + */ + protected def allStats: Array[Double] = _allStats + + /** + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. 
+ */ + def update( + nodeIndex: Int, + featureIndex: Int, + binIndex: Int, + label: Double, + instanceWeight: Double): Unit = { +val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize +impurityAggregator.update(_allStats, i, label, instanceWeight) + } + + /** + * Pre-compute node offset for use with [[nodeUpdate]]. + */ + def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride + + /** + * Faster version of [[update]]. + * Update the stats for a given (node, feature, bin) for ordered features, using the given label. + * @param nodeOffset Pre-computed node offset from [[getNodeOffset]]. + */ + def nodeUpdate( + nodeOffset: Int, + nodeIndex: Int, --- End diff -- `nodeIndex` is never used, because we already have `nodeOffset`, which is computed by `nodeIndex`.
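The flat-array indexing quoted in this diff can be tried out in isolation. The sketch below mirrors the formula from the diff's doc comment (`index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize`) and the `scanLeft` used to build the offsets. The wrapper object `FlatStatsIndex` and its method signatures are hypothetical and exist only for this illustration.

```scala
object FlatStatsIndex {
  /** Offset per feature: running sum of statsSize * numBins, plus a trailing total. */
  def featureOffsets(numBins: Array[Int], statsSize: Int): Array[Int] =
    numBins.scanLeft(0)((total, nBins) => total + statsSize * nBins)

  /** Start index of the stats for (node, feature, bin) in the flat array. */
  def index(
      nodeIndex: Int,
      featureIndex: Int,
      binIndex: Int,
      offsets: Array[Int],
      statsSize: Int): Int = {
    val nodeStride = offsets.last            // number of elements per node
    val nodeOffset = nodeIndex * nodeStride  // the value getNodeOffset pre-computes
    nodeOffset + offsets(featureIndex) + binIndex * statsSize
  }
}
```

Because `nodeOffset` already encodes `nodeIndex`, a caller that updates many bins for one node can compute it once, which is the motivation given for `nodeUpdate` in the diff.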
[GitHub] spark pull request: [Minor Hot Fix] Move a line in SparkSubmit to ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2452#issuecomment-56122218 lgtm
[GitHub] spark pull request: [WIP][SPARK-2816][SQL] Type-safe SQL Queries
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1759#issuecomment-56121439 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20558/consoleFull) for PR 1759 at commit [`677fa3d`](https://github.com/apache/spark/commit/677fa3d1cfffa7129cd5c8ab01b8918234cf4ce5). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ScalaReflection ` * ` case class Schema(dataType: DataType, nullable: Boolean)` * ` class Macros[C <: Context](val c: C) extends ScalaReflection ` * `trait InterpolatedItem ` * `case class InterpolatedUDF(index: Int, expr: c.Expr[Any], returnType: DataType)` * `case class InterpolatedTable(index: Int, expr: c.Expr[Any], schema: StructType)` * `case class RecSchema(name: String, index: Int, cType: DataType, tpe: Type)` * ` case class ImplSchema(name: String, tpe: Type, impl: Tree)` * `trait TypedSQL ` * ` implicit class SQLInterpolation(val strCtx: StringContext) `
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56121260 @andrewor14 This bothers me too, but in a slightly different way: calling the parameter "overhead" when it really refers to how to scale requested memory to accommodate anticipated overhead seems wrong. (115% overhead would be appalling indeed!)
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user brndnmtthws commented on a diff in the pull request: https://github.com/apache/spark/pull/2401#discussion_r17763516 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CommonProps.scala --- @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler.cluster.mesos + +import org.apache.spark.SparkContext + +object CommonProps { + def memoryOverheadFraction(sc: SparkContext) = +sc.conf.getOption("spark.executor.memory.overhead.fraction") + .getOrElse("1.15") --- End diff -- I was trying to keep the code looking clean, but I'll change `fraction` to `multiplier` in the method, if that seems more sensible.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-56120931 Updated as per @andrewor14's comments.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2401#discussion_r17763285 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CommonProps.scala --- @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler.cluster.mesos + +import org.apache.spark.SparkContext + +object CommonProps { + def memoryOverheadFraction(sc: SparkContext) = +sc.conf.getOption("spark.executor.memory.overhead.fraction") + .getOrElse("1.15") --- End diff -- In my mind a fraction is a value between 0 and 1, but 1.15 is not in this range.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56120575 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20564/consoleFull) for PR 2401 at commit [`d51b74f`](https://github.com/apache/spark/commit/d51b74f26b67ff3719b7ba3182d0693e45301a81). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user brndnmtthws commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56120347 Forgot to mention: I also set the executor CPUs correctly.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user brndnmtthws commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56120329 Updated as per @andrewor14's suggestions.