[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

yu-iskw Mon, 09 Nov 2015 11:55:08 -0800

Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5267#discussion_r44322215
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -0,0 +1,489 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors}
    +import org.apache.spark.mllib.util.MLUtils
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.storage.StorageLevel
    +
    +/**
    + * A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques"
    + * by Steinbach, Karypis, and Kumar, with modification to fit Spark.
    + * The algorithm starts from a single cluster that contains all points.
    + * Iteratively it finds divisible clusters on the bottom level and bisects each of them using
    + * k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible.
    + * The bisecting steps of clusters on the same level are grouped together to increase parallelism.
    + * If bisecting all divisible clusters on the bottom level would result more than `k` leaf clusters,
    + * larger clusters get higher priority.
    + *
    + * @param k the desired number of leaf clusters (default: 4). The actual number could be smaller if
    + *          there are no divisible leaf clusters.
    + * @param maxIterations the max number of k-means iterations to split clusters (default: 20)
    + * @param minDivisibleClusterSize the minimum number of points (if >= 1.0) or the minimum proportion
    + *                                of points (if < 1.0) of a divisible cluster (default: 1)
    + * @param seed a random seed (default: hash value of the class name)
    + *
    + * @see [[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
    + *     Steinbach, Karypis, and Kumar, A comparison of document clustering techniques,
    + *     KDD Workshop on Text Mining, 2000.]]
    + */
    +@Since("1.6.0")
    +@Experimental
    +class BisectingKMeans private (
    +    private var k: Int,
    +    private var maxIterations: Int,
    +    private var minDivisibleClusterSize: Double,
    +    private var seed: Long) extends Logging {
    +
    +  import BisectingKMeans._
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  @Since("1.6.0")
    +  def this() = this(4, 20, 1.0, classOf[BisectingKMeans].getName.##)
    +
    +  /**
    +   * Sets the desired number of leaf clusters (default: 4).
    +   * The actual number could be smaller if there are no divisible leaf clusters.
    +   */
    +  @Since("1.6.0")
    +  def setK(k: Int): this.type = {
    +    require(k > 0, s"k must be positive but got $k.")
    +    this.k = k
    +    this
    +  }
    +
    +  /**
    +   * Gets the desired number of leaf clusters.
    +   */
    +  @Since("1.6.0")
    +  def getK: Int = this.k
    +
    +  /**
    +   * Sets the max number of k-means iterations to split clusters (default: 20).
    +   */
    +  @Since("1.6.0")
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    require(maxIterations > 0, s"maxIterations must be positive but got $maxIterations.")
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +
    +  /**
    +   * Gets the max number of k-means iterations to split clusters.
    +   */
    +  @Since("1.6.0")
    +  def getMaxIterations: Int = this.maxIterations
    +
    +  /**
    +   * Sets the minimum number of points (if >= `1.0`) or the minimum proportion of points
    +   * (if < `1.0`) of a divisible cluster (default: 1).
    +   */
    +  @Since("1.6.0")
    +  def setMinDivisibleClusterSize(minDivisibleClusterSize: Double): this.type = {
    +    require(minDivisibleClusterSize > 0.0,
    +      s"minDivisibleClusterSize must be positive but got $minDivisibleClusterSize.")
    +    this.minDivisibleClusterSize = minDivisibleClusterSize
    +    this
    +  }
    +
    +  /**
    +   * Gets the minimum number of points (if >= `1.0`) or the minimum proportion of points
    +   * (if < `1.0`) of a divisible cluster.
    +   */
    +  @Since("1.6.0")
    +  def getMinDivisibleClusterSize: Double = minDivisibleClusterSize
    +
    +  /**
    +   * Sets the random seed (default: hash value of the class name).
    +   */
    +  @Since("1.6.0")
    +  def setSeed(seed: Long): this.type = {
    +    this.seed = seed
    +    this
    +  }
    +
    +  /**
    +   * Gets the random seed.
    +   */
    +  @Since("1.6.0")
    +  def getSeed: Long = this.seed
    +
    +  /**
    +   * Runs the bisecting k-means algorithm.
    +   * @param input RDD of vectors
    +   * @return model for the bisecting kmeans
    +   */
    +  @Since("1.6.0")
    +  def run(input: RDD[Vector]): BisectingKMeansModel = {
    +    if (input.getStorageLevel == StorageLevel.NONE) {
    +      logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
    +        + " its parent RDDs are also not cached.")
    +    }
    +    val d = input.map(_.size).first()
    +    logInfo(s"Feature dimension: $d.")
    +    // Compute and cache vector norms for fast distance computation.
    +    val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK)
    +    val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
    +    var assignments = vectors.map(v => (ROOT_INDEX, v))
    +    var activeClusters = summarize(d, assignments)
    +    val rootSummary = activeClusters(ROOT_INDEX)
    +    val n = rootSummary.size
    +    logInfo(s"Number of points: $n.")
    +    logInfo(s"Initial cost: ${rootSummary.cost}.")
    +    val minSize = if (minDivisibleClusterSize >= 1.0) {
    +      math.ceil(minDivisibleClusterSize).toLong
    +    } else {
    +      math.ceil(minDivisibleClusterSize * n).toLong
    +    }
    +    logInfo(s"The minimum number of points of a divisible cluster is $minSize.")
    +    var inactiveClusters = mutable.Seq.empty[(Long, ClusterSummary)]
    +    val random = new Random(seed)
    +    var numLeafClustersNeeded = k - 1
    +    var level = 1
    +    while (activeClusters.nonEmpty && numLeafClustersNeeded > 0 && level < 63) {
    +      // Divisible clusters are sufficiently large and have non-trivial cost.
    +      var divisibleClusters = activeClusters.filter { case (_, summary) =>
    +        (summary.size >= minSize) && (summary.cost > MLUtils.EPSILON * summary.size)
    +      }
    +      // If we don't need all divisible clusters, take the larger ones.
    +      if (divisibleClusters.size > numLeafClustersNeeded) {
    +        divisibleClusters = divisibleClusters.toSeq.sortBy { case (_, summary) =>
    +            -summary.size
    +          }.take(numLeafClustersNeeded)
    +          .toMap
    +      }
    +      if (divisibleClusters.nonEmpty) {
    +        val divisibleIndices = divisibleClusters.keys.toSet
    +        logInfo(s"Dividing ${divisibleIndices.size} clusters on level $level.")
    +        var newClusterCenters = divisibleClusters.flatMap { case (index, summary) =>
    +          val (left, right) = splitCenter(summary.center, random)
    +          Iterator((leftChildIndex(index), left), (rightChildIndex(index), right))
    +        }.map(identity) // workaround for a Scala bug (SI-7005) that produces a not serializable map
    +        var newClusters: Map[Long, ClusterSummary] = null
    +        var newAssignments: RDD[(Long, VectorWithNorm)] = null
    +        for (iter <- 0 until maxIterations) {
    --- End diff --
    
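    I think we should stop iterating early once the total cost has saturated, i.e. once it no longer decreases meaningfully between iterations, instead of always running all `maxIterations` iterations.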
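
    As a rough illustration of the kind of check meant here, below is a minimal, Spark-free sketch. The names `EarlyStopSketch`, `iterateUntilSaturated`, `tol`, and `step` are hypothetical and not part of this PR; `step` stands in for one k-means iteration that returns the total cost afterwards. The relative-change criterion is only one possible definition of "saturated".

```scala
object EarlyStopSketch {
  /**
   * Hypothetical helper: runs `step` at most `maxIterations` times and stops
   * early once the relative change in total cost drops to `tol` or below.
   *
   * @param step function from iteration index to the total cost after that iteration
   * @return (number of iterations actually run, last observed cost)
   */
  def iterateUntilSaturated(maxIterations: Int, tol: Double)(step: Int => Double): (Int, Double) = {
    require(maxIterations > 0, s"maxIterations must be positive but got $maxIterations.")
    require(tol >= 0.0, s"tol must be nonnegative but got $tol.")
    var prevCost = Double.PositiveInfinity
    var iter = 0
    var saturated = false
    while (iter < maxIterations && !saturated) {
      val cost = step(iter)
      // There is nothing to compare against on the first iteration.
      if (!prevCost.isInfinity) {
        // Assumed criterion: change in cost relative to the previous cost.
        saturated = math.abs(prevCost - cost) <= tol * math.max(prevCost, Double.MinPositiveValue)
      }
      prevCost = cost
      iter += 1
    }
    (iter, prevCost)
  }
}
```

    In the loop above, the body of `for (iter <- 0 until maxIterations)` would play the role of `step`, with the summed cost of the new clusters as its return value; whether a relative or an absolute tolerance fits better here is an open question.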



