[GitHub] spark pull request #17575: [SPARK-20265][MLlib] Improve Prefix'span pre-proc...

Syrux Mon, 10 Apr 2017 07:18:12 -0700

Github user Syrux commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17575#discussion_r110667171
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
    @@ -232,6 +200,68 @@ class PrefixSpan private (
     object PrefixSpan extends Logging {
     
       /**
    +   * This methods finds all frequent items in a input dataset.
    +   *
    +   * @param data Sequences of itemsets.
    +   * @param minCount The minimal number of sequence an item should be present in to be frequent
    +   *
    +   * @return An array of Item containing only frequent items.
    +   */
    +  private[fpm] def findFrequentItems[Item: ClassTag](data : RDD[Array[Array[Item]]],
    +                                                     minCount : Long): Array[Item] = {
    +
    +    data.flatMap { itemsets =>
    +      val uniqItems = mutable.Set.empty[Item]
    +      itemsets.foreach { _.foreach { item =>
    +        uniqItems += item
    +      }}
    +      uniqItems.toIterator.map((_, 1L))
    +    }.reduceByKey(_ + _).filter { case (_, count) =>
    +        count >= minCount
    +    }.sortBy(-_._2).map(_._1).collect()
    +  }
    +
    +  /**
    +   * This methods cleans the input dataset from un-frequent items, and translate it's item
    +   * to their corresponding Int identifier.
    +   *
    +   * @param data Sequences of itemsets.
    +   * @param itemToInt A map allowing translation of frequent Items to their Int Identifier.
    +   *                  The map should only contain frequent item.
    +   *
    +   * @return The internal repr of the inputted dataset. With properly placed zero delimiter.
    +   */
    +  private[fpm] def toDatabaseInternalRepr[Item: ClassTag](data : RDD[Array[Array[Item]]],
    +                                                        itemToInt : Map[Item, Int]):
    +  RDD[Array[Int]] = {
    +
    +    data.flatMap { itemsets =>
    +      val allItems = mutable.ArrayBuilder.make[Int]
    +      var containsFreqItems = false
    +      allItems += 0
    +      itemsets.foreach { itemsets =>
    +        val items = mutable.ArrayBuilder.make[Int]
    +        itemsets.foreach { item =>
    +          if (itemToInt.contains(item)) {
    +            items += itemToInt(item) + 1 // using 1-indexing in internal format
    +          }
    +        }
    +        val result = items.result()
    +        if (result.nonEmpty) {
    +          containsFreqItems = true
    +          allItems ++= result.sorted
    +          allItems += 0
    +        }
    +      }
    +      if (containsFreqItems) {
    --- End diff --
    
    I am not sure about the performance of prepending to an ArrayBuilder. I will check that first. Back in a few minutes.
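
For context, here is a minimal sketch of the two options that seem to be under discussion. It assumes plain `Int` itemsets and leaves out Spark and the item-to-Int translation. The object and method names (`PrependSketch`, `encodeAppendFirst`, `encodePrependLast`) are hypothetical and do not come from the pull request. `ArrayBuilder` has no prepend operation, so adding the leading 0 afterwards means calling `0 +: result` on the finished array, which allocates a new array and copies every element.

```scala
import scala.collection.mutable

object PrependSketch {

  // Variant A: write the leading 0 delimiter before anything else,
  // mirroring the approach in the quoted diff.
  def encodeAppendFirst(itemsets: Array[Array[Int]]): Array[Int] = {
    val all = mutable.ArrayBuilder.make[Int]
    var containsItems = false
    all += 0
    itemsets.foreach { itemset =>
      if (itemset.nonEmpty) {
        containsItems = true
        all ++= itemset.sorted
        all += 0
      }
    }
    // Discard the lone leading 0 when no itemset survived.
    if (containsItems) all.result() else Array.empty[Int]
  }

  // Variant B: build only the body, then prepend the 0 at the end.
  // The prepend is done on the resulting Array, not on the builder,
  // and costs one extra allocation plus a full copy.
  def encodePrependLast(itemsets: Array[Array[Int]]): Array[Int] = {
    val all = mutable.ArrayBuilder.make[Int]
    itemsets.foreach { itemset =>
      if (itemset.nonEmpty) {
        all ++= itemset.sorted
        all += 0
      }
    }
    val body = all.result()
    if (body.nonEmpty) 0 +: body else body
  }
}
```

Both variants produce the same encoding. The first needs a flag to detect empty output, the second pays for a copy of each sequence. Which one is cheaper in practice would have to be measured.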



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
