Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22236#discussion_r212940988
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -61,6 +61,18 @@ class AssociationRules private[fpm] (
        */
       @Since("1.5.0")
       def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): 
RDD[Rule[Item]] = {
    +    run(freqItemsets, Map.empty[Item, Long])
    +  }
    +
    +  /**
    +   * Computes the association rules with confidence above `minConfidence`.
    +   * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
    +   * @param itemSupport map of the support for each single item
    +   * @return an `RDD[Rule[Item]]` containing the association rules. These
    +   *         rules are also able to compute the lift metric.
    +   */
    +  @Since("2.4.0")
    +  def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]],
    +      itemSupport: Map[Item, Long]): RDD[Rule[Item]] = {
    --- End diff --
    
    Actually, we can compute it by filtering the `freqItemsets` and keeping the 
itemsets of length one. The reason I haven't done that is to avoid a 
performance regression: since we already computed this earlier, recomputing 
it here seems an unneeded waste.
    
    I agree I could use this approach to compute them when loading the model, 
though. I didn't do it there either, for optimization reasons, since we would 
need to read the `freqItemsets` dataset (which is surely much larger) twice.
    
    If you think this is needed, I can change it, but for performance reasons I 
prefer the current approach, so as not to affect existing users who are not 
interested in the `lift` metric. I don't think any compatibility issue can 
arise, since if the value is not present, null is returned for the lift 
metric.
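    
    For reference, the filtering I mention above could be sketched roughly like 
this. This is a minimal plain-Scala sketch, not the actual Spark code: the 
`FreqItemset` case class here is a stand-in for 
`org.apache.spark.mllib.fpm.FPGrowth.FreqItemset`, and in Spark this would be 
an RDD transformation rather than an operation on a local `Seq`:

    ```scala
    // Stand-in for Spark's FPGrowth.FreqItemset (an itemset and its frequency).
    case class FreqItemset[Item](items: Array[Item], freq: Long)

    object ItemSupportSketch {
      // Derive per-item support counts from the frequent itemsets by keeping
      // only the itemsets of length one: a singleton's frequency is exactly
      // the support count of that item.
      def itemSupport[Item](freqItemsets: Seq[FreqItemset[Item]]): Map[Item, Long] =
        freqItemsets
          .filter(_.items.length == 1)
          .map(fi => fi.items.head -> fi.freq)
          .toMap

      def main(args: Array[String]): Unit = {
        val fis = Seq(
          FreqItemset(Array("a"), 5L),
          FreqItemset(Array("b"), 3L),
          FreqItemset(Array("a", "b"), 2L))
        // Only the singleton itemsets contribute: Map(a -> 5, b -> 3)
        println(ItemSupportSketch.itemSupport(fis))
      }
    }
    ```

    As said above, this recomputes information we already had while building 
the model, which is why I'd rather pass the precomputed `itemSupport` map in.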
