Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22236#discussion_r212940988

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---

    @@ -61,6 +61,18 @@ class AssociationRules private[fpm] (
        */
       @Since("1.5.0")
       def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
    +    run(freqItemsets, Map.empty[Item, Long])
    +  }
    +
    +  /**
    +   * Computes the association rules with confidence above `minConfidence`.
    +   * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
    +   * @return a `Set[Rule[Item]]` containing the association rules. The rules will be able to
    +   *         compute also the lift metric.
    +   */
    +  @Since("2.4.0")
    +  def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]],
    +      itemSupport: Map[Item, Long]): RDD[Rule[Item]] = {

--- End diff --

Actually we can compute it by filtering the `freqItemsets` and taking the itemsets of length one. The reason I haven't done that is to avoid a performance regression: since we have already computed this earlier, recomputing it here seems an unnecessary waste. I agree, though, that I could use this approach when loading the model. In that case too I avoided it for optimization reasons, since we would need to read the freqItemsets dataset (which is surely much larger) twice. If you think this is needed, I can change it, but for performance reasons I prefer the current approach, so as not to affect existing users who are not interested in the `lift` metric. I don't think any compatibility issue can arise: if the value is not present, null is returned for the lift metric.
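To make the alternative being discussed concrete, here is a hedged sketch of recovering per-item supports by filtering the length-one frequent itemsets, and of how lift falls out of those supports. Plain Scala collections stand in for the `RDD[FreqItemset[Item]]` used in Spark, and all names (`LiftSketch`, `numTransactions`, the sample data) are illustrative, not the PR's actual implementation:

```scala
// Hedged sketch, NOT the Spark implementation: shows the "filter itemsets
// of length one" approach mentioned in the comment, on local collections.
object LiftSketch {
  // (items, frequency) pairs, as FP-Growth would produce them (sample data)
  val freqItemsets: Seq[(Set[String], Long)] = Seq(
    (Set("a"), 6L),
    (Set("b"), 4L),
    (Set("a", "b"), 3L)
  )
  val numTransactions = 10L // hypothetical total transaction count

  // The discussed alternative: instead of recomputing supports from the raw
  // data, recover them by keeping only itemsets with exactly one item.
  val itemSupport: Map[String, Long] =
    freqItemsets.collect { case (items, freq) if items.size == 1 =>
      items.head -> freq
    }.toMap

  // lift(X => Y) = confidence(X => Y) / P(Y)
  // Returns None when the consequent's support is unavailable, mirroring
  // the "null lift when the value is not present" behavior described above.
  def lift(antecedentFreq: Long, ruleFreq: Long, consequent: String): Option[Double] =
    itemSupport.get(consequent).map { consFreq =>
      (ruleFreq.toDouble / antecedentFreq) / (consFreq.toDouble / numTransactions)
    }
}
```

With the sample data, the rule `a => b` has confidence 3/6 = 0.5 and `P(b)` = 4/10 = 0.4, giving a lift of 1.25; for an item missing from `itemSupport` no lift can be computed, which is the compatibility case the comment refers to.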