[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864872#comment-15864872
 ] 

Nick Pentreath commented on SPARK-14503:
----------------------------------------

It seems {{PrefixSpan}} even takes a different input type, {{Array[Array[T]]}}, 
versus {{Array[T]}} for {{FPGrowth}}, so it may be tricky to unify them.

However we do have the case where e.g. {{QuantileDiscretizer}} returns a 
{{Bucketizer}} as {{Model}} from {{fit}}. In that case {{Bucketizer}} can be 
instantiated directly and independently, but it could in theory be the case 
that some other estimator returns a {{Bucketizer}} as its model.

So we could perhaps think about both {{FPGrowth}} and {{PrefixSpan}} returning 
an {{AssociationRuleModel}} from {{fit}}. It could work if the input can be 
generalized to {{Seq[T]}} where for {{FPGrowth}} it would be {{Seq[Item]}} and 
for {{PrefixSpan}} it would be {{Seq[Seq[Item]]}}. The output of {{transform}} 
for the model would be the predicted items as above. It would expose 
{{getFreqItems}} and {{getAssociationRules}} both returning a {{DataFrame}}.
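
To make the idea concrete, here is a rough plain-Scala sketch of what such a shared model might look like. It is only an illustration and not a proposed implementation: {{PatternEstimator}}, {{FPGrowthLike}}, {{PrefixSpanLike}}, {{FreqItemset}} and {{Rule}} are hypothetical names, plain {{Seq}} values stand in for {{DataFrame}}s, and the {{transform}} semantics (return the consequents of every rule whose antecedent is contained in the input, minus items already present) are my assumption about what "predicted items" would mean.

```scala
// Hypothetical sketch only: none of these types are Spark classes,
// and plain Scala collections stand in for DataFrames.

// A frequent pattern together with its support count.
final case class FreqItemset[Item](items: Seq[Item], freq: Long)

// A rule "antecedent => consequent" with its confidence.
final case class Rule[Item](
    antecedent: Seq[Item],
    consequent: Seq[Item],
    confidence: Double)

// The shared model that both estimators would return from fit.
final class AssociationRuleModel[Item](
    val getFreqItems: Seq[FreqItemset[Item]],
    val getAssociationRules: Seq[Rule[Item]]) {

  // Assumed semantics: collect the consequents of every rule whose
  // antecedent is fully contained in the observed items, then drop
  // items that were already observed and remove duplicates.
  def transform(observed: Seq[Item]): Seq[Item] = {
    val seen = observed.toSet
    getAssociationRules
      .filter(_.antecedent.forall(seen.contains))
      .flatMap(_.consequent)
      .filterNot(seen.contains)
      .distinct
  }
}

// Each input row is a Seq[T]; only T differs between the two estimators.
trait PatternEstimator[T, Item] {
  def fit(dataset: Seq[Seq[T]]): AssociationRuleModel[Item]
}

// FPGrowth-style input: each row is Seq[Item].
trait FPGrowthLike[Item] extends PatternEstimator[Item, Item]

// PrefixSpan-style input: each row is Seq[Seq[Item]].
trait PrefixSpanLike[Item] extends PatternEstimator[Seq[Item], Item]

// A hand-built model, just to exercise transform.
object Demo {
  val model = new AssociationRuleModel[String](
    getFreqItems = Seq(
      FreqItemset(Seq("bread"), 5L),
      FreqItemset(Seq("bread", "butter"), 4L)),
    getAssociationRules = Seq(
      Rule(Seq("bread"), Seq("butter"), 0.8),
      Rule(Seq("bread", "butter"), Seq("jam"), 0.6),
      Rule(Seq("milk"), Seq("cereal"), 0.7)))
}
```

With this hand-built model, {{Demo.model.transform(Seq("bread", "butter"))}} yields {{Seq("jam")}}. The sketch says nothing about whether sequential patterns from {{PrefixSpan}} can sensibly be turned into rules of this shape, which is the open question.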

Is there something in the nature of {{PrefixSpan}} vs {{FPGrowth}} that makes 
this too difficult? (I'll have to go read the papers when I get some time!)

Having said that, it could be pretty complex to try to support this. If so, 
unless there's a compelling argument otherwise, I'd go for [~josephkb]'s 
suggestion above and hide the association rule class for now (it can be exposed 
later as needed). Then {{PrefixSpan}} will be totally independent and return 
its own {{PrefixSpanModel}} (which may also expose a {{transform}} method with 
similar semantics but different internals).

> spark.ml Scala API for FPGrowth
> -------------------------------
>
>                 Key: SPARK-14503
>                 URL: https://issues.apache.org/jira/browse/SPARK-14503
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



