Github user ludatabricks commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21265#discussion_r187144226

--- Diff: python/pyspark/ml/fpm.py ---

@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",

     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+    Efficiently by Prefix-Projected Pattern Growth
+    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(dataset,
+                                       sequenceCol,
+                                       minSupport,
+                                       maxPatternLength,
+                                       maxLocalProjDBSize):
+        """
+        .. note:: Experimental
+
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
+        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+                            column are ignored.
+        :param minSupport: The minimal support level of the sequential pattern, any pattern that
+                           appears more than (minSupport * size-of-the-dataset) times will be
+                           output (recommended value: `0.1`).
+        :param maxPatternLength: The maximal length of the sequential pattern
+                                 (recommended value: `10`).
+        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+                                   internal storage format) allowed in a projected database before
+                                   local processing. If a projected database exceeds this size,
+                                   another iteration of distributed prefix growth is run
+                                   (recommended value: `32000000`).
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                 - `sequence: Seq[Seq[T]]` (T is the item type)
+                 - `freq: Long`
+
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

One question: should we add something in the example to show a special case, or how these parameters work? For example:

- add a pattern that is longer than ``maxPatternLength``
- add nulls in the column
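To make the semantics of `minSupport` and `maxPatternLength` concrete, here is a plain-Python sketch of frequent sequential pattern mining over sequences of itemsets. It is **not** the PrefixSpan algorithm from the diff: it uses naive level-wise candidate growth instead of Spark's distributed prefix-projection, it only grows patterns by single-item itemsets, and it treats "frequent" as count >= minSupport * dataset-size. All function names and the sample data below are illustrative, not part of the PySpark API.

```python
def supports(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs as a subsequence of
    `sequence`: each pattern itemset matches a strictly later sequence
    itemset that contains it as a subset."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= set(itemset):
            i += 1
    return i == len(pattern)


def frequent_patterns(sequences, min_support, max_pattern_length):
    """Naive level-wise enumeration of frequent sequential patterns.

    Simplifications versus real PrefixSpan: candidates are grown one
    single-item itemset at a time (multi-item itemsets inside a pattern
    are never generated), and everything runs in one local pass per
    level rather than via distributed prefix-projected databases.
    """
    n = len(sequences)
    min_count = min_support * n  # "at least minSupport fraction" reading
    items = sorted({x for seq in sequences for itemset in seq for x in itemset})

    # Level 1: all single-item patterns.
    level = [[frozenset([x])] for x in items]
    result = []
    for length in range(1, max_pattern_length + 1):
        frequent = []
        for p in level:
            count = sum(supports(s, p) for s in sequences)
            if count >= min_count:
                frequent.append((p, count))
        result.extend(frequent)
        if length == max_pattern_length:
            break  # maxPatternLength cuts off growth here
        # Grow only the frequent patterns (Apriori-style pruning).
        level = [p + [frozenset([x])] for p, _ in frequent for x in items]
    return result
```

With `max_pattern_length=1` only single-itemset patterns survive, which is one way the reviewer's suggestion (a pattern longer than ``maxPatternLength``) could be demonstrated in the doctest.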