from:"zhangjiajin"

[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...

2015-08-04 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7937#discussion_r36264370
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala ---
@@ -35,83 +35,81 @@ class PrefixSpanSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 */
 
 val sequences = Array(
-  Array(1, -1, 3, -1, 4, -1, 5),
-  Array(2, -1, 3, -1, 1),
-  Array(2, -1, 4, -1, 1),
-  Array(3, -1, 1, -1, 3, -1, 4, -1, 5),
-  Array(3, -1, 4, -1, 4, -1, 3),
-  Array(6, -1, 5, -1, 3))
+  Array(0, 1, 0, 3, 0, 4, 0, 5, 0),
+  Array(0, 2, 0, 3, 0, 1, 0),
+  Array(0, 2, 0, 4, 0, 1, 0),
+  Array(0, 3, 0, 1, 0, 3, 0, 4, 0, 5, 0),
+  Array(0, 3, 0, 4, 0, 4, 0, 3, 0),
+  Array(0, 6, 0, 5, 0, 3, 0))
 
 val rdd = sc.parallelize(sequences, 2).cache()
 
-val prefixspan = new PrefixSpan()
-  .setMinSupport(0.33)
-  .setMaxPatternLength(50)
-val result1 = prefixspan.run(rdd)
+val result1 = PrefixSpan.genFreqPatterns(
+  rdd, minCount = 2L, maxPatternLength = 50, maxLocalProjDBSize = 16L)
 val expectedValue1 = Array(
-  (Array(1), 4L),
-  (Array(1, -1, 3), 2L),
-  (Array(1, -1, 3, -1, 4), 2L),
-  (Array(1, -1, 3, -1, 4, -1, 5), 2L),
-  (Array(1, -1, 3, -1, 5), 2L),
-  (Array(1, -1, 4), 2L),
-  (Array(1, -1, 4, -1, 5), 2L),
-  (Array(1, -1, 5), 2L),
-  (Array(2), 2L),
-  (Array(2, -1, 1), 2L),
-  (Array(3), 5L),
-  (Array(3, -1, 1), 2L),
-  (Array(3, -1, 3), 2L),
-  (Array(3, -1, 4), 3L),
-  (Array(3, -1, 4, -1, 5), 2L),
-  (Array(3, -1, 5), 2L),
-  (Array(4), 4L),
-  (Array(4, -1, 5), 2L),
-  (Array(5), 3L)
+  (Array(0, 1, 0), 4L),
+  (Array(0, 1, 0, 3, 0), 2L),
+  (Array(0, 1, 0, 3, 0, 4, 0), 2L),
+  (Array(0, 1, 0, 3, 0, 4, 0, 5, 0), 2L),
+  (Array(0, 1, 0, 3, 0, 5, 0), 2L),
+  (Array(0, 1, 0, 4, 0), 2L),
+  (Array(0, 1, 0, 4, 0, 5, 0), 2L),
+  (Array(0, 1, 0, 5, 0), 2L),
+  (Array(0, 2, 0), 2L),
+  (Array(0, 2, 0, 1, 0), 2L),
+  (Array(0, 3, 0), 5L),
+  (Array(0, 3, 0, 1, 0), 2L),
+  (Array(0, 3, 0, 3, 0), 2L),
+  (Array(0, 3, 0, 4, 0), 3L),
+  (Array(0, 3, 0, 4, 0, 5, 0), 2L),
+  (Array(0, 3, 0, 5, 0), 2L),
+  (Array(0, 4, 0), 4L),
+  (Array(0, 4, 0, 5, 0), 2L),
+  (Array(0, 5, 0), 3L)
 )
 compareInternalResults(expectedValue1, result1.collect())
 
-prefixspan.setMinSupport(0.5).setMaxPatternLength(50)
-val result2 = prefixspan.run(rdd)
+val result2 = PrefixSpan.genFreqPatterns(
+  rdd, minCount = 3, maxPatternLength = 50, maxLocalProjDBSize = 32L)
 val expectedValue2 = Array(
-  (Array(1), 4L),
-  (Array(3), 5L),
-  (Array(3, -1, 4), 3L),
-  (Array(4), 4L),
-  (Array(5), 3L)
+  (Array(0, 1, 0), 4L),
+  (Array(0, 3, 0), 5L),
+  (Array(0, 3, 0, 4, 0), 3L),
+  (Array(0, 4, 0), 4L),
+  (Array(0, 5, 0), 3L)
 )
 compareInternalResults(expectedValue2, result2.collect())
 
-prefixspan.setMinSupport(0.33).setMaxPatternLength(2)
-val result3 = prefixspan.run(rdd)
+val result3 = PrefixSpan.genFreqPatterns(
+  rdd, minCount = 2, maxPatternLength = 2, maxLocalProjDBSize = 32L)
 val expectedValue3 = Array(
-  (Array(1), 4L),
-  (Array(1, -1, 3), 2L),
-  (Array(1, -1, 4), 2L),
-  (Array(1, -1, 5), 2L),
-  (Array(2, -1, 1), 2L),
-  (Array(2), 2L),
-  (Array(3), 5L),
-  (Array(3, -1, 1), 2L),
-  (Array(3, -1, 3), 2L),
-  (Array(3, -1, 4), 3L),
-  (Array(3, -1, 5), 2L),
-  (Array(4), 4L),
-  (Array(4, -1, 5), 2L),
-  (Array(5), 3L)
+  (Array(0, 1, 0), 4L),
+  (Array(0, 1, 0, 3, 0), 2L),
+  (Array(0, 1, 0, 4, 0), 2L),
+  (Array(0, 1, 0, 5, 0), 2L),
+  (Array(0, 2, 0, 1, 0), 2L),
+  (Array(0, 2, 0), 2L),
+  (Array(0, 3, 0), 5L),
+  (Array(0, 3, 0, 1, 0), 2L),
+  (Array(0, 3, 0, 3, 0), 2L),
+  (Array(0, 3, 0, 4, 0), 3L),
+  (Array(0, 3, 0, 5, 0), 2L),
+  (Array(0, 4, 0), 4L),
+  (Array(0, 4, 0, 5, 0), 2L),
+  (Array(0, 5, 0), 3L)
 )
 compareInternalResults(expectedValue3, result3.collect())
   }
 
   test("PrefixSpan internal (integer seq, -1 delim) run, variable-size 
itemsets") {
 val sequences = Array(
-  Array(1, -1, 1, 2, 3, -1, 1, 3, -1, 4, -1, 3, 6),
-  Array(1, 4, -1, 3, -1, 2, 3, -1, 1, 5),
-  Array(5, 6, -1

[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...

2015-08-04 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7937#discussion_r36263427
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala ---
@@ -35,83 +35,81 @@ class PrefixSpanSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 */
 
 val sequences = Array(
-  Array(1, -1, 3, -1, 4, -1, 5),
-  Array(2, -1, 3, -1, 1),
-  Array(2, -1, 4, -1, 1),
-  Array(3, -1, 1, -1, 3, -1, 4, -1, 5),
-  Array(3, -1, 4, -1, 4, -1, 3),
-  Array(6, -1, 5, -1, 3))
+  Array(0, 1, 0, 3, 0, 4, 0, 5, 0),
--- End diff --

If we use Array(1, 0, 3, 0, 4, 0, 5), we can save two integer space.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-29 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35828463
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -44,27 +46,43 @@ class PrefixSpan private (
 private var maxPatternLength: Int) extends Logging with Serializable {
 
   /**
+   * The maximum number of items allowed in a projected database before 
local processing. If a
+   * projected database exceeds this size, another iteration of 
distributed PrefixSpan is run.
+   */
+  private val maxLocalProjDBSize: Long = 1
+
+  /**
* Constructs a default instance with default parameters
* {minSupport: `0.1`, maxPatternLength: `10`}.
*/
   def this() = this(0.1, 10)
 
   /**
+   * Get the minimal support (i.e. the frequency of occurrence before a 
pattern is considered
+   * frequent).
+   */
+  def getMinSupport(): Double = this.minSupport
+
+  /**
* Sets the minimal support level (default: `0.1`).
*/
   def setMinSupport(minSupport: Double): this.type = {
-require(minSupport >= 0 && minSupport <= 1,
-  "The minimum support value must be between 0 and 1, including 0 and 
1.")
+require(minSupport >= 0 && minSupport <= 1, "The minimum support value 
must be in [0, 1].")
 this.minSupport = minSupport
 this
   }
 
   /**
+   * Gets the maximal pattern length (i.e. the length of the longest 
sequential pattern to consider.
+   */
+  def getMaxPatternLength(): Double = this.maxPatternLength
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-29 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35828436
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -44,27 +46,43 @@ class PrefixSpan private (
 private var maxPatternLength: Int) extends Logging with Serializable {
 
   /**
+   * The maximum number of items allowed in a projected database before 
local processing. If a
+   * projected database exceeds this size, another iteration of 
distributed PrefixSpan is run.
+   */
+  private val maxLocalProjDBSize: Long = 1
+
+  /**
* Constructs a default instance with default parameters
* {minSupport: `0.1`, maxPatternLength: `10`}.
*/
   def this() = this(0.1, 10)
 
   /**
+   * Get the minimal support (i.e. the frequency of occurrence before a 
pattern is considered
+   * frequent).
+   */
+  def getMinSupport(): Double = this.minSupport
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-29 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35828391
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -44,27 +46,43 @@ class PrefixSpan private (
 private var maxPatternLength: Int) extends Logging with Serializable {
 
   /**
+   * The maximum number of items allowed in a projected database before 
local processing. If a
+   * projected database exceeds this size, another iteration of 
distributed PrefixSpan is run.
+   */
+  private val maxLocalProjDBSize: Long = 1
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-125079131
  
@feynmanliang You are right, it is worth to prevent executor failure. I 
very much agree with it. And I tested it according to your suggestion.
The results of the performance test are not stable, I try to find out the 
reason, perhaps due to the environment, I will post it after solving this 
problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-125075528
  
@feynmanliang About splitPrefixSuffixPairs,  I compared these two methods. 
I find your method's running time more than mine. And the result is not 
correct. I don't know why it was so, please check it, thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35504114
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  largePrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
largePrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
+  largePrefixSuffixPairs.unpersist()
+  val splitedPrefixSuffixPairs = 
splitPrefixSuffixPairs(nextPrefixSuffixPairs)
+  largePrefixSuffixPairs = splitedPrefixSuffixPairs._2
+  largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+  smallPrefixSuffixPairs = smallPrefixSuffixPairs ++ 
splitedPrefixSuffixPairs._1
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+  patternLength = patternLength + 1
+}
+if (smallPrefixSuffixPairs.count() > 0) {
+  val projectedDatabase = smallPrefixSuffixPairs
+.map(x => (x._1.toSeq, x._2))
+.groupByKey()
+.map(x => (x._1.toArray, x._2.toArray))
+  val nextPatternAndCounts = getPatternsInLocal(minCount, 
projectedDatabase)
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+}
+allPatternAndCounts.map { case (pattern, count) => (pattern.toArray, 
count) }
+  }
+
+
+  /**
+   * Split prefix suffix pairs to two parts:
+   * suffixes' size less than maxSuffixesBeforeLocalProcessing and
+   * suffixes' size more than maxSuffixesBeforeLocalProcessing
+   * @param prefixSuffixPairs prefix (length n) and suffix pairs,
+   * @return small size prefix suffix pairs and big size prefix suffix 
pairs
+   * (RDD[prefix, suffix], RDD[prefix, suffix ])
+   */
+  private def splitPrefixSuffixPairs(
+  prefixSuffixPairs: RDD[(ArrayBuffer[Int], Array[Int])]):
+  (RDD[(ArrayBuffer[Int], Array[Int])], RDD[(ArrayBuffer[Int], 
Array[Int])]) = {
+val suffixSizeMap = prefixSuffixPairs
+  .map(x => (x._1, x._2.length))
+  .reduceByKey(_ + _)
--- End diff --

@feynmanliang  I compared these two methods. I find your method's running 
time more  than mine. And the result is not correct. I don't know why it was 
so, please check it, thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35502442
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
--- End diff --

OK, I tested, and the prefixSuffixPairs.persist (7s)  is better than 
largePrefixSuffixPairs.persist (11s) .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500886
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -43,6 +45,8 @@ class PrefixSpan private (
 private var minSupport: Double,
 private var maxPatternLength: Int) extends Logging with Serializable {
 
+  private val maxSuffixesBeforeLocalProcessing: Long = 1
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500812
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  largePrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
largePrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
+  largePrefixSuffixPairs.unpersist()
+  val splitedPrefixSuffixPairs = 
splitPrefixSuffixPairs(nextPrefixSuffixPairs)
+  largePrefixSuffixPairs = splitedPrefixSuffixPairs._2
+  largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+  smallPrefixSuffixPairs = smallPrefixSuffixPairs ++ 
splitedPrefixSuffixPairs._1
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+  patternLength = patternLength + 1
+}
+if (smallPrefixSuffixPairs.count() > 0) {
+  val projectedDatabase = smallPrefixSuffixPairs
+.map(x => (x._1.toSeq, x._2))
+.groupByKey()
+.map(x => (x._1.toArray, x._2.toArray))
+  val nextPatternAndCounts = getPatternsInLocal(minCount, 
projectedDatabase)
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500699
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  largePrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
largePrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
+  largePrefixSuffixPairs.unpersist()
+  val splitedPrefixSuffixPairs = 
splitPrefixSuffixPairs(nextPrefixSuffixPairs)
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500517
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  largePrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
largePrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500511
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500442
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500446
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  largePrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
largePrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
+  largePrefixSuffixPairs.unpersist()
+  val splitedPrefixSuffixPairs = 
splitPrefixSuffixPairs(nextPrefixSuffixPairs)
+  largePrefixSuffixPairs = splitedPrefixSuffixPairs._2
+  largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+  smallPrefixSuffixPairs = smallPrefixSuffixPairs ++ 
splitedPrefixSuffixPairs._1
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+  patternLength = patternLength + 1
+}
+if (smallPrefixSuffixPairs.count() > 0) {
+  val projectedDatabase = smallPrefixSuffixPairs
+.map(x => (x._1.toSeq, x._2))
+.groupByKey()
+.map(x => (x._1.toArray, x._2.toArray))
+  val nextPatternAndCounts = getPatternsInLocal(minCount, 
projectedDatabase)
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+}
+allPatternAndCounts.map { case (pattern, count) => (pattern.toArray, 
count) }
+  }
+
+
+  /**
+   * Split prefix suffix pairs to two parts:
+   * suffixes' size less than maxSuffixesBeforeLocalProcessing and
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-26 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r35500444
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,106 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var (smallPrefixSuffixPairs, largePrefixSuffixPairs) =
+  splitPrefixSuffixPairs(prefixSuffixPairs)
+largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  largePrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
largePrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
+  largePrefixSuffixPairs.unpersist()
+  val splitedPrefixSuffixPairs = 
splitPrefixSuffixPairs(nextPrefixSuffixPairs)
+  largePrefixSuffixPairs = splitedPrefixSuffixPairs._2
+  largePrefixSuffixPairs.persist(StorageLevel.MEMORY_AND_DISK)
+  smallPrefixSuffixPairs = smallPrefixSuffixPairs ++ 
splitedPrefixSuffixPairs._1
+  allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-07-25 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7594#issuecomment-124829693
  
@mengxr I compared two approaches (Array[Array[Int]] and Array[Int]), I 
found Array[Array[Int]] performance is better than Array[Int].
The dataset I used is BMSWebView2( KDD CUP 2000). When the support is 1500, 
the running time of Array[Array[Int]] is 44s, the running time of Array[Int] is 
69s. These time values are the average of three measurements.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-07-24 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7646#issuecomment-124565961
  
@mengxr About benchmark data sets for performance testing, can you give 
some suggestions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-07-24 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7646#issuecomment-124564177
  
@mengxr @feynmanliang Please review. Thx.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-07-24 Thread zhangjiajin

GitHub user zhangjiajin opened a pull request:

https://github.com/apache/spark/pull/7646

[SPARK-8999][MLlib]Support non-temporal sequence in PrefixSpan (Array[Int])

Support non-temporal sequence in PrefixSpan (Array[Int])

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhangjiajin/spark multiItems_2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7646.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7646


commit c6ceb63a557c1d9c3dcccf44a16ab32528b012f2
Author: zhangjiajin 
Date:   2015-07-07T07:30:10Z

Add new algorithm PrefixSpan and test file.

commit d9d8137c157374f9d463c0ee387536ed7448ca5f
Author: zhang jiajin 
Date:   2015-07-08T10:22:16Z

Delete Prefixspan.scala

Use PrefixSpan.scala instead of Prefixspan.scala. Delete Prefixspan.scala

commit c1d13d01f218b5f7ad3a68fa29202b2839090f7f
Author: zhang jiajin 
Date:   2015-07-08T10:23:31Z

Delete PrefixspanSuite.scala

Use PrefixSpanSuite.scala instead of PrefixspanSuite.scala, Delete 
PrefixspanSuite.scala.

commit a7e50d43fac419e1aba3e668b71b1d08bef0
Author: zhangjiajin 
Date:   2015-07-14T02:21:04Z

Add feature: Collect enough frequent prefixes before projection in 
PrefixSpan.

commit f06772fc1347fe412a1ccdb9c27df8d0573ca462
Author: zhangjiajin 
Date:   2015-07-14T02:46:05Z

fix a scala style error.

commit b572f54147652077dc198eb6ecf041b0ba8bc63e
Author: zhangjiajin 
Date:   2015-07-15T02:57:41Z

initialize file before rebase.

commit 216ab0cc98bc1f8fd27e97f021640d76c153b860
Author: zhangjiajin 
Date:   2015-07-24T15:39:33Z

Support non-temporal sequence in PrefixSpan




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-07-22 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7594#issuecomment-123922017
  
@mengxr OK. I will test two approaches (Array[Array[Int]] and Array[Int]). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-07-22 Thread zhangjiajin

GitHub user zhangjiajin opened a pull request:

https://github.com/apache/spark/pull/7594

[SPARK-8999][MLlib]Support non-temporal sequence in PrefixSpan

Support non-temporal sequence in PrefixSpan

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhangjiajin/spark multiItems

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7594


commit c6ceb63a557c1d9c3dcccf44a16ab32528b012f2
Author: zhangjiajin 
Date:   2015-07-07T07:30:10Z

Add new algorithm PrefixSpan and test file.

commit d9d8137c157374f9d463c0ee387536ed7448ca5f
Author: zhang jiajin 
Date:   2015-07-08T10:22:16Z

Delete Prefixspan.scala

Use PrefixSpan.scala instead of Prefixspan.scala. Delete Prefixspan.scala

commit c1d13d01f218b5f7ad3a68fa29202b2839090f7f
Author: zhang jiajin 
Date:   2015-07-08T10:23:31Z

Delete PrefixspanSuite.scala

Use PrefixSpanSuite.scala instead of PrefixspanSuite.scala, Delete 
PrefixspanSuite.scala.

commit a7e50d43fac419e1aba3e668b71b1d08bef0
Author: zhangjiajin 
Date:   2015-07-14T02:21:04Z

Add feature: Collect enough frequent prefixes before projection in 
PrefixSpan.

commit f06772fc1347fe412a1ccdb9c27df8d0573ca462
Author: zhangjiajin 
Date:   2015-07-14T02:46:05Z

fix a scala style error.

commit b572f54147652077dc198eb6ecf041b0ba8bc63e
Author: zhangjiajin 
Date:   2015-07-15T02:57:41Z

initialize file before rebase.

commit 9ed36d57ed82054d2be2fc30e7980aca4a90a0ba
Author: zhangjiajin 
Date:   2015-07-22T09:53:35Z

Support non-temporal sequence in PrefixSpan




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-20 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-122877168
  
@feynmanliang @mengxr I'm working on performance test, I compared the my 
first version, the performance has greatly improved. But the most recent update 
(add maxSuffixesBeforeLocalProcessing), the performance decrease very badly, 
please help me take a look at the code for the current, see how to optimize? 
Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-17 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r34943941
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,69 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var currentPrefixSuffixPairs = prefixSuffixPairs
--- End diff --

OK. Because the pairs maybe very big, so, I use 
persist(StorageLevel.MEMORY_AND_DISK).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-17 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7412#discussion_r34943942
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -82,20 +86,69 @@ class PrefixSpan private (
   logWarning("Input data is not cached.")
 }
 val minCount = getMinCount(sequences)
-val lengthOnePatternsAndCounts =
-  getFreqItemAndCounts(minCount, sequences).collect()
-val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-  lengthOnePatternsAndCounts.map(_._1), sequences)
-val groupedProjectedDatabase = prefixAndProjectedDatabase
-  .map(x => (x._1.toSeq, x._2))
+val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, 
sequences)
+val prefixSuffixPairs = getPrefixSuffixPairs(
+  lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+var patternsCount: Long = lengthOnePatternsAndCounts.count()
+var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => 
(ArrayBuffer(x._1), x._2))
+var currentPrefixSuffixPairs = prefixSuffixPairs
+var patternLength: Int = 1
+while (patternLength < maxPatternLength &&
+  patternsCount <= minPatternsBeforeLocalProcessing &&
+  currentPrefixSuffixPairs.count() != 0) {
+  val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+getPatternCountsAndPrefixSuffixPairs(minCount, 
currentPrefixSuffixPairs)
+  patternsCount = nextPatternAndCounts.count()
+  currentPrefixSuffixPairs = nextPrefixSuffixPairs
--- End diff --

OK. Because the pairs maybe very big, so, I use 
persist(StorageLevel.MEMORY_AND_DISK).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-17 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-122459599
  
@feynmanliang  If the original prefix is "ABC" before calling 
LocalPrefixSpan.run. After calling LocalPrefixSpan.run, the prefix is "EDABC", 
reverse it, the prefix is "CBADE", but the correct prefix is "ABCDE", it's 
wrong. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-17 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-122261569
  
@feynmanliang 
In the file "LocalPrefixSpan.scala", I have a question:

L48:   val newPrefixes = item :: prefixes
The elements in newPrefixes is reversed. 

Why not :  val newPrefixes = prefixes :+ item
What's the different ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-16 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-122143351
  
I'm confused, the groupBy just reorganize data, not generate new data, why 
does excutor overload after shuffling ?

The following diagrams are from paper "Mining Sequential Patterns by 
Pattern-Growth: The PrefixSpan Approach":


![image](https://cloud.githubusercontent.com/assets/13159256/8738656/262676ec-2c65-11e5-9c7e-5e79e5a03b38.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...

2015-07-16 Thread zhangjiajin

Github user zhangjiajin commented on the pull request:

https://github.com/apache/spark/pull/7412#issuecomment-121902394

@feynmanliang This assumes that all the values (suffixes) associated to a
key (prefix) will fit on an executor, but I don't think that patternsCount >
minPatternsBeforeShuffle will guarantee that. Better to count the suffixes for
each prefix using aggregateByKey before doing local processing.

The minPatternsBeforeLocalProcessing is related to the executors. The
suffixes threshold (call it maxSuffixesThreshold) is related with the input
sequences size. How to set the default value of maxSuffixesThreshold ?

You worry about a exector will overload. If an executor has multiple prefix
maybe reduce the influence of this problem. For example, the number of
executors is 4, the minPatternsBeforeLocalProcessing is 20, means one executor
have 5 random prefixes and its suffixes.

The following diagram show the two methods' different:

![234](https://cloud.githubusercontent.com/assets/13159256/8719603/74bf13c6-2bdf-11e5-9f5d-877fae4fc4c8.PNG)

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 139 matches

Mail list logo