[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-10 Thread Littlestar (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354450#comment-14354450 ]

Littlestar commented on SPARK-6239:
---

I just want to set minCount=2 for a test with 10*1*1 records.

FPGrowthModel model = new FPGrowth()
  .setMinSupport(2 / (10 * 1.0 * 1.0))
  .setNumPartitions(500)
  .run(maps);

I think using minCount would be better than minSupport.

> Spark MLlib fpm#FPGrowth minSupport should use long instead
> ---
>
> Key: SPARK-6239
> URL: https://issues.apache.org/jira/browse/SPARK-6239
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Littlestar
>Priority: Minor
>
> Spark MLlib fpm#FPGrowth minSupport should use long instead
> ==
> val minCount = math.ceil(minSupport * count).toLong
> because:
> 1. [count] the number of records in the dataset is not known before it is read.
> 2. [minSupport] double precision makes an exact absolute threshold awkward to express.
> from mahout#FPGrowthDriver.java:
> addOption("minSupport", "s", "(Optional) The minimum number of times a
> co-occurrence must be present." + " Default Value: 3", "3");
> I just want to set minCount=2 for testing.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-10 Thread Littlestar (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354829#comment-14354829 ]

Littlestar commented on SPARK-6239:
---

When using FPGrowthModel, the number of input records is unknown before the data is read.
I think we should either change the meaning of minSupport or add a setMinCount.

If I want to set minCount=2, I must use .setMinSupport(1.99/rdd.count()),
because of double precision in
val minCount = math.ceil(minSupport * count).toLong
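A minimal, self-contained sketch (plain Java, hypothetical count values) of the conversion Spark performs and why the 1.99 trick is robust:

```java
public class MinSupportDemo {
    // Mirrors Spark's conversion: val minCount = math.ceil(minSupport * count).toLong
    static long minCount(double minSupport, long count) {
        return (long) Math.ceil(minSupport * count);
    }

    public static void main(String[] args) {
        long count = 10L; // hypothetical dataset size

        // Passing exactly desired/count can be fragile: the division and the
        // re-multiplication each round, so the product may land a hair above
        // the intended integer, and ceil would then bump the threshold up by one.
        System.out.println(minCount(2.0 / count, count));

        // The 1.99 workaround: 1.99/count * count stays strictly below 2.0
        // for any positive count, so ceil always yields exactly 2.
        System.out.println(minCount(1.99 / count, count)); // 2
    }
}
```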




[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-10 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354837#comment-14354837 ]

Sean Owen commented on SPARK-6239:
--

Conversely, if the only available param were an absolute value, you would have 
to know the size of the input in order to express support as a fraction. You do 
know the input size via {{count()}} and can do just as you say. I think you 
could argue for this parameter either way, but the established API chooses a 
fraction, and changing it just exchanges a little convenience in one use case 
for another.
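The pattern described above can be sketched as follows (plain Java; `minSupportFor` is a hypothetical helper, and the hard-coded size stands in for what `transactions.count()` would return on an RDD):

```java
public class FractionFromCount {
    // Hypothetical helper: derive the fractional minSupport from a desired
    // absolute count, once the input size is known.
    static double minSupportFor(long desiredMinCount, long numTransactions) {
        return (double) desiredMinCount / numTransactions;
    }

    public static void main(String[] args) {
        // With an RDD this would come from transactions.count(), an extra pass.
        long n = 4L;
        System.out.println(minSupportFor(2L, n)); // 0.5
    }
}
```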




[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-30 Thread Tomasz Bartczak (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386924#comment-14386924 ]

Tomasz Bartczak commented on SPARK-6239:


I also stumbled upon this little inconvenience in the API.

My points in the discussion:
1. FPGrowth internally uses a Long value, minCount (see 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L120
 and 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala#L146),
 so it is more efficient to specify that directly without doing a count.
2. A good API can be used in multiple use cases. The PR 
https://github.com/apache/spark/pull/5246 adds 'minCount' as an option while 
keeping the existing API untouched.

Why would it be a bad idea to include such an option?
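A hypothetical sketch (not the actual code from pull/5246) of how an absolute threshold could coexist with the fractional one, with the last setter called taking precedence:

```java
public class FPGrowthParams {
    private double minSupport = 0.3;  // fraction of transactions (existing API)
    private long minCount = -1L;      // absolute threshold; -1 means "unset"

    public FPGrowthParams setMinSupport(double minSupport) {
        this.minSupport = minSupport;
        this.minCount = -1L;          // fractional setting takes over again
        return this;
    }

    public FPGrowthParams setMinCount(long minCount) {
        this.minCount = minCount;
        return this;
    }

    // Resolve to the absolute count the algorithm actually uses internally.
    long resolveMinCount(long numTransactions) {
        return minCount >= 0
                ? minCount
                : (long) Math.ceil(minSupport * numTransactions);
    }
}
```

With an explicit minCount, no count() pass over the data is needed at all; the fractional path keeps its current behavior.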




[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-30 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386933#comment-14386933 ]

Sean Owen commented on SPARK-6239:
--

Just a little API overhead for even less gain, IMHO. I'm open to other 
committers' opinions, but it just didn't seem worth changing. I would imagine a 
relative value is more often useful.




[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-30 Thread Littlestar (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387748#comment-14387748 ]

Littlestar commented on SPARK-6239:
---

>> If I want to set minCount=2, I must use .setMinSupport(1.99/rdd.count()),
>> because of double precision.

How can this issue be reopened and linked to pull/5246? Thanks.





[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-04-03 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394247#comment-14394247 ]

Apache Spark commented on SPARK-6239:
-

User 'kretes' has created a pull request for this issue:
https://github.com/apache/spark/pull/5246
