[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467 ]

Fu Shanshan commented on SPARK-21359:
-------------------------------------

But why, in this example:

Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0))

QuantileDiscretizer result

+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+

does the number 18 belong to bin 3? I thought this was because it makes equal-width bins, so that the bin array is (0, 5, 10, 15, 20) and 18 falls in the last bin. In my result, 18 should be in bin 2: under the equal-frequency definition the bin array is (-inf, 5.0, 10.1, 19, inf or 20), so 18 falls in bin 2 instead of the last bin.

I am not sure whether I have misunderstood the question. Thank you for your patience.

> frequency discretizer
> ---------------------
>
>                 Key: SPARK-21359
>                 URL: https://issues.apache.org/jira/browse/SPARK-21359
>             Project: Spark
>          Issue Type: New JIRA Project
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Fu Shanshan
>
> Typically data is discretized into partitions of K equal lengths/width (equal
> intervals) or K% of the total data (equal frequencies)

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
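For readers following the arithmetic, the result column in the table above can be reproduced with a small plain-Python sketch. It rests on two assumptions made only for illustration, not on a reading of Spark's source: each cut point is the sorted value at rank ceil(p * n) for p = 1/4, 2/4, 3/4, and buckets are half-open [lo, hi), so a value equal to a cut point goes to the upper bucket. The helper names `quantile_splits` and `assign_bin` are hypothetical.

```python
import math
from bisect import bisect_right

# The "hour" column from the comment, in id order.
hours = [18.0, 19.0, 8.0, 5.0, 2.2, 1.0, 9.1, 10.1, 1.1, 16.0, 20.0, 20.0]


def quantile_splits(values, k):
    """Interior cut points at the i/k quantiles (assumed rule: rank ceil(p * n))."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[math.ceil(i * n / k) - 1] for i in range(1, k)]


def assign_bin(x, splits):
    """Half-open buckets [lo, hi): a value equal to a cut point lands in the upper bucket."""
    return bisect_right(splits, x)


splits = quantile_splits(hours, 4)
bins = [assign_bin(x, splits) for x in hours]

# Bin of 18.0 under the cut points proposed in the comment, and under equal-width edges.
bin_with_comment_splits = assign_bin(18.0, [5.0, 10.1, 19.0])
bin_with_equal_width = assign_bin(18.0, [5.0, 10.0, 15.0])

print(splits)  # [2.2, 9.1, 18.0]
print(bins)    # [3, 3, 1, 1, 1, 0, 2, 2, 0, 2, 3, 3]
```

Under these assumptions the third cut point is 18.0 itself, which is why 18 lands in bin 3 without any equal-width binning being involved. With the cut points (5.0, 10.1, 19) from the comment, 18 would land in bin 2 instead.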
[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080339#comment-16080339 ]

Sean Owen commented on SPARK-21359:
-----------------------------------

No, what you are describing is pretty much the definition of quantiles. That's what QuantileDiscretizer does. It does not make equal-width bins.
[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080235#comment-16080235 ]

Fu Shanshan commented on SPARK-21359:
-------------------------------------

So the difference between QuantileDiscretizer and my frequencyDiscretizer is the binning method.

QuantileDiscretizer: Equal Width Binning
The algorithm divides the data into k intervals of equal size. The width of the intervals is:
w = (max - min) / k
and the interval boundaries are:
min + w, min + 2w, ..., min + (k-1)w

FrequencyDiscretizer: Equal Frequency Binning
The algorithm divides the data into k groups, each containing approximately the same number of values.

For both methods, the best way to determine k is to look at the histogram and try different numbers of intervals or groups.
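The two binning schemes described in this comment can be contrasted with a short plain-Python sketch. It is not Spark code and makes no claim about which Spark class implements which scheme. The function names and the skewed sample data are made up for illustration.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Interior boundaries min + w, min + 2w, ..., min + (k-1)w with w = (max - min) / k."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]


def equal_frequency_groups(values, k):
    """Split the sorted values into k groups of (approximately) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical skewed sample: seven small values and one outlier.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 41.0]
k = 4

edges = equal_width_edges(data, k)

# Count how many values fall into each equal-width interval.
width_counts = [0] * k
for x in data:
    width_counts[bisect_right(edges, x)] += 1

groups = equal_frequency_groups(data, k)

print(edges)         # [11.0, 21.0, 31.0]
print(width_counts)  # [7, 0, 0, 1]
print(groups)        # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 41.0]]
```

On skewed data the equal-width intervals leave two bins empty, while the equal-frequency groups each hold two values.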
[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080080#comment-16080080 ]

Sean Owen commented on SPARK-21359:
-----------------------------------

I can't understand from this code what the functionality is supposed to do, or why it's useful, or why you can't use the existing discretizer. This needs to be much better elaborated or closed.
[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080060#comment-16080060 ]

Apache Spark commented on SPARK-21359:
--------------------------------------

User 'Shanshan-IC' has created a pull request for this issue:
https://github.com/apache/spark/pull/18585