[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467 ]

Fu Shanshan commented on SPARK-21359:
-------------------------------------

But why, in the example:

Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0))

QuantileDiscretizer result:

+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+

does the number 18 belong to bin 3? I thought it was because QuantileDiscretizer makes equal-width bins, so the bin array is (0, 5, 10, 15, 20) and 18 falls in the last bin. But in my result, 18 should be in bin 2: under the equal-frequency definition the bin array is (-inf, 5.0, 10.1, 19, inf or 20), so 18 is in bin 2 instead of the last bin.

I am not sure whether I have misunderstood this question. Thank you for your patience.

> frequency discretizer
> ---------------------
>
>                 Key: SPARK-21359
>                 URL: https://issues.apache.org/jira/browse/SPARK-21359
>             Project: Spark
>          Issue Type: New JIRA Project
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Fu Shanshan
>
> Typically data is discretized into partitions of K equal lengths/width (equal
> intervals) or K% of the total data (equal frequencies)

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
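
The question in the comment above can be checked by hand. The sketch below is plain Python, not Spark code, and compares three candidate sets of split points against the reported "result" column. The helper names (`rank_splits`, `assign`) are hypothetical. The rank rule in `rank_splits` is an assumption that happens to fit the table, and the thread does not confirm that QuantileDiscretizer chooses its splits this way.

```python
from bisect import bisect_right

# (id, hour) pairs quoted in the comment.
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0),
        (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0)]
# "result" column of the quoted table, in id order.
reported = [3, 3, 1, 1, 1, 0, 2, 2, 0, 2, 3, 3]


def rank_splits(values, k):
    """Interior split points taken at rank ceil(i * n / k) of the sorted data.

    ASSUMPTION: this rank rule is a guess that fits the quoted table; it is
    not taken from the Spark source or from the thread.
    """
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for i in range(1, k):
        rank = -(-i * n // k)          # integer ceiling of i * n / k
        splits.append(ordered[rank - 1])
    return splits


def assign(values, splits):
    """Bucket index per value, with buckets of the form [lower, upper)."""
    return [bisect_right(splits, v) for v in values]


hours = [hour for _, hour in data]

quantile_bins = assign(hours, rank_splits(hours, 4))  # splits 2.2, 9.1, 18.0
proposed_bins = assign(hours, [5.0, 10.1, 19.0])      # splits from the comment
width_bins = assign(hours, [5.0, 10.0, 15.0])         # equal-width guess

print(quantile_bins == reported)  # True
print(width_bins == reported)     # False: 2.2 would land in bin 0, not bin 1
print(proposed_bins[0])           # 2: with the comment's splits, 18.0 is in bin 2
```

Under these assumptions the table looks like a quantile-style result rather than an equal-width one: 18.0 is itself a split point, and because each bucket includes its lower bound, 18.0 opens the last bucket. The splits proposed in the comment give exactly three values per bucket, while the rank-based splits give bucket sizes 2, 3, 3 and 4.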
[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080235#comment-16080235 ]

Fu Shanshan commented on SPARK-21359:
-------------------------------------

So the difference between QuantileDiscretizer and my frequencyDiscretizer is the binning method.

QuantileDiscretizer: Equal Width Binning
The algorithm divides the data into k intervals of equal size. The width of the intervals is:
    w = (max - min) / k
and the interval boundaries are:
    min + w, min + 2w, ..., min + (k-1)w

FrequencyDiscretizer: Equal Frequency Binning
The algorithm divides the data into k groups, each containing approximately the same number of values.

For both methods, the best way to determine k is to look at the histogram and try different numbers of intervals or groups.
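
The two definitions in the comment above can be sketched in a few lines of plain Python. This is an illustration of the definitions as stated there, not Spark code and not the proposed FrequencyDiscretizer. The function names are hypothetical, and the sample values are the hour values from the example posted in this issue.

```python
def equal_width_boundaries(values, k):
    """Interior boundaries min + w, ..., min + (k-1)w with w = (max - min) / k."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]


def equal_frequency_groups(values, k):
    """Split the sorted values into k groups whose sizes differ by at most one."""
    ordered = sorted(values)
    n = len(ordered)
    # Group i covers ordered[i*n//k : (i+1)*n//k].
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


sample = [18.0, 19.0, 8.0, 5.0, 2.2, 1.0, 9.1, 10.1, 1.1, 16.0, 20.0, 20.0]

print(equal_width_boundaries(sample, 4))  # [5.75, 10.5, 15.25]
for group in equal_frequency_groups(sample, 4):
    print(group)                          # four groups of three values each
```

Note that this sketch splits purely by position in the sorted list, so tied values can end up in different groups; a real discretizer would need a rule for ties.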
[jira] [Created] (SPARK-21359) frequency discretizer
Fu Shanshan created SPARK-21359:
-----------------------------------

             Summary: frequency discretizer
                 Key: SPARK-21359
                 URL: https://issues.apache.org/jira/browse/SPARK-21359
             Project: Spark
          Issue Type: New JIRA Project
          Components: ML
    Affects Versions: 2.1.1
            Reporter: Fu Shanshan

Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies)