[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Fu Shanshan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467
 ] 

Fu Shanshan commented on SPARK-21359:
-

But consider this example:
Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), 
(7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0)) 

QuantileDiscretizer result   
+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+

Here the value 18 belongs to bin 3. I think that is because it makes 
equal-width bins, so the bin array is (0, 5, 10, 15, 20) and 18 falls in the 
last bin. In my result, however, 18 should be in bin 2: under the 
equal-frequency definition the bin array is (-inf, 5.0, 10.1, 19, inf or 20), 
so 18 falls in bin 2 instead of the last bin.
I am not sure whether I have misunderstood the question. Thank you for your 
patience.
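
To make the comparison concrete, here is a minimal sketch in plain Python (not Spark code) that looks up the bin for 18 under each of the two bin arrays quoted above. The helper name `bin_index` is hypothetical, and it assumes half-open bins [lower, upper), so a value equal to a boundary goes to the upper bin.

```python
from bisect import bisect_right

def bin_index(value, interior_splits):
    """Return the 0-based bin for value, given sorted interior boundaries.

    Assumes half-open bins [lower, upper); the outer bounds
    (0 and 20, or -inf and inf) are implied and not listed.
    """
    return bisect_right(interior_splits, value)

# Interior boundaries of (0, 5, 10, 15, 20), the equal-width array from the comment.
equal_width = [5.0, 10.0, 15.0]
# Interior boundaries of (-inf, 5.0, 10.1, 19, inf), the equal-frequency array.
equal_frequency = [5.0, 10.1, 19.0]

print(bin_index(18.0, equal_width))      # 3
print(bin_index(18.0, equal_frequency))  # 2
```

This only illustrates the arithmetic of the two bin arrays; it makes no claim about how QuantileDiscretizer chooses its splits internally.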

> frequency discretizer
> -
>
> Key: SPARK-21359
> URL: https://issues.apache.org/jira/browse/SPARK-21359
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Fu Shanshan
>
> Typically data is discretized into partitions of K equal lengths/width (equal 
> intervals) or K% of the total data (equal frequencies)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Fu Shanshan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080235#comment-16080235
 ] 

Fu Shanshan commented on SPARK-21359:
-

So the difference between QuantileDiscretizer and my FrequencyDiscretizer 
is the binning method.

QuantileDiscretizer: Equal Width Binning
The algorithm divides the data into k intervals of equal size. The width of 
each interval is:
w = (max - min) / k

And the interval boundaries are:
min + w, min + 2w, ..., min + (k-1)w

FrequencyDiscretizer: Equal Frequency Binning
The algorithm divides the data into k groups, each containing approximately 
the same number of values.

For both methods, the best way to determine k is to look at the histogram 
and try different numbers of intervals or groups.
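
The two binning methods described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's implementation: the function names are hypothetical, the sample values are the `hour` column from the earlier comment in this thread, and the equal-frequency version simply takes the first value of each group in the sorted data as the boundary.

```python
def equal_width_splits(values, k):
    """Interior boundaries min + w, min + 2w, ..., min + (k-1)w with w = (max - min) / k."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]

def equal_frequency_splits(values, k):
    """Interior boundaries so that each of the k groups holds about len(values) / k values.

    Assumption: the boundary of group i is the first value of that group
    in sorted order; other conventions (e.g. midpoints) are also possible.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

hours = [18.0, 19.0, 8.0, 5.0, 2.2, 1.0, 9.1, 10.1, 1.1, 16.0, 20.0, 20.0]

print(equal_width_splits(hours, 4))      # [5.75, 10.5, 15.25]
print(equal_frequency_splits(hours, 4))  # [5.0, 10.1, 19.0]
```

With k = 4 the equal-frequency boundaries match the bin array (-inf, 5.0, 10.1, 19, inf) given in the other comment, while the equal-width boundaries depend only on the minimum and maximum of the data.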







[jira] [Created] (SPARK-21359) frequency discretizer

2017-07-10 Thread Fu Shanshan (JIRA)
Fu Shanshan created SPARK-21359:
---

 Summary: frequency discretizer
 Key: SPARK-21359
 URL: https://issues.apache.org/jira/browse/SPARK-21359
 Project: Spark
  Issue Type: New JIRA Project
  Components: ML
Affects Versions: 2.1.1
Reporter: Fu Shanshan


Typically data is discretized into partitions of K equal lengths/width (equal 
intervals) or K% of the total data (equal frequencies)


