[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Fu Shanshan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467
 ] 

Fu Shanshan commented on SPARK-21359:
-

But then why do I get the result below for this example?
Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), 
(7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0)) 

QuantileDiscretizer result:
+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+

The value 18 is assigned to bin 3. I assumed that is because QuantileDiscretizer
makes equal-width bins: the bin edges would be (0, 5, 10, 15, 20), so 18 falls
in the last bin.
With my equal-frequency definition, the bin edges are
(-inf, 5.0, 10.1, 19, inf or 20), so 18 should fall in bin 2 rather than the
last bin.
I am not sure whether I have misunderstood the question. Thank you for your
patience.
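
The two hypotheses above can be checked against the posted result column. The
following is a minimal Python sketch, not Spark code: the helper `assign_bins`
is hypothetical, and it assumes left-closed bins [split_i, split_{i+1}) with
only the interior edges passed in as splits.

```python
from bisect import bisect_right

# (id, hour) pairs and the result column exactly as posted above.
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0),
        (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0)]
reported = [3, 3, 1, 1, 1, 0, 2, 2, 0, 2, 3, 3]


def assign_bins(values, interior_splits):
    """Hypothetical helper: index of the left-closed bin each value falls in."""
    return [bisect_right(interior_splits, v) for v in values]


hours = [hour for _, hour in data]

# Hypothesis 1: equal-width edges (0, 5, 10, 15, 20).
equal_width = assign_bins(hours, [5.0, 10.0, 15.0])
# Hypothesis 2: equal-frequency edges (-inf, 5.0, 10.1, 19, inf).
equal_freq = assign_bins(hours, [5.0, 10.1, 19.0])

print("equal width    :", equal_width)
print("equal frequency:", equal_freq)
print("reported       :", reported)
print("reported matches equal width?    ", equal_width == reported)
print("reported matches equal frequency?", equal_freq == reported)
```

Under these assumptions neither set of edges reproduces the reported column.
The equal-width edges put 16.0 in bin 3 where the report says 2, and both sets
of edges put 2.2 in bin 0 where the report says 1. So the posted output is not
by itself evidence of equal-width binning.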

> frequency discretizer
> -
>
> Key: SPARK-21359
> URL: https://issues.apache.org/jira/browse/SPARK-21359
> Project: Spark
> Issue Type: New JIRA Project
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Fu Shanshan
>
> Typically data is discretized into partitions of K equal lengths/width (equal 
> intervals) or K% of the total data (equal frequencies)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080339#comment-16080339
 ] 

Sean Owen commented on SPARK-21359:
---

No, what you are describing is pretty much the definition of quantiles. That's 
what QuantileDiscretizer does. It does not make equal-width bins.
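
To illustrate the point with the twelve `hour` values quoted elsewhere in this
thread, here is a minimal Python sketch of quantile-based splits. It is not
Spark's implementation: the nearest-rank rule and the helper
`nearest_rank_splits` are assumptions made for illustration, and
QuantileDiscretizer computes approximate quantiles, so its splits can differ.

```python
from bisect import bisect_right
from collections import Counter

# The hour values from the example in this thread, in id order.
hours = [18.0, 19.0, 8.0, 5.0, 2.2, 1.0, 9.1, 10.1, 1.1, 16.0, 20.0, 20.0]


def nearest_rank_splits(values, k):
    """Hypothetical helper: interior splits at the i/k quantiles (nearest rank)."""
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for i in range(1, k):
        rank = (i * n + k - 1) // k      # ceil(i * n / k), 1-based rank
        splits.append(ordered[rank - 1])
    return splits


splits = nearest_rank_splits(hours, 4)
# Left-closed bins: a value equal to a split goes into the higher bin.
bins = [bisect_right(splits, h) for h in hours]
counts = Counter(bins)

print("splits:", splits)
print("bins  :", bins)
print("counts:", [counts[b] for b in range(4)])
```

With this convention the splits are 2.2, 9.1 and 18.0, so 18.0 itself lands in
bin 3, and the assignments coincide with the result column posted in the
thread. The bin sizes are 2, 3, 3 and 4: roughly equal counts rather than
equal-width intervals. Whether Spark arrived at exactly these splits is an
assumption.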







[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Fu Shanshan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080235#comment-16080235
 ] 

Fu Shanshan commented on SPARK-21359:
-

So the difference between QuantileDiscretizer and my frequencyDiscretizer
is the binning method.

QuantileDiscretizer: equal-width binning.
The algorithm divides the range of the data into k intervals of equal width.
The width of each interval is:
w = (max - min) / k

The interval boundaries are:
min + w, min + 2w, ..., min + (k-1)w

FrequencyDiscretizer: equal-frequency binning.
The algorithm divides the data into k groups, each containing approximately
the same number of values.

For both methods, the best way to choose k is to look at the histogram and try
different numbers of intervals or groups.
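
The two definitions above can be sketched in a few lines of Python. This is
only an illustration of the formulas as written in this comment, not code from
Spark or from the pull request: the function names are hypothetical, and the
sample values are the `hour` column quoted elsewhere in this thread.

```python
def equal_width_boundaries(values, k):
    """Interior boundaries min + w, ..., min + (k-1)w with w = (max - min) / k."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]


def equal_frequency_groups(values, k):
    """Split the sorted values into k groups of (approximately) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


hours = [18.0, 19.0, 8.0, 5.0, 2.2, 1.0, 9.1, 10.1, 1.1, 16.0, 20.0, 20.0]

boundaries = equal_width_boundaries(hours, 4)
groups = equal_frequency_groups(hours, 4)

print("equal-width boundaries:", boundaries)
for index, group in enumerate(groups):
    print("equal-frequency group", index, ":", group)
```

For these values the equal-width boundaries are 5.75, 10.5 and 15.25 (w = 4.75),
while equal-frequency binning yields four groups of three values each, with
18.0 in the third group.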







[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080080#comment-16080080
 ] 

Sean Owen commented on SPARK-21359:
---

I can't understand from this code what the functionality is supposed to do, or 
why it's useful, or why you can't use the existing discretizer. This needs to 
be much better elaborated or closed.







[jira] [Commented] (SPARK-21359) frequency discretizer

2017-07-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080060#comment-16080060
 ] 

Apache Spark commented on SPARK-21359:
--

User 'Shanshan-IC' has created a pull request for this issue:
https://github.com/apache/spark/pull/18585



