[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-11-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657825#comment-15657825
 ] 

Timothy Hunter commented on SPARK-8884:
---

I do not have a strong preference either way. We should either
complete this feature (with DataFrame APIs) or close the open PR.



> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.
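
For readers following the description above, here is a sketch of how the statistic can be split across partitions. The first line is the standard one-sample Anderson-Darling statistic; the regrouping and the symbols o_p, m_p, S_p, U_p and V_p are introduced here for illustration and are an assumption about one workable decomposition, not necessarily the one used in the PR.

```latex
% Sorted sample x_{(1)} \le \dots \le x_{(n)}, hypothesized CDF F.
\begin{align}
A^2 &= -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1)
       \Bigl[ \ln F(x_{(i)}) + \ln\bigl(1 - F(x_{(n+1-i)})\bigr) \Bigr] \\
% Re-index the second term so that each observation appears exactly once.
    &= -n - \frac{1}{n} \sum_{i=1}^{n}
       \Bigl[ (2i-1) \ln F(x_{(i)}) + \bigl(2(n-i)+1\bigr) \ln\bigl(1 - F(x_{(i)})\bigr) \Bigr]
\end{align}

% Partition p holds m_p sorted values x_{p,1}, \dots, x_{p,m_p} and is preceded
% by o_p values in the global order, so the global rank is i = o_p + j.
\begin{align}
S_p &= \sum_{j=1}^{m_p} (2j-1) \Bigl[ \ln F(x_{p,j}) - \ln\bigl(1 - F(x_{p,j})\bigr) \Bigr], \\
U_p &= \sum_{j=1}^{m_p} \ln F(x_{p,j}), \qquad
V_p  = \sum_{j=1}^{m_p} \ln\bigl(1 - F(x_{p,j})\bigr), \\
A^2 &= -n - \frac{1}{n} \sum_{p} \Bigl[ S_p + 2\,o_p\,U_p + 2\,(n - o_p)\,V_p \Bigr]
\end{align}
```

S_p, U_p, V_p and m_p depend only on the contents of partition p; the offsets o_p and the total n are known once the per-partition counts have been collected.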



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-11-11 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657532#comment-15657532
 ] 

yuhao yang commented on SPARK-8884:
---

I thought it was closed because we have stopped development of mllib 
(the RDD-based API). 

If not, I'm willing to continue working on this.
If so, I can only try to convert it to the DataFrame-based API.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15656780#comment-15656780
 ] 

Sean Owen commented on SPARK-8884:
--

Maybe, but the new PR is apparently not moving ahead either.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-11-10 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655563#comment-15655563
 ] 

Timothy Hunter commented on SPARK-8884:
---

[~srowen] I believe this ticket should still be open? [~yuhaoyan] has an open 
PR for it.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-04-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248822#comment-15248822
 ] 

Joseph K. Bradley commented on SPARK-8884:
--

I'm not sure this will make 2.0, so I'm changing the target to 2.1.  [~mengxr] 
please retarget if needed.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199216#comment-15199216
 ] 

Apache Spark commented on SPARK-8884:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/11780

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>Priority: Minor
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-03-06 Thread Jose Cambronero (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182476#comment-15182476
 ] 

Jose Cambronero commented on SPARK-8884:


[~yuhaoyan] please do! I unfortunately got really busy in grad school last 
semester and this one, and was not able to keep following up on this. 



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-03-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182420#comment-15182420
 ] 

yuhao yang commented on SPARK-8884:
---

Hi [~josepablocam]. Do you mind if I continue to work on this? I think this is 
well written, but I might need to start another PR to finish it. Let me know if 
you still plan to work on it. Thanks.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2015-08-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652383#comment-14652383
 ] 

Joseph K. Bradley commented on SPARK-8884:
--

Modifying target to 1.6; please say if that's not OK.



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2015-07-09 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621474#comment-14621474
 ] 

Feynman Liang commented on SPARK-8884:
--

Hi [~sandyr] and [~josepablocam],

Do you mind providing some example use cases demonstrating applicability of 
this test to MLlib users? Is there a reference for the distributed algorithm 
you are implementing? What does this provide on top of KS?



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2015-07-09 Thread Jose Cambronero (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621491#comment-14621491
 ] 

Jose Cambronero commented on SPARK-8884:


You can find it in the R nortest library, and in SciPy's stats library. The use 
cases are the same as KS, with the advantage that it is better suited to 
detecting deviations at the tails of the distributions. It gives users an 
alternative to KS, a la "more than one way to skin a cat". 

The statistic is implemented as a sum, so the algorithm just decomposes 
that sum into 2 portions: one that we can calculate on a per-partition basis, 
and a remaining portion, which we scale by a factor and add in at the end. I 
can write up a clear step-by-step breakdown from the original formula to this 
one, if that is something people might find useful.
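
As a rough illustration of the two-portion idea, here is a minimal single-machine sketch in plain Python. It is not the code from the PR. It assumes the data is already globally sorted and split into contiguous chunks standing in for partitions, and that distribution parameters are given rather than estimated. The helper names (normal_cdf, ad_statistic, partition_summary, combine) are hypothetical. Each partition reports its count, a sum weighted by local ranks, and two plain log sums; the driver then scales the plain sums by the partition's global offset and adds everything up.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution with known (not estimated) parameters."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ad_statistic(sorted_xs, cdf):
    """Reference computation of A^2 over the whole sorted sample."""
    n = len(sorted_xs)
    total = 0.0
    for i, x in enumerate(sorted_xs, start=1):
        f = cdf(x)
        total += (2 * i - 1) * math.log(f) + (2 * (n - i) + 1) * math.log(1.0 - f)
    return -n - total / n


def partition_summary(part, cdf):
    """Per-partition portion: uses only local ranks j = 1..m, no global order."""
    s = u = v = 0.0
    for j, x in enumerate(part, start=1):
        f = cdf(x)
        log_f, log_c = math.log(f), math.log(1.0 - f)
        s += (2 * j - 1) * (log_f - log_c)  # local-rank weighted term
        u += log_f                          # to be scaled by the offset later
        v += log_c                          # to be scaled by (n - offset) later
    return len(part), s, u, v


def combine(summaries):
    """Driver-side adjustment: one summary per partition, in partition order."""
    n = sum(m for m, _, _, _ in summaries)
    offset = 0  # number of observations in earlier partitions
    total = 0.0
    for m, s, u, v in summaries:
        total += s + 2 * offset * u + 2 * (n - offset) * v
        offset += m
    return -n - total / n


if __name__ == "__main__":
    data = sorted([-1.2, -0.5, -0.1, 0.3, 0.4, 0.9, 1.7])
    parts = [data[:2], data[2:5], data[5:]]  # stand-ins for sorted partitions
    print(ad_statistic(data, normal_cdf))
    print(combine([partition_summary(p, normal_cdf) for p in parts]))
```

The two printed values should agree up to floating-point rounding. In a Spark setting the per-partition step would presumably map onto something like mapPartitions over a sorted RDD, but that mapping is an assumption here.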



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617826#comment-14617826
 ] 

Apache Spark commented on SPARK-8884:
-

User 'josepablocam' has created a pull request for this issue:
https://github.com/apache/spark/pull/7278
