[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657825#comment-15657825 ]

Timothy Hunter commented on SPARK-8884:
---

I do not have a strong preference either way. We should just either complete this feature (with DataFrame APIs) or close the open PR.

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Jose Cambronero
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add
> to the current hypothesis testing functionality. The current implementation
> supports various distributions (normal, exponential, Gumbel, logistic, and
> Weibull). However, users must provide distribution parameters for all except
> normal/exponential (in which case they are estimated from the data). In
> contrast to other tests, such as the Kolmogorov-Smirnov test, we only support
> specific distributions as the critical values depend on the distribution
> being tested.
> The distributed implementation of AD takes advantage of the fact that we can
> calculate a portion of the statistic within each partition of a sorted data
> set, independent of the global order of those observations. We can then carry
> some additional information that allows us to adjust the final amounts once
> we have collected 1 result per partition.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
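The issue description does not spell out how the statistic splits across partitions. One possible reconstruction is sketched below, assuming the textbook form of the A^2 statistic over the sorted sample x_(1) <= ... <= x_(n) with hypothesized CDF F. The symbols o_p (number of observations in partitions before p), m_p (size of partition p), and S_p, T_p, U_p are introduced here for illustration only and may not match the form used in the pull request.

```latex
\begin{align}
A^2 &= -n - \frac{1}{n}\sum_{i=1}^{n} (2i-1)\left[\ln F(x_{(i)}) + \ln\bigl(1 - F(x_{(n+1-i)})\bigr)\right] \\
    &= -n - \frac{1}{n}\sum_{i=1}^{n} \left[(2i-1)\ln F(x_{(i)}) + (2n-2i+1)\ln\bigl(1 - F(x_{(i)})\bigr)\right]
\end{align}

% Within partition p, write the global rank as i = o_p + k with local rank k = 1, ..., m_p,
% and let d_{p,k} = \ln F(x_{p,k}) - \ln(1 - F(x_{p,k})). Each term then becomes
\begin{align}
(2i-1)\ln F + (2n-2i+1)\ln(1-F) &= (2k-1)\,d_{p,k} + 2\,o_p\,d_{p,k} + 2n\,\ln\bigl(1 - F(x_{p,k})\bigr)
\end{align}

% so that, with per-partition sums
%   S_p = \sum_k (2k-1)\,d_{p,k},  T_p = \sum_k d_{p,k},  U_p = \sum_k \ln(1 - F(x_{p,k})),
\begin{align}
A^2 &= -n - \frac{1}{n}\sum_{p}\left[S_p + 2\,o_p\,T_p + 2n\,U_p\right]
\end{align}
```

In this form S_p, T_p, U_p, and m_p depend only on the local order within a partition, while o_p and n become known once one result per partition has been collected, which is consistent with the adjustment step the description mentions.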
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657532#comment-15657532 ]

yuhao yang commented on SPARK-8884:
---

I thought it was closed because we have stopped the development for mllib (RDD-based API). If not, I'm willing to continue to work on this. If yes, then I can only try to convert it to DataFrame-based API.
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15656780#comment-15656780 ]

Sean Owen commented on SPARK-8884:
--

Maybe, but the new PR is also apparently not moving ahead
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655563#comment-15655563 ]

Timothy Hunter commented on SPARK-8884:
---

[~srowen] this ticket should still be open I believe? [~yuhaoyan] has an open PR for it.
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248822#comment-15248822 ]

Joseph K. Bradley commented on SPARK-8884:
--

I'm not sure this will make 2.0, so I'm changing the target to 2.1. [~mengxr] please retarget if needed.
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199216#comment-15199216 ]

Apache Spark commented on SPARK-8884:
-

User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/11780

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Jose Cambronero
> Priority: Minor
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182476#comment-15182476 ]

Jose Cambronero commented on SPARK-8884:

[~yuhaoyan] please do! I unfortunately got really busy last semester and this in grad school and was not able to continue following up on this.
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182420#comment-15182420 ]

yuhao yang commented on SPARK-8884:
---

Hi [~josepablocam]. Do you mind if I continue to work on this? I think this is well-written yet I might need to start another PR to finish it. Let me know if you still plan to work on it. Thanks.
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652383#comment-14652383 ]

Joseph K. Bradley commented on SPARK-8884:
--

Modifying target to 1.6; please say if that's not OK.
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621474#comment-14621474 ]

Feynman Liang commented on SPARK-8884:
--

Hi [~sandyr] and [~josepablocam],

Do you mind providing some example use cases demonstrating applicability of this test to MLlib users? Is there a reference for the distributed algorithm you are implementing? What does this provide on top of KS?
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621491#comment-14621491 ]

Jose Cambronero commented on SPARK-8884:

You can find it in the R nortest library, and in SciPy's stats library. The use cases are the same as KS, with the advantage that it is better suited to detecting deviations at the tails of the distributions. It provides users an alternative to KS, a la more than one way to skin a cat.

The statistic is implemented as a sum, so the algorithm just decomposes that sum into 2 portions: one that we can calculate on a per-partition basis, and a remaining portion which we scale by a factor and add in at the end. I can write up a clear step-by-step breakdown from the original formula to this one, if that is something people might find useful.
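The split described in the comment above, a per-partition part plus a remainder that is scaled and added at the end, can be illustrated with a small single-machine sketch. This is not the code from the pull request: the function names (`partition_summary`, `combine`, `ad_statistic`, `normal_cdf`) are hypothetical, plain Python lists stand in for the partitions of a globally sorted data set, and a fully specified normal CDF is assumed.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution with known (not estimated) parameters."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ad_statistic(sorted_xs, cdf):
    """Textbook A^2 over one sorted sample, used here only as a reference."""
    n = len(sorted_xs)
    total = 0.0
    for i in range(1, n + 1):
        low = cdf(sorted_xs[i - 1])   # F(x_(i))
        high = cdf(sorted_xs[n - i])  # F(x_(n+1-i))
        total += (2 * i - 1) * (math.log(low) + math.log1p(-high))
    return -n - total / n


def partition_summary(part, cdf):
    """Sums that need only the local order inside one sorted partition."""
    s = t = u = 0.0
    for k, x in enumerate(part, start=1):
        f = cdf(x)
        log_1mf = math.log1p(-f)
        d = math.log(f) - log_1mf
        s += (2 * k - 1) * d  # weighted by the local rank k
        t += d                # later scaled by the partition's global offset
        u += log_1mf          # later scaled by the total count n
    return len(part), s, t, u


def combine(summaries):
    """Adjust per-partition sums with global offsets and n, then finish A^2."""
    n = sum(m for m, _, _, _ in summaries)
    total = 0.0
    offset = 0  # number of observations in earlier partitions
    for m, s, t, u in summaries:
        total += s + 2 * offset * t + 2 * n * u
        offset += m
    return -n - total / n


# Example: three "partitions" of an already sorted sample.
data = sorted([-1.2, -0.7, -0.3, 0.0, 0.1, 0.4, 0.9, 1.5, 2.1])
parts = [data[:3], data[3:7], data[7:]]
distributed = combine([partition_summary(p, normal_cdf) for p in parts])
reference = ad_statistic(data, normal_cdf)
print(distributed, reference)
```

The two printed values should agree up to floating-point rounding. Only four numbers per partition need to be collected, which matches the "1 result per partition" remark in the issue description; the sketch assumes the partitions are given in global sort order.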
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617826#comment-14617826 ]

Apache Spark commented on SPARK-8884:
-

User 'josepablocam' has created a pull request for this issue: https://github.com/apache/spark/pull/7278