[ 
https://issues.apache.org/jira/browse/STATISTICS-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Barnhill updated STATISTICS-8:
-----------------------------------
    Description: 
Apache Commons is one of the most widely used resources among Java programmers 
around the world. Data-related applications are soaring, and Java is one of the 
most commonly used languages for data engineering. Consequently, the 
commons-statistics library, currently under development, is likely to find a 
widespread audience.

For this project we aim to implement regression methods, arguably the most 
widely used techniques in statistics and machine learning, within the Apache 
Commons framework, in particular within the new commons-statistics library.

The assignee will:
 * Use core functionality from the regression sub-libraries of the deprecated 
commons-math 4 framework as a starting point
 * Create a new, standalone commons component for regression statistics, 
focusing first on linear and logistic regression
 * Make architectural and design decisions in the commons philosophy, that is, 
lightweight standalone components easy to understand and use by a wide range of 
Java developers (i.e. not a large, omnibus mathematical library with many 
degrees of abstraction)
 * Draw inspiration from widely used tools such as scikit-learn and R to design 
an up-to-date statistics package
 * Design unit testing and documentation for these libraries

Particularly challenging design decisions include how to incorporate core 
matrix libraries with a minimum of dependencies and redundancies.

We see this project as potentially having a large impact on big data 
applications. Java and the JVM are fundamental to popular data engineering 
tools like Hadoop and Spark. Regression analyses, however, are often handled 
downstream, on the other side of the "data fence", by tools like Python and R. 
A robust and scalable pure-Java regression library, easily visible and 
accessible through Apache Commons, can better integrate both sides of this 
data divide by enabling many machine learning steps to be programmed at scale 
on the Java side.

  was:
Apache commons is one of the most widely used resources by Java programmers 
around the world. Data related applications are soaring and Java is one of the 
most commonly used languages for data engineering. Consequently the 
commons-statistics library, currently under development, is likely to find a 
widespread audience.

For this project we aim to implement regression methods, arguably the most 
widely used techniques in statistics and machine learning, within the Apache 
commons framework, in particular within the new commons-statistics library.

The assignee will:
 * Use core functionality from the regression sub-libraries of the deprecated 
commons-math 4 framework as a starting point
 * Create a new, standalone commons component for regression statistics, 
focusing first on linear and logistic regression
 * Make architectural and design decisions in the commons philosophy, that is, 
lightweight standalone components easy to understand and use by a wide range of 
Java developers (i.e. not a large, omnibus mathematical library with many 
degrees of abstraction)
 * Draw inspiration from widely used libraries in scikit-learn and R to design 
an up-to-date statistics package
 * Design unit testing and documentation for these libraries

Particularly challenging design decisions include how to incorporate core 
matrix libraries with a minimum of dependencies and redundancies.


> Implementation of regression libraries within common-statistics framework
> -------------------------------------------------------------------------
>
>                 Key: STATISTICS-8
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-8
>             Project: Apache Commons Statistics
>          Issue Type: Task
>            Reporter: Eric Barnhill
>            Priority: Major
>
> Apache Commons is one of the most widely used resources among Java 
> programmers around the world. Data-related applications are soaring, and 
> Java is one of the most commonly used languages for data engineering. 
> Consequently, the commons-statistics library, currently under development, 
> is likely to find a widespread audience.
> For this project we aim to implement regression methods, arguably the most 
> widely used techniques in statistics and machine learning, within the Apache 
> Commons framework, in particular within the new commons-statistics library.
> The assignee will:
>  * Use core functionality from the regression sub-libraries of the deprecated 
> commons-math 4 framework as a starting point
>  * Create a new, standalone commons component for regression statistics, 
> focusing first on linear and logistic regression
>  * Make architectural and design decisions in the commons philosophy, that 
> is, lightweight standalone components easy to understand and use by a wide 
> range of Java developers (i.e. not a large, omnibus mathematical library with 
> many degrees of abstraction)
>  * Draw inspiration from widely used tools such as scikit-learn and R to 
> design an up-to-date statistics package
>  * Design unit testing and documentation for these libraries
> Particularly challenging design decisions include how to incorporate core 
> matrix libraries with a minimum of dependencies and redundancies.
> We see this project as potentially having a large impact on big data 
> applications. Java and the JVM are fundamental to popular data engineering 
> tools like Hadoop and Spark. Regression analyses, however, are often handled 
> downstream, on the other side of the "data fence", by tools like Python and 
> R. A robust and scalable pure-Java regression library, easily visible and 
> accessible through Apache Commons, can better integrate both sides of this 
> data divide by enabling many machine learning steps to be programmed at 
> scale on the Java side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
