Re: [DISCUSS] SystemML Incubator Proposal

2015-10-27 Thread Luciano Resende
On Sun, Oct 25, 2015 at 11:02 PM, Henry Saputra 
wrote:

> Thanks Luciano, I got my answer, but it would probably help to
> distinguish the options of running it as an Apache Hadoop MapReduce or
> YARN application, and with the abstraction of Apache Spark.
>

Thanks for the feedback.
I have updated the proposal to say Hadoop MapReduce explicitly instead of
just mentioning Hadoop.


>
> Looking forward to the possibility of having it run with Apache Flink :)
>
> - Henry
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [DISCUSS] SystemML Incubator Proposal

2015-10-26 Thread Henry Saputra
Thanks Luciano, I got my answer, but it would probably help to
distinguish the options of running it as an Apache Hadoop MapReduce or
YARN application, and with the abstraction of Apache Spark.

Looking forward to the possibility of having it run with Apache Flink :)

- Henry





Re: [DISCUSS] SystemML Incubator Proposal

2015-10-24 Thread Adunuthula, Seshu
Hello Luciano,

I recently heard the presentation on SystemML at the Apache BigData
conference, and it sounds exciting. Looking forward to its Apache
incubation.

Regards
Seshu Adunuthula



Re: [DISCUSS] SystemML Incubator Proposal

2015-10-24 Thread Henry Saputra
I have one question about the proposal: it keeps mentioning that it
could run on "Hadoop or Spark", but technically Spark can run on
Hadoop YARN.
Was it trying to say it could be run on Hadoop YARN (maybe via
MapReduce) or Spark?

I would love to see if the execution abstraction is well enough
defined to be able to run it on other distributed frameworks like
Flink or Tez (maybe via Crunch?)

Thanks,

Henry


Re: [DISCUSS] SystemML Incubator Proposal

2015-10-24 Thread Luciano Resende
On Sat, Oct 24, 2015 at 11:31 AM, Henry Saputra 
wrote:

> I have one question about the proposal: it keeps mentioning that it
> could run on "Hadoop or Spark", but technically Spark can run on
> Hadoop YARN.
> Was it trying to say it could be run on Hadoop YARN (maybe via
> MapReduce) or Spark?
>
>
Exactly. If this is a point of confusion, I can clarify it in the proposal.


> I would love to see if the execution abstraction is well enough
> defined to be able to run it on other distributed frameworks like
> Flink or Tez (maybe via Crunch?)
>
>
Yes, this is definitely a possibility; we have talked about Flink before as
a possible next runtime.


> Thanks,
>
> Henry
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [DISCUSS] SystemML Incubator Proposal

2015-10-23 Thread Hitesh Shah
Hi Luciano, 

If you need any additional mentors, let me know. I would be interested in 
helping out. 

thanks
— Hitesh 



Re: [DISCUSS] SystemML Incubator Proposal

2015-10-23 Thread Henry Saputra
Hi Luciano,

Good proposal, but looks like
https://wiki.apache.org/incubator/SystemM does not exist?

Also, Reynold Xin and Patrick Wendell are not members of the IPMC, so I
don't think they can be mentors of this project yet.

They can ask to become IPMC members since both are already ASF members,
but for now they need to be removed from the proposal.


- Henry


Re: [DISCUSS] SystemML Incubator Proposal

2015-10-23 Thread Luciano Resende
On Fri, Oct 23, 2015 at 5:30 PM, Henry Saputra 
wrote:

> Hi Luciano,
>
> Good proposal, but looks like
> https://wiki.apache.org/incubator/SystemM does not exist?
>

Good catch. There is a typo in the original link: it is missing the L at
the end. Here is the correct link:

https://wiki.apache.org/incubator/SystemML



>
> Also, Reynold Xin and Patrick Wendell are not members of the IPMC, so I
> don't think they can be mentors of this project yet.
>
> They can ask to become IPMC members since both are already ASF members,
> but for now they need to be removed from the proposal.
>
>
>
Yes, they are aware of the requirement, and this will be fixed before we
call a vote on the proposal.



> - Henry
>

[DISCUSS] SystemML Incubator Proposal

2015-10-23 Thread Luciano Resende
We would like to start a discussion on accepting SystemML as an Apache
Incubator project.

The proposal is available at :
https://wiki.apache.org/incubator/SystemM

Its contents are also copied below.

Thanks in advance for your time reviewing and providing feedback.

==

= SystemML =

== Abstract ==

SystemML provides declarative large-scale machine learning (ML) that aims
at flexible specification of ML algorithms and automatic generation of
hybrid runtime plans ranging from single-node, in-memory computations to
distributed computations on Apache Hadoop and Apache Spark. ML algorithms
are expressed in an R-like syntax that includes linear algebra primitives,
statistical functions, and ML-specific constructs. This high-level language
significantly increases the productivity of data scientists as it provides
(1) full flexibility in expressing custom analytics, and (2) data
independence from the underlying input formats and physical data
representations. Automatic optimization according to data characteristics
(such as distribution on the disk file system and sparsity) and processing
characteristics of the distributed environment (such as the number of
nodes, CPU, and memory per node) ensures both efficiency and scalability.

== Proposal ==

The goal of SystemML is to create a commercially friendly, scalable and
extensible machine learning framework for data scientists to create or
extend machine learning algorithms using a declarative syntax. The machine
learning framework enables data scientists to develop algorithms locally
without the need for a distributed cluster, and scale up and scale out the
execution of these algorithms to distributed Hadoop or Spark clusters.

== Background ==

SystemML started as a research project at the IBM Almaden Research Center
around 2010, aiming to enable data scientists to develop machine learning
algorithms independent of data and cluster characteristics.

== Rationale ==

SystemML enables the specification of machine learning algorithms using a
declarative machine learning (DML) language. DML includes linear algebra
primitives, statistical functions, and additional constructs. This
high-level language significantly increases the productivity of data
scientists as it provides (1) full flexibility in expressing custom
analytics and (2) data independence from the underlying input formats and
physical data representations.

SystemML computations can be executed in a variety of modes. It supports
single-node in-memory computations and large-scale distributed cluster
computations. This allows the user to quickly prototype new algorithms in
local environments and then automatically scale to large data sizes
without changing the algorithm implementation.

Algorithms specified in DML are dynamically compiled and optimized based on
data and cluster characteristics using rule-based and cost-based
optimization techniques. The optimizer automatically generates hybrid
runtime execution plans ranging from in-memory single-node execution to
distributed computations on Spark or Hadoop. This ensures both efficiency
and scalability. Automatic optimization reduces or eliminates the need to
hand-tune distributed runtime execution plans and system configurations.
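
As a rough illustration of the hybrid-plan idea described above, a
cost-based choice between single-node and distributed execution could be
sketched as follows. This is a hypothetical example, not SystemML's actual
optimizer or API: the function names, the 4-byte index overhead, and the
memory budget are all assumptions made for the sketch.

```python
def estimate_memory_bytes(rows, cols, sparsity, bytes_per_value=8):
    """Estimate the in-memory size of a matrix (hypothetical cost model).

    A dense layout stores every cell; a sparse layout stores only the
    non-zero values plus an assumed 4-byte column index per non-zero.
    The cheaper of the two layouts is returned.
    """
    nnz = int(rows * cols * sparsity)
    dense = rows * cols * bytes_per_value
    sparse = nnz * (bytes_per_value + 4)
    return min(dense, sparse)


def choose_backend(rows, cols, sparsity, memory_budget_bytes):
    """Pick an execution mode from the data size and the local memory budget."""
    needed = estimate_memory_bytes(rows, cols, sparsity)
    return "single-node" if needed <= memory_budget_bytes else "distributed"


if __name__ == "__main__":
    budget = 2 * 1024**3  # assume 2 GiB is available on the local node
    # Small dense matrix: fits in local memory.
    print(choose_backend(10_000, 1_000, 1.0, budget))
    # Large dense matrix: exceeds the budget.
    print(choose_backend(10_000_000, 1_000, 1.0, budget))
    # Same shape but very sparse: fits again.
    print(choose_backend(10_000_000, 1_000, 0.001, budget))
```

The point of the sketch is only that the same algorithm can be routed to
different runtimes depending on data characteristics such as size and
sparsity, without the algorithm itself changing.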

== Initial Goals ==

The initial goal of moving SystemML to the Apache Incubator is to broaden
the community and foster contributions from data scientists to develop new
machine learning algorithms and enhance the existing ones. Ultimately, this
may lead to the creation of an industry standard in specifying machine
learning algorithms.

== Current Status ==

The initial code has been developed at the IBM Almaden Research Center in
California and has recently been made available on GitHub under the Apache
Software License 2.0. The project currently supports single-node
(in-memory) computation as well as distributed computations utilizing
Hadoop or Spark clusters.

=== Meritocracy ===

We plan to invest in supporting a meritocracy. We will discuss the
requirements in an open forum. Several companies have already expressed
interest in this project, and we intend to invite additional developers to
participate. We will encourage and monitor community participation so that
privileges can be extended to those who contribute, operating to the
standard of meritocracy that Apache emphasizes.

=== Community ===

The need for a generic, scalable and declarative machine learning approach
in open source is tremendous, so there is potential for a very large
community. We believe that SystemML’s extensible architecture, declarative
syntax, cost-based optimizer and its alignment with Spark will further
encourage community participation, not only in enhancing the infrastructure
but also in speeding up the creation of algorithms for a wide range of use
cases. We expect that over time SystemML will attract a large community.

=== Alignment ===

The initial committers strongly believe that a generic scalable and
declarative