Let's please stay on topic here. "ML system" is a well-established term,
and we aim to provide abstractions for different data science lifecycle
tasks, different users (ML researchers, data scientists, and later also
domain experts), and different deployments.
I think we all agree that we want to clean up the inconsistent
descriptions and consolidate them into a simple, easy-to-understand
phrase. Right now, consensus seems to be emerging around "Apache
SystemDS - An open source ML system for the end-to-end data science
lifecycle", but we will keep the thread open for a few more days to hear
other opinions on alternatives. Once we close this discussion, we would
then systematically fix all occurrences and make clear that nobody
(including committers and PMC members) changes it without discussion.
Regards,
Matthias
On 5/19/2021 2:33 PM, Janardhan wrote:
We shall put this one for formal voting once a suitable description(s)
is found. :)
Terms:
1. ML System
a. Does it mean the way our compiler understands the code and optimizes
it for algorithms (including, but not limited to, ML-specific algorithms)?
b. Or is it about the ML algorithms?
2. System vs Platform
We seem to have preferred "system" over "platform"!
3. Big data or data science
The software works fine for small to big data, so "big data" may not be
relevant.
---
Names of other related (in objectives) projects?
1. TensorFlow - An end-to-end open source machine learning platform
TensorFlow sticks to the ML pipeline [1]. Their pipeline roughly looks like this:
ML metadata -> Data validation -> Transform -> ML training -> Model
analysis -> serving/deployment
2. H2O.ai - H2O is a fully open source, distributed in-memory machine
learning platform with linear scalability
Strikingly, most of the H2O functionality is similar to SystemML.
Pipeline:
Load data -> Exploratory data analysis and feature selection -> Modeling,
model evaluation, & selection -> prediction
3. MXNet - A flexible and efficient library for deep learning.
Their GitHub description: "Lightweight, Portable, Flexible
Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow
Dep Scheduler"
[1] https://www.tensorflow.org/tfx
[2] https://www.h2o.ai/products/h2o/
[3] https://mxnet.apache.org/versions/1.8.0/
Thank you,
Janardhan
On Wed, May 19, 2021 at 12:06 AM Baunsgaard, Sebastian
<[email protected]> wrote:
+1 for : "Apache SystemDS - An open source ML system for the end-to-end
data science lifecycle"
The webpage has to be changed here:
https://github.com/apache/systemds-website/blob/master/_src/_includes/themes/apache/home.html
In the process, it would be good to go through the text on the main
webpage as well.
For instance, the first sentence describing SystemDS is:
"Apache SystemDS provides an optimal workplace for machine learning using
big data"
I would also like to point out that the graphical resources on the webpage
still contain SystemML, so we should remove or replace them.
Regards
Sebastian
________________________________
From: arnab phani <[email protected]>
Sent: Tuesday, May 18, 2021 8:02:44 PM
To: [email protected]
Subject: Re: [DISCUSS] SystemDS project description
I like "Apache SystemDS - An open source ML system for the end-to-end data
science lifecycle".
The only thing is that "open source" sounds a bit redundant given that the
name includes Apache.
But at places where "Apache" is not mentioned (e.g. PyPI), this description
is apt.
Regards,
Arnab..
On Tue, May 18, 2021 at 7:53 PM Matthias Boehm <[email protected]> wrote:
Thanks for initiating this discussion; there are indeed a couple of
things we need to clean up. Just for the future, please ask before
adding even more to this diversity (I understand you just recently
changed the github summary proactively without such discussion).
ad 1) DML stands for Declarative ML Language, and its design philosophy
is based on a declarative specification in terms of providing data
independence (abstract data types, no hard-coding of
dense/sparse/compressed) and implementation-agnostic operations (no
hard-coding of local vs distributed vs federated vs HW accelerator
operations).
ad 2) When merging SystemDS into Apache SystemDS, I changed the JIRA
summary to "Apache SystemDS - An open source ML system for the
end-to-end data science lifecycle" and I still like this best because we
want to have a stable name, independent of trends in underlying
execution models. As a side note, I always disliked the phrase "A machine
learning platform optimal for big data" (use of "optimal" and the big
data wording). However, this is just my opinion, and I think it's a good
point to discuss this once and for all (for the foreseeable future at
least). Any thoughts?
Regards,
Matthias
On 5/18/2021 4:18 PM, Janardhan wrote:
Hi all,
We are using different descriptions at various places. It would be
better to define each term more clearly. Sorry if I am asking something
obvious.
1. Which one should we use as the project description?
Note: Although the description given in the SystemDS research paper can
be considered, the paper was published before the merge into SystemML.
2. Also, what is the full form of DML?
a. Declarative Machine Learning Language
b. Descriptive Machine Learning Language
c. ..
Research paper [1]:
SystemDS: A Declarative Machine Learning System for the End-to-End Data
Science Lifecycle
GitHub
Apache SystemDS - A versatile system for the end-to-end data science
lifecycle
PyPI
SystemDS is a distributed and declarative machine learning platform.
systemds.apache.org
A machine learning platform optimal for big data
Jira
Apache SystemDS - An open source ML system for the end-to-end data
science lifecycle
---
SystemDS game plan [1] major parts:
1. DSL-based, High-level Abstractions: We aim to provide a hierarchy of
abstractions for the different lifecycle tasks as well as users with
different expertise
2. Hybrid Runtime Plans and Optimizing Compiler: To support the wide
variety of algorithm classes, we will continue to provide different
parallelization strategies, enriched by a new backend for federated ML
and privacy enhancing technologies.
3. Data Model - Heterogeneous Tensors: Supporting data integration and
cleaning primitives in linear algebra programs requires a more generic
data model for handling heterogeneous and structured data. In contrast
to existing ML systems, our central data models are heterogeneous tensors.
[1] https://arxiv.org/abs/1909.02976
[2] Roadmap discussion - https://s.apache.org/systemds-roadmap
Thank you,
Janardhan