Just FYI, we created a ticket for the suitable name search and shared the related results [1]. So from my perspective, it really boils down to the question of whether we accept the closeness to 'Linux systemd'. Back in 2018 (when starting SystemDS), I came to the conclusion that it's fine because of the very different objectives and because SystemDS reflects both its origin in SystemML and its new focus on data science pipelines.

[1] https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues

Regards,
Matthias

On 3/9/2020 6:37 PM, Matthias Boehm wrote:
Hi all,

as you're probably aware, development activities of Apache SystemML significantly slowed down and were virtually non-existent in the last year for various reasons. Part of that was that my team and I [1] decided to start SystemDS [2,3] as a fork of SystemML in 09/2018 with a new vision and roadmap for the future.

During PMC discussions regarding the retirement of SystemML, we came to the conclusion that the best path forward -- for the entire community -- would be to merge SystemDS back into Apache SystemML, rename it to SystemDS, and continue jointly. Before doing so, I want to share the plan with the entire community.

SystemDS aims at providing better systems support for the end-to-end data science lifecycle, with a special focus on ML pipelines from data integration, cleaning, and preparation, over efficient ML model training, to model debugging and serving. A key observation is that state-of-the-art data integration and cleaning primitives are themselves based on machine learning. Our main objectives are to support effective and efficient data preparation, ML training and debugging at scale, something that cannot be composed from existing libraries. The game plan includes three major parts:

1) DSL-based, High-level Abstractions: We aim to provide a hierarchy of abstractions for the different lifecycle tasks as well as users with different expertise (ML researchers, data scientists, domain experts), based on our DSL for ML training and scoring. Exploratory data science interleaves data preparation, ML training, scoring, and debugging in an iterative process; and once these tasks are expressed in dense or sparse linear algebra, we expect very good performance.

2) Hybrid Runtime Plans and Optimizing Compiler: To support the wide variety of algorithm classes, we will continue to provide different parallelization strategies, enriched by a new backend for federated ML and privacy-enhancing technologies. Since the hierarchy of language abstractions inevitably leads to redundancy, we further aim to improve the automatic optimization capabilities of the compiler and underlying runtime.

3) Data Model - Heterogeneous Tensors: Supporting data integration and cleaning primitives in linear algebra programs requires a more generic data model for handling heterogeneous and structured data. In contrast to existing ML systems, our central data model is heterogeneous tensors. Thus, we generalize SystemML's FP64 matrices to multi-dimensional arrays where one dimension may have a schema, including JSON strings to represent nested data.
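
To make the data model in 3) a bit more concrete, here is a minimal Python sketch of a two-dimensional heterogeneous tensor whose column dimension carries a schema. This is only an illustration under assumptions, not SystemDS code: the class name, the value type names, and the methods are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical value types; the names are illustrative, not SystemDS identifiers.
FP64, INT64, STRING, JSON = "fp64", "int64", "string", "json"

# How a raw cell is interpreted for each value type.
_PARSERS = {
    FP64: float,
    INT64: int,
    STRING: str,
    JSON: json.loads,  # nested data is stored as a JSON string, parsed on read
}


@dataclass
class HeterogeneousTensor2D:
    """Assumed 2D case: the column dimension carries the schema."""
    schema: list  # one value type per column
    rows: list    # list of rows, each a list of raw cell values

    def __post_init__(self):
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError("row length does not match schema")

    def get(self, i, j):
        """Return cell (i, j) interpreted according to column j's type."""
        return _PARSERS[self.schema[j]](self.rows[i][j])

    def to_fp64_matrix(self):
        """Project the numeric columns to a plain FP64 matrix."""
        cols = [j for j, t in enumerate(self.schema) if t in (FP64, INT64)]
        return [[float(row[j]) for j in cols] for row in self.rows]


t = HeterogeneousTensor2D(
    schema=[FP64, INT64, JSON],
    rows=[
        [1.5, 3, '{"tags": ["a", "b"]}'],
        [2.0, 7, '{"tags": []}'],
    ],
)
```

In this sketch, t.get(0, 2) yields the parsed nested value, while t.to_fp64_matrix() recovers the homogeneous FP64 special case by dropping the non-numeric column.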

Admin: We intend to create the SystemDS 0.2 release in March. Afterwards, we would rebase all our commits (369) back onto the SystemML codeline. Subsequently, we will rename Apache SystemML to Apache SystemDS and continue our development under the Apache umbrella. I just went through the Apache name search guidelines, and we'll perform a 'suitable name search' accordingly and then transfer SystemDS. The existing PMC and committer status of course stays intact unless people want to leave. Shortly after the merge, I will nominate the four most active contributors of the last year to become committers. Regarding releases (and JIRA numbers), it's up for discussion, but both options, continuing with SystemML versions (i.e., 1.3) or with SystemDS versions (0.3), seem fine to me.

Roadmap: At the technical level, SystemDS will continue to support all operations and algorithms SystemML provided but significantly extend the scope and functionality via the mentioned hierarchy of language abstractions (in the form of builtin functions). However, during the fork we already removed old baggage like the MR backend, the script-level debugger, the PyDML frontend, and several other things [4]. Major new internals are native support for lineage tracing and reuse, the data model of heterogeneous tensors, and a new federated backend.

[1] https://damslab.github.io/
[2] https://github.com/tugraz-isds/systemds
[3] http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
[4] https://github.com/tugraz-isds/systemds/releases/tag/v0.1.0

Regards,
Matthias
