Thanks for starting this discussion, Matthias. Are there any features from SystemML that could be removed or deprecated when SystemDS is merged into the SystemML repository?
- Henry

On Sat, Mar 21, 2020 at 2:47 PM Matthias Boehm <mboe...@gmail.com> wrote:

> just FYI, we created a ticket for the suitable name search, and shared
> the related results [1]. So from my perspective, it really boils down to
> the question of whether we accept the closeness to 'Linux systemd'. Back
> in 2018 (when starting SystemDS), I came to the conclusion that it's fine
> because of the very different objectives and because SystemDS reflects
> both the origin from SystemML and its new focus on data science pipelines.
>
> [1] https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues
>
> Regards,
> Matthias
>
> On 3/9/2020 6:37 PM, Matthias Boehm wrote:
> > Hi all,
> >
> > as you're probably aware, development activities of Apache SystemML
> > significantly slowed down and were virtually non-existent in the last
> > year for various reasons. Part of that was that my team and I [1]
> > decided to start SystemDS [2,3] as a fork of SystemML in 09/2018 with a
> > new vision and roadmap for the future.
> >
> > During PMC discussions regarding the retirement of SystemML, we came to
> > the conclusion that the best path forward -- for the entire community
> > -- would be to merge SystemDS back into Apache SystemML, rename it to
> > SystemDS, and continue jointly. Before doing so, I want to share the
> > plan with the entire community.
> >
> > SystemDS aims at providing better systems support for the end-to-end
> > data science lifecycle, with a special focus on ML pipelines from data
> > integration, cleaning, and preparation, over efficient ML model
> > training, to model debugging and serving. A key observation is that
> > state-of-the-art data integration and cleaning primitives are themselves
> > based on machine learning. Our main objectives are to support effective
> > and efficient data preparation, ML training and debugging at scale,
> > something that cannot be composed from existing libraries. The game plan
> > includes three major parts:
> >
> > 1) DSL-based, High-level Abstractions: We aim to provide a hierarchy of
> > abstractions for the different lifecycle tasks as well as users with
> > different expertise (ML researchers, data scientists, domain experts),
> > based on our DSL for ML training and scoring. Exploratory data science
> > interleaves data preparation, ML training, scoring, and debugging in an
> > iterative process; and once these tasks are expressed in dense or sparse
> > linear algebra, we expect very good performance.
> >
> > 2) Hybrid Runtime Plans and Optimizing Compiler: To support the wide
> > variety of algorithm classes, we will continue to provide different
> > parallelization strategies, enriched by a new backend for federated ML
> > and privacy enhancing technologies. Since the hierarchy of language
> > abstractions inevitably leads to redundancy, we further aim to improve
> > the automatic optimization capabilities of the compiler and underlying
> > runtime.
> >
> > 3) Data Model - Heterogeneous Tensors: Supporting data integration and
> > cleaning primitives in linear algebra programs requires a more generic
> > data model for handling heterogeneous and structured data. In contrast
> > to existing ML systems, our central data model is heterogeneous
> > tensors. Thus, we generalize SystemML's FP64 matrices to
> > multi-dimensional arrays where one dimension may have a schema including
> > JSON strings to represent nested data.
> >
> > Admin: We intend to create the SystemDS 0.2 release in March. Afterwards
> > we would then rebase all our commits (369) back onto the SystemML
> > codeline. Subsequently, we will rename Apache SystemML to Apache
> > SystemDS and continue our development under the Apache umbrella. I just
> > went through the Apache name search guidelines and we'll perform a
> > 'suitable name search' accordingly and then transfer SystemDS. The
> > existing PMC and committer status stays of course intact unless people
> > want to leave. Shortly after the merge, I will nominate the four most
> > active contributors of the last year to become committers. Regarding
> > releases (and JIRA numbers), it's up for discussion, but both options,
> > continuing with SystemML versions (i.e., 1.3) or with SystemDS versions
> > (0.3), seem fine to me.
> >
> > Roadmap: At a technical level, SystemDS will continue to support all
> > operations and algorithms SystemML provided but significantly extend the
> > scope and functionality via the mentioned hierarchy of language
> > abstractions (in the form of builtin functions). However, during the
> > fork we already removed old baggage like the MR backend, the
> > script-level debugger, the PyDML frontend and several other things [4].
> > Major new internals are native support for lineage tracing and reuse,
> > the data model of heterogeneous tensors, and a new federated backend.
> >
> > [1] https://damslab.github.io/
> > [2] https://github.com/tugraz-isds/systemds
> > [3] http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
> > [4] https://github.com/tugraz-isds/systemds/releases/tag/v0.1.0
> >
> > Regards,
> > Matthias