Hey YanPing,

This addition is nice to see. I agree that there is a great opportunity for the Arrow and Mnemonic communities to collaborate. I look forward to working together.

Jacques

On Mon, Feb 22, 2016 at 3:01 PM, Wang, Yanping <yanping.w...@intel.com> wrote:

> Hi, All
>
> Based on feedback, we added the following to the Mnemonic proposal:
>
> ==== Relationships with Other Apache Products ====
> + Relationship with Apache™ Arrow:
> + Arrow's columnar data layout allows great use of CPU caches & SIMD. It
> places all data that is relevant to a column operation in a compact format
> in memory.
> +
> + Mnemonic directly puts whole business object graphs on external
> heterogeneous storage media, e.g. off-heap, SSD. It is not necessary to
> normalize the structures of object graphs for caching, checkpointing or
> storing. It doesn’t require developers to normalize their data object
> graphs. Mnemonic applications can avoid indexing & joining datasets
> compared to traditional approaches.
> +
> + Mnemonic can leverage Arrow to transparently re-layout qualified data
> objects or create special containers that are able to efficiently hold
> those data records in columnar form as one of the major performance
> optimization constructs.
> +
>
> Thanks
> Yanping
>
> -----Original Message-----
> From: Wang, Yanping [mailto:yanping.w...@intel.com]
> Sent: Sunday, February 21, 2016 11:47 AM
> To: general@incubator.apache.org
> Subject: [DISCUSS] Mnemonic incubator proposal
>
> Hi all
>
> We'd like to start a discussion regarding a proposal to submit Mnemonic to
> the Apache Incubator.
>
> The proposal text is available on the Wiki here:
> https://wiki.apache.org/incubator/MnemonicProposal
>
> and pasted below for convenience.
>
> We are excited to make this proposal, and look forward to the community's
> input!
>
> Best,
> Yanping
>
>
> = Mnemonic Proposal =
> === Abstract ===
> Mnemonic is a Java based non-volatile memory library for in-place
> structured data processing and computing.
> It is a solution for generic object and block persistence on
> heterogeneous block and byte-addressable devices, such as DRAM,
> persistent memory, NVMe, SSD, and cloud network storage.
>
> === Proposal ===
> Mnemonic is a structured data persistence in-memory in-place library for
> Java-based applications and frameworks. It provides unified interfaces for
> data manipulation on heterogeneous block/byte-addressable devices, such as
> DRAM, persistent memory, NVMe, SSD, and cloud network devices.
>
> The design motivation for this project is to create a non-volatile
> programming paradigm for in-memory data object persistence, in-memory data
> object caching, and JNI-less IPC.
> Mnemonic simplifies the usage of data object caching, persistence, and
> JNI-less IPC for massive object oriented structural datasets.
>
> Mnemonic defines Non-Volatile Java objects that store data fields in
> persistent memory and storage. During the program runtime, only methods and
> volatile fields are instantiated in the Java heap; Non-Volatile data fields
> are directly accessed via GET/SET operations to and from persistent memory
> and storage. Mnemonic avoids SerDes and significantly reduces the amount of
> garbage in the Java heap.
>
> Major features of Mnemonic:
> * Provides an abstract view for utilizing heterogeneous
> block/byte-addressable devices as a whole (e.g., DRAM, persistent memory,
> NVMe, SSD, HD, cloud network storage).
> * Provides seamless support for object oriented design and programming
> without the added burden of transforming object data into a different form.
> * Avoids object data serialization/de-serialization for data
> retrieval, caching and storage.
> * Reduces the consumption of on-heap memory, which in turn reduces and
> stabilizes Java Garbage Collection (GC) pauses for latency sensitive
> applications.
> * Overcomes current limitations of Java GC to manage much larger memory
> resources for massive dataset processing and computing.
> * Supports migrating the data usage model from traditional NVMe/SSD/HD to
> non-volatile memory with ease.
> * Uses a lazy loading mechanism to avoid unnecessary memory consumption
> when some data is not needed for computing immediately.
> * Bypasses JNI calls for the interaction between a Java runtime application
> and its native code.
> * Provides an allocation aware auto-reclaim mechanism to prevent external
> memory resource leaks.
>
>
> === Background ===
> Big Data and Cloud applications increasingly require both high throughput
> and low latency processing. Java-based applications targeting the Big Data
> and Cloud space should be tuned for better throughput, lower latency, and
> more predictable response time.
> Typically, there are some issues that impact BigData applications'
> performance and scalability:
>
> 1) The Complexity of Data Transformation/Organization: In most cases,
> during data processing, applications use their own complicated data caching
> mechanisms for SerDes of data objects, spilling to different storage and
> evicting large amounts of data. Some data objects contain complex values
> and structures that make data organization much more difficult.
> Loading and then parsing/decoding datasets from storage consumes
> significant system resources and computation power.
>
> 2) Lack of Caching, Burst Temporary Object Creation/Destruction Causes
> Frequent Long GC Pauses: Big Data computing/syntax generates large amounts
> of temporary objects during processing, e.g. lambda, SerDes, copying,
> etc. This triggers frequent long Java GC pauses to scan references, to
> update reference lists, and to copy live objects from one memory location
> to another blindly.
>
> 3) The Unpredictable GC Pause: Latency sensitive applications, such as
> databases, search engines, web query, and real-time/streaming computing,
> require latency/request-response to be kept under control.
> But current Java GC does not provide predictable GC activities with
> large on-heap memory management.
>
> 4) High JNI Invocation Cost: JNI calls are expensive, yet high performance
> applications usually try to leverage native code to improve performance.
> JNI calls need to convert Java objects into something that C/C++
> can understand. In addition, some comprehensive native code needs to
> communicate with Java based applications, which causes frequent JNI calls
> along with stack marshalling.
>
> The Mnemonic project provides a solution to address the above issues and
> performance bottlenecks for structured data processing and computing. It
> also simplifies massive data handling with much reduced GC activity.
>
> === Rationale ===
> There are strong needs for a cohesive, easy-to-use non-volatile programming
> model for unified heterogeneous memory resources management and allocation.
> The Mnemonic project provides a reusable and flexible framework to
> accommodate other special types of memory/block devices for better
> performance without changing client code.
>
> Most of the BigData frameworks (e.g., Apache Spark™, Apache™ Hadoop®,
> Apache HBase™, Apache Flink™, Apache Kafka™, etc.) have their own
> complicated memory management modules for caching and checkpoint. Many
> of these approaches increase complexity and make the code error-prone to
> maintain.
>
> We have observed heavy overheads during the operations of data parse,
> SerDes, pack/unpack, code/decode for data loading, storage, checkpoint,
> caching, marshal and transferring. Mnemonic provides a generic in-memory
> persistence object model to address those overheads for better performance.
> In addition, it manages its in-memory persistence objects and blocks in the
> way that GC does, which means their underlying memory resource is able to
> be reclaimed without explicitly releasing it.
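To make the "no SerDes" idea in the paragraph above concrete, here is a minimal sketch that uses only the JDK's java.nio.ByteBuffer. It is not Mnemonic's API: the class OffHeapPerson, its field layout, and the offsets are hypothetical names invented for illustration. Only the accessor object lives on the Java heap, while the field values are read and written in place in an off-heap buffer, so nothing is serialized or copied into heap objects.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical illustration only: a record whose fields live in an
 * off-heap buffer and are accessed in place. None of these names
 * come from Mnemonic.
 */
public class OffHeapPerson {
    // Assumed fixed layout of one record, in bytes.
    private static final int AGE_OFFSET = 0;      // int, 4 bytes
    private static final int BALANCE_OFFSET = 4;  // long, 8 bytes
    public static final int RECORD_SIZE = 12;

    private final ByteBuffer store; // off-heap backing memory
    private final int base;         // start of this record in the buffer

    public OffHeapPerson(ByteBuffer store, int index) {
        this.store = store;
        this.base = index * RECORD_SIZE;
    }

    // Getters and setters touch the backing memory directly (absolute access).
    public int getAge() { return store.getInt(base + AGE_OFFSET); }
    public void setAge(int age) { store.putInt(base + AGE_OFFSET, age); }
    public long getBalance() { return store.getLong(base + BALANCE_OFFSET); }
    public void setBalance(long v) { store.putLong(base + BALANCE_OFFSET, v); }

    public static void main(String[] args) {
        // Room for four records outside the Java heap.
        ByteBuffer store = ByteBuffer.allocateDirect(4 * RECORD_SIZE);

        OffHeapPerson p = new OffHeapPerson(store, 2);
        p.setAge(41);
        p.setBalance(1_000_000L);

        // A second accessor over the same slot sees the same data,
        // because the values were never copied into heap fields.
        OffHeapPerson sameSlot = new OffHeapPerson(store, 2);
        System.out.println(sameSlot.getAge() + " " + sameSlot.getBalance());
    }
}
```

A direct buffer is volatile memory, so this shows only the in-place access pattern. Persistence, and the GC-like reclamation described above, would need a persistent backing store and an allocator, which this sketch does not attempt.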
>
> Some existing Big Data applications suffer from poor Java GC behaviors
> when they process their massive unstructured datasets. Those behaviors
> either cause very long stop-the-world GC pauses or take significant system
> resources during computing, which impacts throughput and incurs significant
> perceivable pauses for interactive analytics.
>
> More and more computing intensive Big Data applications are moving
> to rely on JNI to offload their computing tasks to native code, which
> dramatically increases the cost of JNI invocation and IPC. Mnemonic
> provides a mechanism to communicate with native code directly through
> in-place object data update to avoid complex object data type conversion
> and stack marshaling. In addition, this project can be extended to support
> various lockers for threads between Java code and native code.
>
> === Initial Goals ===
> Our initial goal is to bring Mnemonic into the ASF and transition the
> engineering and governance processes to the "Apache Way." We would like to
> enrich a collaborative development model that closely aligns with current
> and future industry memory and storage technologies.
>
> Another important goal is to encourage efforts to integrate the
> non-volatile programming model into data centric processing/analytics
> frameworks/applications (e.g., Apache Spark™, Apache HBase™, Apache
> Flink™, Apache™ Hadoop®, Apache Cassandra™, etc.).
>
> We expect the Mnemonic project to continuously develop new
> functionality in an open, community-driven way. We envision accelerating
> innovation under ASF governance in order to meet the requirements of a wide
> variety of use cases for in-memory non-volatile and volatile data caching
> programming.
>
> === Current Status ===
> The Mnemonic project is available in Intel’s internal repository and
> managed by its designers and developers.
> It is also temporarily hosted on GitHub for general viewing:
> https://github.com/NonVolatileComputing/Mnemonic.git
>
> We have integrated this project with Apache Spark™ 1.5.0 and obtained a 2X
> performance improvement for the Spark™ MLlib k-means workload. In our
> experiments we observed the expected benefits of removing SerDes and a 40%
> reduction in total GC pause time.
>
> ==== Meritocracy ====
> Mnemonic was originally created by Gang (Gary) Wang and Yanping Wang in
> early 2015. The initial committers are the current Mnemonic R&D team
> members from US, China, and India Big Data Technologies Group at Intel.
> This group will form a base for a much broader community to collaborate on
> this code base.
>
> We intend to radically expand the initial developer and user community by
> running the project in accordance with the "Apache Way." Users and new
> contributors will be treated with respect and welcomed. By participating in
> the community and providing quality patches/support that move the project
> forward, they will earn merit. They also will be encouraged to provide
> non-code contributions (documentation, events, community management, etc.)
> and will gain merit for doing so. Those with a proven support and quality
> track record will be encouraged to become committers.
>
> ==== Community ====
> If Mnemonic is accepted for incubation, the primary initial goal is to
> transition the core community towards embracing the Apache Way of project
> governance. We would solicit major existing contributors to become
> committers on the project from the start.
>
> ==== Core Developers ====
> Mnemonic core developers are all skilled software developers and system
> performance engineers at Intel Corp with years of experience in their
> fields. They have contributed much code to Apache projects. PMC members
> and experienced committers from Apache Spark™, Apache HBase™, Apache
> Phoenix™, and Apache™ Hadoop® have been working with us on this project's
> open source efforts.
>
> === Alignment ===
> The initial code base is targeted at data centric processing and analysis
> in general. Mnemonic has been building connections and integrations with
> Apache projects and other projects.
>
> We believe Mnemonic will evolve to become a promising project for
> real-time processing, in-memory streaming analytics and more, along with
> current and future new server platforms with persistent memory as base
> storage devices.
>
> === Known Risks ===
> ==== Orphaned products ====
> Intel’s Big Data Technologies Group is actively working with the community
> on integrating this project into Big Data frameworks and applications. We
> are continuously adding new concepts and code to this project and
> supporting new use cases and features for the Apache Big Data ecosystem.
>
> The project contributors are leading contributors of Hadoop-based
> technologies and have long standing in the Hadoop community. As we are
> addressing major Big Data processing performance issues, there is minimal
> risk of this work becoming non-strategic and unsupported.
>
> Our contributors are confident that a larger community will be formed
> within the project in a relatively short period of time.
>
> ==== Inexperience with Open Source ====
> This project has long standing experienced mentors and interested
> contributors from Apache Spark™, Apache HBase™, Apache Phoenix™, Apache™
> Hadoop® to help us move through the open source process. We are actively
> working with experienced Apache community PMCs and committers to improve
> our project and further testing.
>
> ==== Homogeneous Developers ====
> All initial committers and interested contributors are employed at Intel.
> As this is an infrastructure memory project, a wide range of Apache
> projects are interested in an innovative memory project to fit large sized
> persistent memory and storage devices. Various Apache projects such as
> Apache Spark™, Apache HBase™, Apache Phoenix™, Apache Flink™, Apache
> Cassandra™ etc.
> can take good advantage of this project to overcome
> serialization/de-serialization, Java GC, and caching issues. We expect a
> wide range of interest will be generated after we open source this project
> to Apache.
>
> ==== Reliance on Salaried Developers ====
> All developers are paid by their employers to contribute to this project.
> We welcome all others to contribute to this project after it is open
> sourced.
>
> ==== Relationships with Other Apache Products ====
> + Relationship with Apache™ Arrow:
> + Arrow's columnar data layout allows great use of CPU caches & SIMD. It
> places all data that is relevant to a column operation in a compact format
> in memory.
> +
> + Mnemonic directly puts whole business object graphs on external
> heterogeneous storage media, e.g. off-heap, SSD. It is not necessary to
> normalize the structures of object graphs for caching, checkpointing or
> storing. It doesn’t require developers to normalize their data object
> graphs. Mnemonic applications can avoid indexing & joining datasets
> compared to traditional approaches.
> +
> + Mnemonic can leverage Arrow to transparently re-layout qualified data
> objects or create special containers that are able to efficiently hold
> those data records in columnar form as one of the major performance
> optimization constructs.
> +
>
> Mnemonic can be integrated into various Big Data and Cloud frameworks and
> applications.
> We are currently working on several Apache projects with Mnemonic:
>
> For Apache Spark™ we integrated Mnemonic to improve:
> a) Local checkpoints
> b) Memory management for caching
> c) Persistent memory datasets input
> d) Non-Volatile RDD operations
> The best use case for Apache Spark™ computing is when the input data is
> stored in the form of Mnemonic native storage, which avoids caching its row
> data for iterative processing. Moreover, Spark applications can leverage
> Mnemonic to perform data transformations in persistent or non-persistent
> memory without SerDes.
>
> For Apache™ Hadoop®, we are integrating HDFS Caching with Mnemonic instead
> of mmap. This will take advantage of persistent memory related features. We
> also plan to evaluate integrating Namenode Editlog and FSImage persistent
> data into the Mnemonic persistent memory area.
>
> For Apache HBase™, we are using Mnemonic for BucketCache and evaluating
> performance improvements.
>
> We expect Mnemonic will be further developed and integrated into many
> Apache BigData projects and so on, to enhance memory management solutions
> for much improved performance and reliability.
>
> ==== An Excessive Fascination with the Apache Brand ====
> While we expect the Apache brand to help attract more contributors, our
> interest in starting this project is based on the factors mentioned in the
> Rationale section.
>
> We would like Mnemonic to become an Apache project to further foster a
> healthy community of contributors and consumers in BigData technology R&D
> areas. Since Mnemonic can directly benefit many Apache projects and solves
> major performance problems, we expect the Apache Software Foundation to
> increase interaction with the larger community as well.
>
> === Documentation ===
> The documentation is currently available at Intel and will be posted
> under: https://mnemonic.incubator.apache.org/docs
>
> === Initial Source ===
> Initial source code is temporarily hosted on GitHub for general viewing:
> https://github.com/NonVolatileComputing/Mnemonic.git
> It will be moved to Apache http://git.apache.org/ after podling.
>
> The initial source is written in Java (88%) mixed with JNI C code (11%)
> and shell script (1%) for underlying native allocation libraries.
>
> === Source and Intellectual Property Submission Plan ===
> As soon as Mnemonic is approved to join the Incubator, the source code
> will be transitioned via the Software Grant Agreement onto ASF
> infrastructure and in turn made available under the Apache License, version
> 2.0.
>
> === External Dependencies ===
> The required external dependencies are all under Apache licenses or other
> compatible licenses.
> Note: The runtime dependent licenses of Mnemonic are all declared as
> Apache 2.0; the GNU licensed components are used for Mnemonic build and
> deployment. The Mnemonic JNI libraries are built using the GNU tools.
>
> maven and its plugins (http://maven.apache.org/) [Apache 2.0]
> JDK8 or OpenJDK 8 (http://java.com/) [Oracle or OpenJDK JDK License]
> Nvml (http://pmem.io) [optional] [Open Source]
> PMalloc (https://github.com/bigdata-memory/pmalloc) [optional] [Apache 2.0]
>
> Build and test dependencies:
> org.testng.testng v6.8.17 (http://testng.org) [Apache 2.0]
> org.flowcomputing.commons.commons-resgc v0.8.7 [Apache 2.0]
> org.flowcomputing.commons.commons-primitives v.0.6.0 [Apache 2.0]
> com.squareup.javapoet v1.3.1-SNAPSHOT [Apache 2.0]
> JDK8 or OpenJDK 8 (http://java.com/) [Oracle or OpenJDK JDK License]
>
> === Cryptography ===
> Project Mnemonic does not use cryptography itself; however, Hadoop
> projects use standard APIs and tools for SSH and SSL communication where
> necessary.
>
> === Required Resources ===
> We request that the following resources be created for the project to use:
>
> ==== Mailing lists ====
> priv...@mnemonic.incubator.apache.org (moderated subscriptions)
> comm...@mnemonic.incubator.apache.org
> d...@mnemonic.incubator.apache.org
>
> ==== Git repository ====
> https://github.com/apache/incubator-mnemonic
>
> ==== Documentation ====
> https://mnemonic.incubator.apache.org/docs/
>
> ==== JIRA instance ====
> https://issues.apache.org/jira/browse/mnemonic
>
> === Initial Committers ===
> * Gang (Gary) Wang (gang1 dot wang at intel dot com)
> * Yanping Wang (yanping dot wang at intel dot com)
> * Uma Maheswara Rao G (umamahesh at apache dot org)
> * Kai Zheng (drankye at apache dot org)
> * Rakesh Radhakrishnan Potty (rakeshr at apache dot org)
> * Sean Zhong (seanzhong at apache dot org)
> * Henry Saputra (hsaputra at apache dot org)
> * Hao Cheng (hao dot cheng at intel dot com)
>
> === Affiliations ===
> * Gang (Gary) Wang, Intel
> * Yanping Wang, Intel
> * Uma Maheswara Rao G, Intel
> * Kai Zheng, Intel
> * Rakesh Radhakrishnan Potty, Intel
> * Sean Zhong, Intel
> * Henry Saputra, Independent
> * Hao Cheng, Intel
>
> === Sponsors ===
> ==== Champion ====
> Patrick Hunt
>
> ==== Nominated Mentors ====
> * Patrick Hunt <phunt at apache dot org> - Apache IPMC member
> * Andrew Purtell <apurtell at apache dot org> - Apache IPMC member
> * James Taylor <jamestaylor at apache dot org> - Apache IPMC member
> * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
> ==== Sponsoring Entity ====
> Apache Incubator PMC
>