Hi, it sounds like an interesting projects. Several IPMC already mentioned they would like to be mentor. So I think it's on the right track.
Regards JB On 29/10/2018 06:51, Willem Jiang wrote: > It's look like a very interesting project. I'd like to be your mentor :) > Please ping me if you have any question about incubating process, I'd > like to share my journey with you. > > Willem Jiang > > Twitter: willemjiang > Weibo: 姜宁willem > > On Mon, Oct 29, 2018 at 8:35 AM Xiangdong Huang <hxd...@qq.com> wrote: >> >> Dear Apache Incubator Community, >> >> >> I would like to open up a discussion about incubating IoTDB at Apache. IoTDB >> is a database for managing large amounts of time series data from IoT >> sensors in industrial applications. >> >> >> The proposal is available as a draft at >> https://wiki.apache.org/incubator/IoTDBProposal . I have also included the >> text of the proposal below. >> >> >> >> >> = IoTDB Proposal = >> v0.1 >> >> >> == Abstract == >> IoTDB is a database for managing large amounts of time series data such as >> timestamped data from IoT sensors in industrial applications. >> >> >> == Proposal == >> IoTDB is a database for managing large amount of time series data with >> columnar storage, data encoding, pre-computation, and index techniques. It >> has SQL-like interface to write millions of data points per second per node >> and is optimized to get query results in few seconds over trillions of data >> points. It can also be easily integrated with Apache Hadoop MapReduce and >> Apache Spark for analytics. >> >> >> == Background == >> >> >> A new class of data management system requirements is becoming increasingly >> important with the rise of the Internet of Things. There are some database >> systems and technologies aimed at time series data management. For example, >> Gorilla and InfluxDB which are mainly built for data centers and monitoring >> application metrics. Other systems, for example, OpenTSDB and KairosDB, are >> built on Apache HBase and Apache Cassandra, respectively. >> >> >> However, many applications for time series data management have more >> requirements especially in industrial applications as follows: >> >> >> * Supporting time series data which has high data frequency. For example, a >> turbine engine may generate 1000 points per second (i.e., 1000Hz), while >> each CPU only reports 1 data points per 5 seconds in a data center >> monitoring application. >> >> >> * Supporting scanning data multi-resolutionally. For example, aggregation >> operation is important for time series data. >> >> >> * Supporting special queries for time series, such as pattern matching, >> time series segmentation, time-frequency transformation and frequency query. >> >> >> * Supporting a large number of monitoring targets (i.e. time series). An >> excavator may report more than 1000 time series, for example, revolving >> speed of the motor-engine, the speed of the excavator, the accelerated >> speed, the temperature of the water tank and so on, while a CPU or an >> application monitor has much fewer time series. >> >> >> * Optimization for out-of-order data points. In the industrial sector, it >> is common that equipment sends data using the UDP protocol rather than the >> TCP protocol. Sometimes, the network connect is unstable and parts of the >> data will be buffered for later sending. >> >> >> * Supporting long-term storage. Historical data is precious for equipment >> manufacturers. Therefore, removing or unloading historical data is highly >> desired for most industrial applications. The database system must not only >> support fast retrieval of historical data, but also should guarantee that >> the historical data does not impact the processing speed for “hot” or >> current data. >> >> >> * Supporting online transaction processing (OLTP) as well as complex >> analytics. It is obvious that supporting analyzing from the data files using >> Apache Spark/Apache Hadoop MapReduce directly is better than transforming >> data files to another file format for Big Data analytics. >> >> >> * Flexible deployment either on premise or in the cloud. IoTDB is as >> simple and can be deployed on a Raspberry Pi handling hundreds of time >> series. Meanwhile, the system can be also deployed in the cloud so that it >> supports tens of millions ingestions per second, OLTP queries in >> milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce. >> >> >> * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a wind >> turbine, or a meteorological station, the deployment of the chosen database >> is designed to be simple. A device may have hundreds of time series (but >> less than a thousand time series) and the database needs to handle them. >> * * (2) When deploying IoTDB in a data center, the computational resources >> (i.e., the hardware configuration of servers) is not a problem when compared >> to a Raspberry Pi. In this deployment, IoTDB can use more computation >> resources, and has the ability to handle more time seires (e.g., millions of >> time series). >> >> >> Based on these requirements, we developed IoTDB, a new data store system for >> managing time series data. >> >> >> IoTDB started as a Tsinghua University research project. IoTDB's developer >> community has also grown to include additional institutions, for example, >> universities (e.g., Fudan University), research labs (e.g, NEL-BDS lab), and >> corporations (e.g., K2Data, Tencent). Funding has been provided by various >> institutions including the National Natural Science Foundation of China, and >> industry sponsors, such as Lenovo and K2Data. >> >> >> == Rationale == >> Because there is no existed open-sourced time series databases covering all >> the above requirements, we developed IoTDB. As the system matures, we are >> seeking a long-term home for the project. We believe the Apache Software >> Foundation would be an ideal fit. Also joining Apache will help coordinate >> and improve the development effort of the growing number of organizations >> which contribute to IoTDB improving the diversity of our community. >> >> >> IoTDB contains multiple modules, which are classified into categories: >> >> >> * '''TsFile Format''': TsFile is a new columnar file format. >> * '''Adaptor for Analytics and Visualization''': Integrating TsFile with >> Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples of >> integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also >> provided. >> * '''IoTDB Engine''': An engine which consists of SQL parser, query plan >> generator, memtable, authentication and authorization,write ahead log (WAL), >> crash recovery, out-of-order data handler, and index for aggregation and >> pattern matching. The engine stores system data in TsFile format. >> * '''IoTDB JDBC''': An implementation of Java Database Connectivity (JDBC) >> for clients to connect to IoTDB using Java. >> >> >> === TsFile Format === >> >> >> TsFile format is a columnar store, which is similar with Apache Parquet and >> Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, Page >> and Footer. Comparing with Apache Parquet and Apache CarbonData, it is >> designed and optimized for time series: >> >> >> ==== Time Series Friendly Encoding ==== >> IoTDB currently supports run length encoding (RLE), delta-of-delta encoding, >> and Facebook's Gorilla encoding. >> >> >> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and >> time-frequency transformation are works-in-progress. >> >> >> >> >> ==== Chunk Group ==== >> The data part of a TsFile consists of many Chunk Groups. Each Chunk Group >> stores the data of a device at a time interval. A Chunk Group is similar to >> the row group in Apache Parquet, while there are some constraints of the >> time dimension: For each device, the time intervals of different Chunk >> Groups are not overlapped and the latter Chunk Group always has a larger >> timestamp. >> >> >> Given a TsFile and a query with a time range filter, the query process can >> terminate scanning data once it reads data points whose timestamp reaches >> the time limit of the filter. We call the feature ''fast-return'' and it >> makes the time range query in a TsFile very efficient. >> >> >> >> >> >> >> ==== Different Column Chunk Format (Unnecessary the Repetition (R) and >> Definition (D) Fields) ==== >> >> >> While Apache Parquet and Apache CarbonData support complex data types, e.g., >> nested data and sparse columns, TsFile is exclusively designed for time >> series whose data model is \<device_id, series_id, timestamp, value\>. >> >> >> In a `Chunk Group`, each time series is a `Column Chunk`. Even though these >> time series belong to the same device, the data points in different time >> series are not aligned in the time dimension originally. >> >> >> For example, if you have a device with 2 sensors on the same data collection >> frequencies, sensor 1 may collect data at time 1521622662000 while the other >> one collects data at time 1521622662001 (delta=1ms). Therefore, each Column >> Chunk has its timestamps and values, which is quite different from Apache >> Parquet and Apache CarbonData. Because we store the time column along with >> each value column instead of making different chunks share the same time >> column for the sake of diverse data frequency for different time series, we >> do not store any null value on disk to align across time series. Besides, we >> do not need to attach `repetition` (R) and `definition` (D) fields on each >> value. Therefore, the disk space is saved and the query latency is reduced >> (because we do not align data by calculating R and D fields). >> >> >> >> >> ==== Domain Specific Information in Each Page ==== >> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` consists >> of several `Pages`, and each `Page` has a `Page header`. The `Page header` >> is a summary of the data in the page. >> >> >> Because TsFile is optimized for time series, the page header contains more >> domain specific information, such as the minimal and maximal value, the >> minimal and the maximal timestamp, the frequency and so on. TsFile can even >> store the histogram of values in the page header. >> >> >> This header information helps IoTDB in speeding up queries by skipping >> unnecessary pages. >> >> >> >> >> === Adaptor for Analytics === >> The TsFile provides: >> >> >> * InputFormat/OutputFormat interfaces for Reading/Writing data. >> * Deep integration with Apache Spark/Hadoop MapReduce including predicate >> push-down, column pruning, aggregation push down, etc. So users can use >> Apache Spark SQL/HiveQL to connect and query TsFiles. >> >> >> >> >> === IoTDB Engine === >> The IoTDB engine is a database engine, which uses TsFile as its storage file >> format. The IoTDB Engine supports SQL-like query plus many useful functions: >> >> >> * Tree-based time series schema >> * Log-Structured Merge (LSM)-based storage >> * Overflow file for out-of-order data >> * Scalable index framework >> * Special queries for time series >> >> >> ==== Tree-based Time Series Schema ==== >> IoTDB manages all the time series definitions using a tree structure. A path >> from the root of the tree to a leaf node represents a time series. >> Therefore, the unique id of a time series is a path, e.g., >> `root.China.beijing.windFarm1.windTurbine1.speed`. >> >> >> This kind of schema can express `group by` naturally. For example, >> `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind >> turbines in wind farm 1 in Beijing, China. >> >> >> ==== Log-Structured Merge (LSM)-based Storage ==== >> In a time series, the data points should be ordered by their timestamps. In >> IoTDB, we use Log-Structured Merge (LSM) based mechanism. Therefore, a part >> of the data is stored in memory first and can be called as `memtable`. At >> this time, if data points come out-of-order, we resort them in memory. When >> this part of data exceeds the configured memory limit, we flush it on disk >> as a `Chunk Group` into an unclosed TsFile. Finally, a TsFile may contain >> several Chunk Groups, for reducing the number of small data files, which is >> helpful to reduce the I/O load of the storage system and reduces the >> execution time of a file-merge in LSM. Notice that the data is time-ordered >> in one Chunk Group on disk, and this layout is helpful for fast filtering in >> one Chunk Group for a query. >> >> >> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by timestamp >> (Rule 1), and it is helpful for fast filtering among Chunk Groups for a >> query. >> >> >> Rule 2: When the size of the unclosed TsFile reaches the threshold defined >> in the configuration file, we close the file and generate a new one to store >> new arriving data spanning the entire data set. Like many systems which use >> LSM-based storage, we never modify a TsFile which has been closed except for >> the file-merge process (Rule 2). >> >> >> Rule 3: To reduce the number of TsFiles involved in a query process, we >> guarantee that the data points in different TsFiles are not overlapping on >> the time dimension after file mergence (Rule 3). >> >> >> ==== Overflow File for Out-of-order Data ==== >> When a part of data is flushed on disk (and will form a `Chunk Group` in a >> TsFile), the newly arriving data points whose timestamps are smaller than >> the largest timestamp in the Tsfile are `out-of-order`. >> >> >> To store the out-of-order data, we organize all the troublesome >> `out-of-order` data point insertions into a special TsFile, named >> `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device >> may be overlapping in the time dimension, which violates the Rule 1 and >> costs additional time compared to a normal TsFile for query filtering. >> >> There is another special operation: updating all the data points in a time >> range, e.g., `update all the speed values of device1 as 0 where the data >> time is in [1521622000000, 1521622662000]`. The operation is called when: >> (1) a sensor malfunctions and the database receives wrong data for a period; >> (2) we may want to reset all the records. Many NoSQL time series databases >> do not support such an operation. To support the operation in IoTDB, we use >> a tree-based structure, Treap, to store this part of operations and store >> them as `Overflow` files. >> >> >> Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles and >> Overflow files. TsFiles should store most of the data. The volume of >> UnSequenceTsFiles depends on the workload: if there are too many >> out-of-order and the time span of out-of-order is huge, the volume will be >> large. Overflow files handle fewest data operations but will depend on the >> use of the special operations. >> >> >> ==== LSM-tree ==== >> Normally, LSM-based storage engines merge data files level by level so that >> it looks like a tree structure. In this way, data is well organized. The >> disadvantage is that data will be read and written several times. If the >> tree has 4 levels, each data point will be rewritten at least 4 times. >> >> >> Currently, we do not merge all the TsFiles into one because (1) the number >> of TsFiles is kept lower than many LSM storage engines because a memtable is >> mapped to several Chunk Groups rather than a file; (2) different TsFiles are >> not overlapping with each other in the time dimension (because of Rule 3). >> >> >> As mentioned before, TsFile supports ''fast-return'' to accelerate queries. >> However, UnSequenceTsFile and Overflow files do not allow this feature. The >> time spans of UnSequenceTsFile, Overflow file andTsFile may be overlapped, >> which leads to more files involved in the query process. To accelerate these >> queries, there is a merging process to reorganize files in the background. >> All the three kinds of files: TsFiles, UnSequenceTsFiles and Overflow files, >> are involved in the merging process. The merging process is implemented >> using multi-threading, while each thread is responsible for a series family. >> After merging, only TsFiles are left. These files have non-overlapping time >> spans and support the ''fast-return'' feature. >> >> >> ==== Scalable Index Framework ==== >> We allow users to implement indexes for faster queries. We currently support >> an index for pattern matching query (KV-Match index, ICDE 2019). Another >> index for fast aggregation (PISA index, CIKM 2016) is a work-in-progress. >> >> >> ==== Special Queries ==== >> We currently support `group by time interval` aggregation queries and `Fill >> by` operations, which are similar to those of InfluxDB. Time series >> segmentation operations and frequency queries are work-in-progress. >> >> >> == Initial Goals == >> The initial goals are to be open sourced and to integrate with the Apache >> development process. Furthermore, we plan for incremental development, and >> releases along with the Apache guidelines. >> >> >> == Current Status == >> We have developed the system for more than 2 years. There are currently 13k >> lines of code, some of which are generated by Antlr3 and Thrift. There are >> 230 issues which have been solved and more than 1500 commits. >> >> >> The system has been deployed in the staging environment of the State Grid >> Corporation of China to handle ~3 million time series (i.e, ~30,000 power >> generation assembly * ~100 sensors) and an equipment service company in >> China managing ~2 million time series (i.e, ~20k devices * 100 sensors). The >> insertion speed reaches ~2 million points/second/node, which is faster than >> InfluxDB, OpenTSDB and Apache Cassandra in our environment. >> >> >> There are many new features in the works including those mentioned herein. >> We will add more analytics functions, improve the data file merge process, >> and finish the first released version of IoTDB. >> >> >> == Meritocracy == >> The IoTDB project operates on meritocratic principles. Developers who submit >> more code with higher quality earn more merit. We have used `Issues` and >> `Pull Requests` modules on Github for collecting users' suggestions and >> patches. Users who submit issues, pull requests, documents and help the >> community management are welcomed and encouraged to become committers. >> >> >> == Community == >> >> >> The IoTDB project users communicate on Github >> (https://github.com/thulab/tsfile) . Developers make the communication on a >> website which is similar with JIRA (Currently, only registered users can >> apply to access the project for communication, url: >> https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have also >> introduced IoTDB at many technical conferences. Next, we will build the >> mailing list for more convenience, broader communication and archived >> discussions. >> >> >> If IoTDB is accepted for incubation at the Apache Software Foundation, the >> primary goal is to build a larger community. We believe that IoTDB will >> become a key project for time series data management, and so, we will rely >> on a large community of users and developers. >> >> >> TODO: IoTDB is currently on a private Github repository >> (https://github.com/thulab/iotdb), while its subproject TsFile (a file >> format for storing time series data) is open sourced on Github >> (https://github.com/thulab/tsfile). >> >> >> == Core Developers == >> IoTDB was initially developed by 2 dozen of students and teachers at >> Tsinghua University. Now, more and more developers have joined coming from >> other universities: Fudan University, Northwestern Polytechnical University >> and Harbin Institute of Technology in China. Other developers come from >> business companies such as Lenovo and Microsoft. We will be working to bring >> more and more developers into the project making contributions to IoTDB. >> >> >> == Relationships with Other Apache Products == >> IoTDB requires some Apache products (Apache Thrift, commons, collections, >> httpclient). >> >> >> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed for >> supporting analysing time series data by using Apache Spark and MapReduce. >> >> >> Overall, IoTDB is designed as an open architecture, and it can be integrated >> with many other systems in the future. >> >> >> As mentioned before, in the IoTDB project, we designed a new columnar file >> format, called TsFile, which is similar to Apache Parquet. However, the new >> file format is optimized for time series data. >> >> >> >> >> >> >> == Known Risks == >> >> >> === Orphaned Products === >> Given the current level of investment in IoTDB, the risk of the project >> being abandoned is minimal. Time series data is more and more important and >> there are several constituents who are highly inspired to continue >> development. Tsinghua and NEL-BDS Lab relies on IoTDB as a platform for a >> large number of long-term research projects. We have deployed IoTDB in some >> company's staging environments for future applications. >> >> >> === Inexperience with Open Source === >> Students and researchers in Tsinghua University have been developing and >> using open source software for a long time. It is wonderful to be guided to >> join a formal open-source process for students. Some of our committers >> have experiences contributing to open source, for example: >> >> >> * druid: >> https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44 >> * druid: >> https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794 >> * YCSB: https://github.com/brianfrankcooper/YCSB/pull/776 >> >> >> Additionally, several ASF veterans and industry veterans have agreed to >> mentor the project and are listed in this proposal. The project will rely on >> their guidance and collective wisdom to quickly transition the entire team >> of initial committers towards practicing the Apache Way. >> >> >> >> >> === Reliance on Salaried Developers === >> Most of current developers are students and researchers/professors in >> universities, and their researches focus on big data management and >> analytics. It is unlikely that they will change their research focus away >> from big data management. We will work to ensure that the ability for the >> project to continuously be stewarded and to proceed forward independent of >> salaried developers is continued. >> >> >> === An Excessive Fascination with the Apache Brand === >> Most of the initial developers come from Tsinghua University with no intent >> to use the Apache brand for profit. We have no plans for making use of >> Apache brand in press releases nor posting billboards advertising acceptance >> of IoTDB into Apache Incubator. >> >> >> >> >> == Initial Source == >> IoTDB's github address and some required dependencies: >> >> >> * The storage file format: https://github.com/thulab/tsfile >> * Adaptor for Apache Hadoop MapReduce: >> https://github.com/thulab/tsfile-hadoop-connector >> * Adaptor for Apache Spark: https://github.com/thulab/tsfile-spark-connector >> * Adaptor for Grafana: https://github.com/thulab/iotdb-grafana >> * The database engine: https://github.com/thulab/iotdb (private project up >> to now) >> * The client driver: https://github.com/thulab/iotdb-jdbc >> >> >> >> >> === External Dependencies === >> To the best of our knowledge, all dependencies of IoTDB are distributed >> under Apache compatible licenses. Upon acceptance to the incubator, we would >> begin a thorough analysis of all transitive dependencies to verify this fact >> and introduce license checking into the build and release process. >> >> >> == Documentation == >> * Documentation for TsFile: https://github.com/thulab/tsfile/wiki >> * Documentation for IoTDB and its JDBC: http://tsfile.org/document >> (Chinese only. An English version is in progress.) >> >> >> == Required Resources == >> === Mailing Lists === >> * priv...@iotdb.incubator.apache.org >> * d...@iotdb.incubator.apache.org >> * comm...@iotdb.incubator.apache.org >> >> >> === Git Repositories === >> * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git >> >> >> === Issue Tracking === >> * JIRA IoTDB (We currently use the issue management provided by Github to >> track issues.) >> >> >> >> >> == Initial Committers == >> Tsinghua University, K2Data Company, Lenovo, Fundan University, Microsoft >> >> >> Jianmin Wang ( jimwang at tsinghua dot edu dot cn ) >> >> >> Jun Yuan (richard_yuan16 at 163 dot com ) >> >> >> Chen Wang ( wang_chen at tsinghua dot edu dot cn) >> >> >> Xiangdong Huang (sainthxd at gmail dot com) >> >> >> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn) >> >> >> Jinrui Zhang (jinrzhan at microsoft dot com) >> >> >> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn) >> >> >> Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn) >> >> >> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn) >> >> >> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn) >> >> >> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn) >> >> >> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn) >> >> >> Yi Xu(x-y16 at mails dot tsinghua dot edu dot cn) >> >> >> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn) >> >> >> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn) >> >> >> Tianan Li(lta18 at mails dot tsinghua dot edu dot cn) >> >> >> Yue Su (suy18 at mails dot tsinghua dot edu dot cn) >> >> >> Wangminhao Gou(gwmh18 at mails dot tsinghua dot edu dot cn) >> >> >> >> >> == Sponsors == >> === Champion === >> Kevin A. McGrail (kmcgr...@apache.org) >> >> >> === Nominated Mentors === >> TODO > > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org