This is an automated email from the ASF dual-hosted git repository. jackietien pushed a commit to branch iotdb in repository https://gitbox.apache.org/repos/asf/tsfile.git
commit d04182ab149c8d48e8db45f0cf96c6f6432607d6 Author: CritasWang <[email protected]> AuthorDate: Tue May 28 17:27:05 2024 +0800 add README-zh (#89) --- README-zh.md | 125 +++++++++++++ README.md | 464 +++++------------------------------------------ cpp/tsfile/README-zh.md | 34 ++++ cpp/tsfile/README.md | 34 ++++ java/tsfile/README-zh.md | 198 ++++++++++++++++++++ java/tsfile/README.md | 183 ++++++++++++++++--- 6 files changed, 594 insertions(+), 444 deletions(-) diff --git a/README-zh.md b/README-zh.md new file mode 100644 index 00000000..11ed40f8 --- /dev/null +++ b/README-zh.md @@ -0,0 +1,125 @@ +<!-- + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. 
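+ +--> + +# TsFile Document +<pre> +___________ ___________.__.__ +\__ ___/____\_ _____/|__| | ____ + | | / ___/| __) | | | _/ __ \ + | | \___ \ | \ | | |_\ ___/ + |____|/____ >\___ / |__|____/\___ > version 1.0.0 + \/ \/ \/ +</pre> +[](http://search.maven.org/#search|gav|1|g:"org.apache.tsfile") + +## Introduction + +TsFile is a columnar storage file format designed for time series data. It supports efficient compression and high read/write throughput, and it is compatible with various frameworks such as Spark and Flink. TsFile is easy to integrate into IoT big data processing frameworks. + +Time series data is data that carries time labels (it changes in time order, i.e. it is serialized by time). It comes from many sources, arrives in huge volumes, and is widely used in IoT, intelligent manufacturing, financial analysis, and other fields. In today's data-driven world, the importance of time series data is self-evident. + +Although time series data is so common and important, its management has long lacked a standardized file format. TsFile provides users with a unified file format for managing time series data. + +[Click for More Information](https://www.timecho.com/archives/tian-bu-shi-chang-kong-bai-apache-tsfile-ru-he-chong-xin-ding-yi-shi-xu-shu-ju-guan-li) + + +## TsFile Features + +Through independent research and development, TsFile achieves efficient management and flexible transfer of time series data, and supports deep integration with many kinds of software. Its features include: + +- Time series model: a data model designed specifically for IoT, in which each time series is associated with a specific device and all devices are connected through a hierarchical structure; + +- Cross-language independent use: SDKs in multiple languages can read and write TsFile directly, which makes some lightweight data read/write scenarios possible. + +- Efficient writing and compression: a columnar storage format tailored for time series that organizes data by device and stores the data of each series contiguously, minimizing storage space. Compared to CSV, the compression ratio can be increased by more than 90%. + +- High query performance: by indexing the device, measurement, and time dimensions, TsFile provides fast filtering and querying of time series data over specific time ranges. Compared to general-purpose file formats, query throughput can be increased by 2-10 times. + +- Open integration: TsFile is the underlying storage file format of the time series database IoTDB, and can form a pluggable architecture with IoTDB that separates storage from compute. TsFile supports seamless ecosystem integration with big data software such as Spark and Flink, ensuring compatibility and interoperability across different data processing environments and enabling deep cross-ecosystem analysis of time series data. + +## TsFile Basic Concepts + +TsFile can manage the time series data of multiple devices. Each device can have different measurements. + +Each measurement of each device corresponds to one time series. + +The TsFile data model (Schema) defines the set of measurements of all devices, as shown in the table below (m1 ~ m5) + +| Time | deviceId | m1 | m2 | m3 | m4 | m5 | +|------|----------|----|----|----|----|----| +| 1 | device1 | 1 | 2 | 3 | | | +| 2 | device1 | 1 | 2 | 3 | | | +| 3 | device2 | 1 | | 3 | 4 | 5 | +| 4 | device2 | 1 | | 3 | 4 | 5 | +| 5 | device3 | 1 | 2 | 3 | 4 | 5 | + +Time and deviceId are built-in fields that do not need to be defined and can be written directly. + +## TsFile Design + +### File Structure + +The file structure of Apache TsFile is as follows. + +- Page: a run of contiguous time series data and the basic unit of storage, sorted by time in ascending order, with timestamps and values stored in separate columns. + +- Chunk: consists of multiple consecutive Pages of the same series; a file can store multiple Chunks for the same series. + +- ChunkGroup: consists of one or more Chunks of one device; multiple Chunks can share a single time column (multi-value model). + +- Index: the metadata at the end of a TsFile contains the time-dimension index within each series and the index information across series. + + + +### Encoding and Compression + +TsFile uses second-order differential encoding, run-length encoding (RLE), bit-packing, Snappy 
and other advanced encoding and compression techniques to optimize the storage and access of time series data, and supports encoding the timestamp column and the value columns separately for better data processing efficiency. + +Its uniqueness lies in encoding algorithms designed specifically for the characteristics of time series data, focusing on the correlation between the time attribute and the data. + +() + + +Built on a deep understanding of the application requirements of time series data, TsFile helps achieve high compression ratios and real-time access speed for time series data, and provides the underlying file technology for enterprises to build efficient, scalable, and flexible data analytics platforms. + +| Data Type | Recommended Encoding | Recommended Compression | +|---------|------------|--------| +| INT32 | TS_2DIFF | LZ4 | +| INT64 | TS_2DIFF | LZ4 | +| FLOAT | GORILLA | LZ4 | +| DOUBLE | GORILLA | LZ4 | +| BOOLEAN | RLE | LZ4 | +| TEXT | DICTIONARY | LZ4 | + +For more encoding and compression types, see the [documentation](https://iotdb.apache.org/zh/UserGuide/latest/Basic-Concept/Encoding-and-Compression.html) + +## Develop TsFile + +[Java](./java/tsfile/README-zh.md#开发) + +[C++](./cpp/tsfile/README-zh.md#开发) + + +## Use TsFile + +[Java](./java/tsfile/README-zh.md#使用) + +[C++](./cpp/tsfile/README-zh.md#使用) \ No newline at end of file diff --git a/README.md b/README.md index 9219ee2c..7b0c8b05 100644 --- a/README.md +++ b/README.md @@ -25,53 +25,56 @@ ___________ ___________.__.__ \__ ___/____\_ _____/|__| | ____ | | / ___/| __) | | | _/ __ \ | | \___ \ | \ | | |_\ ___/ - |____|/____ >\___ / |__|____/\___ > version 1.0.1-SNAPSHOT + |____|/____ >\___ / |__|____/\___ > version 1.0.0 \/ \/ \/ </pre> [](http://search.maven.org/#search|gav|1|g:"org.apache.tsfile") -## Abstract +## Introduction TsFile is a columnar storage file format designed for time series data, which supports efficient compression, high throughput of read and write, and compatibility with various frameworks, such as Spark and Flink. It is easy to integrate TsFile into IoT big data processing frameworks. -[Click for More Information](https://www.timecho.com/archives/tian-bu-shi-chang-kong-bai-apache-tsfile-ru-he-chong-xin-ding-yi-shi-xu-shu-ju-guan-li) - -## Motivation - Time series data is becoming increasingly important in a wide range of applications, including IoT, intelligent control, finance, log analysis, and monitoring systems. -TsFile is the first existing standard file format for time series data. 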
The industry companies usually write time series data without unification, or use general columnar file format, which makes data collection and processing complicated without a standard. With TsFile, organizations could write data in TsFile inside end devices or gateway, then transfer TsFile to the cloud for unified management in IoTDB and other systems. In this way, we lower the network transmission and the computin [...] +TsFile is the first existing standard file format for time series data. Despite the widespread presence and significance of time series data, there has long been no standardized file format for managing it. TsFile provides users with a unified file format for managing time series data. -TsFile is a specially designed file format rather than a database. Users can open, write, read, and close a TsFile easily like doing operations on a normal file. Besides, more interfaces are available on a TsFile. +[Click for More Information](https://www.timecho-global.com/archives/apache-tsfile-time-series-data-storage-redefined) -TsFile offers several distinctive features and benefits: +## TsFile Features -* Efficient Storage and Compression: TsFile employs advanced compression techniques to minimize storage requirements, resulting in reduced disk space consumption and improved system efficiency. +TsFile offers several distinctive features and benefits: -* Flexible Schema and Metadata Management: TsFile allows for directly write data without pre defining the schema, which is flexible for data aquisition. +- Multi-Language Independent Use: SDKs in multiple languages can read and write TsFile directly, which makes some lightweight data read/write scenarios possible. -* High Query Performance with time range: TsFile has indexed devices, sensors and time dimensions to accelerate query performance, enabling fast filtering and retrieval of time series data. 
+- Efficient Writing and Compression: A columnar storage format tailored for time series, organizing data by device and storing the data of each series contiguously, minimizing storage space. Compared to CSV, the compression ratio can be increased by more than 90%. -* Seamless Integration: TsFile is designed to seamlessly integrate with existing time series databases such as IoTDB, data processing frameworks, such as Spark and Flink. +- High Query Performance: By indexing the device, measurement, and time dimensions, TsFile implements fast filtering and querying of time series data based on specific time ranges. Compared to general file formats, query throughput can be increased by 2-10 times. +- Open Integration: TsFile is the underlying storage file format of the time series database IoTDB, and can form a pluggable architecture with IoTDB that separates storage from compute. TsFile supports seamless ecosystem integration with Spark, Flink, and other big data software, ensuring compatibility and interoperability across different data processing environments and enabling deep analysis of time series data across ecosystems. -# Features +## TsFile Basic Concepts -When conceptualizing the structure of TsFile, there were several key considerations: +TsFile can manage the time series data of multiple devices. Each device can have different measurements. -- Efficient Compression: Recognizing the importance of space optimization, TsFile compresses data extensively to minimize storage requirements. +Each measurement of each device corresponds to a time series. -- Device Packing: Multiple devices are packed together to reduce the number of files, streamlining data management. +The TsFile Schema defines the set of measurements for all devices, as shown in the table below (m1~m5) -- Data Locality: Time series data expected to be queried together are kept close in physical locations to enhance query performance. 
+| Time | deviceId | m1 | m2 | m3 | m4 | m5 | +|------|----------|----|----|----|----|----| +| 1 | device1 | 1 | 2 | 3 | | | +| 2 | device1 | 1 | 2 | 3 | | | +| 3 | device2 | 1 | | 3 | 4 | 5 | +| 4 | device2 | 1 | | 3 | 4 | 5 | +| 5 | device3 | 1 | 2 | 3 | 4 | 5 | -- Disk Fragmentation: TsFile ensures data is packed with sizes aligned with file systems to avoid disk fragmentation. +Among them, Time and deviceId are built-in fields that do not need to be defined and can be written directly. -- Efficient Access: With millions of time series needing efficient access, TsFile is optimized for rapid data retrieval. +## TsFile Design -# Columnar Storage and File Structure +### File Structure -TsFile adopts a columnar storage design, similar to other file formats, primarily to optimize time-series data's storage efficiency and query performance. This design aligns with the nature of time series data, which often involves large volumes of similar data types recorded over time. However, TsFile was developed particularly with a structure of page, chunk, chunk group, block, and index: +TsFile adopts a columnar storage design, similar to other file formats, primarily to optimize time-series data's storage efficiency and query performance. This design aligns with the nature of time series data, which often involves large volumes of similar data types recorded over time. However, TsFile was developed particularly with a structure of page, chunk, chunk group, and index: - Page: The basic unit for storing time series data, sorted by time in ascending order with separate columns for timestamps and values. @@ -79,18 +82,15 @@ TsFile adopts a columnar storage design, similar to other file formats, primaril - Chunk Group: Multiple chunks within a chunk group belong to one or multiple series of a device written in the same period, facilitating efficient query processing. 
-- Block: Buffered in memory before being flushed to TsFile, all chunk groups form a block, allowing for efficient data locality in distributed file systems like HDFS. - - Index: The file metadata at the end of TsFile contains a chunk-level index and file-level statistics for efficient data access. -The following diagram illustrates TsFile's innovative columnar storage design, showcasing the efficiency of its page, chunk, and block structure. - + +## Encoding and Compression - +TsFile employs advanced encoding and compression techniques to optimize storage and access for time series data. It uses methods like run-length encoding (RLE), bit-packing, and Snappy for efficient compression, allowing separate encoding of timestamp and value columns for better data processing. Its unique encoding algorithms are designed specifically for the characteristics of time series data in IoT scenarios, focusing on regular time intervals and the correlation among series. -# Encoding and Compression Techniques -TsFile employs advanced encoding and compression techniques to optimize storage and access for time series data. It uses methods like run-length encoding (RLE), bit-packing, and Snappy for efficient compression, allowing separate encoding of timestamp and value columns for better data processing. Its unique encoding algorithms are designed specifically for the characteristics of time series data in IoT scenarios, focusing on regular time intervals and the correlation among series. Additi [...] +Its uniqueness lies in the encoding algorithm designed specifically for time series data characteristics, focusing on the correlation between time attributes and data. The table below compares 3 file formats in different dimensions. @@ -99,402 +99,26 @@ The table below compares 3 file formats in different dimensions. 
Its development facilitates efficient data encoding, compression, and access, reflecting a deep understanding of industry needs, pioneering a path toward efficient, scalable, and flexible data analytics platforms. -# Building With Java - -## Prerequisites - -To build TsFile wirh Java, you need to have: - -1. Java >= 1.8 (1.8, 11 to 17 are verified. Please make sure the environment path has been set accordingly). -2. Maven >= 3.6 (If you want to compile TsFile from source code). - - -## Build TsFile with Maven - -``` -mvn clean package -P with-java -DskipTests -``` - -## Install to local machine - -``` -mvn install -P with-java -DskipTests -``` +| Data Type | Recommended Encoding | Recommended Compression | +|---------|------------|--------| +| INT32 | TS_2DIFF | LZ4 | +| INT64 | TS_2DIFF | LZ4 | +| FLOAT | GORILLA | LZ4 | +| DOUBLE | GORILLA | LZ4 | +| BOOLEAN | RLE | LZ4 | +| TEXT | DICTIONARY | LZ4 | -# Add TsFile as a dependency in Maven +more see[Docs](https://iotdb.apache.org/UserGuide/latest/Basic-Concept/Encoding-and-Compression.html) -The current SNAPSHOT version is `1.0.1-SNAPSHOT`, you can use it after Maven install +## Build TsFile -```xml -<dependencies> - <dependency> - <groupId>org.apache.tsfile</groupId> - <artifactId>tsfile-java</artifactId> - <version>1.0.1-SNAPSHOT</version> - </dependency> -<dependencies> -``` - -The current release version is `1.0.0` - -```xml -<dependencies> - <dependency> - <groupId>org.apache.tsfile</groupId> - <artifactId>tsfile</artifactId> - <version>1.0.0</version> - </dependency> -<dependencies> -``` - -# TsFile Java API - -## Write TsFile - -1. construct a `TsFileWriter` instance. - * Without pre-defined schema - - ```java - public TsFileWriter(File file) throws IOException - ``` - * With pre-defined schema - - ```java - public TsFileWriter(File file, Schema schema) throws IOException - ``` - This one is for using the HDFS file system. `TsFileOutput` can be an instance of class `HDFSOutput`. 
- - ```java - public TsFileWriter(TsFileOutput output, Schema schema) throws IOException - ``` - - If you want to set some TSFile configuration on your own, you could use param `config`. For example: - - ```java - TSFileConfig conf = new TSFileConfig(); - conf.setTSFileStorageFs("HDFS"); - TsFileWriter tsFileWriter = new TsFileWriter(file, schema, conf); - ``` - - In this example, data files will be stored in HDFS, instead of local file system. If you'd like to store data files in local file system, you can use `conf.setTSFileStorageFs("LOCAL")`, which is also the default config. - - You can also config the ip and rpc port of your HDFS by `config.setHdfsIp(...)` and `config.setHdfsPort(...)`. The default ip is `localhost` and default rpc port is `9000`. - - **Parameters:** - - * file : The TsFile to write - - * schema : The file schemas, will be introduced in next part. - - * config : The config of TsFile. -2. add measurements - - Or you can make an instance of class `Schema` first and pass this to the constructor of class `TsFileWriter` - - The class `Schema` contains a map whose key is the name of one measurement schema, and the value is the schema itself. 
- - Here are the interfaces: - - ```java - // Create an empty Schema or from an existing map - public Schema() - public Schema(Map<String, MeasurementSchema> measurements) - // Use this two interfaces to add measurements - public void registerMeasurement(MeasurementSchema descriptor) - public void registerMeasurements(Map<String, MeasurementSchema> measurements) - // Some useful getter and checker - public TSDataType getMeasurementDataType(String measurementId) - public MeasurementSchema getMeasurementSchema(String measurementId) - public Map<String, MeasurementSchema> getAllMeasurementSchema() - public boolean hasMeasurement(String measurementId) - ``` - - You can always use the following interface in `TsFileWriter` class to add additional measurements: - - ```java - public void addMeasurement(MeasurementSchema measurementSchema) throws WriteProcessException - ``` - - The class `MeasurementSchema` contains the information of one measurement, there are several constructors: - ```java - public MeasurementSchema(String measurementId, TSDataType type, TSEncoding encoding) - public MeasurementSchema(String measurementId, TSDataType type, TSEncoding encoding, CompressionType compressionType) - public MeasurementSchema(String measurementId, TSDataType type, TSEncoding encoding, CompressionType compressionType, - Map<String, String> props) - ``` - - **Parameters:** - - * measurementID: The name of this measurement, typically the name of the sensor. - - * type: The data type, now support six types: `BOOLEAN`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `TEXT`; - - * encoding: The data encoding. - - * compression: The data compression. - - * props: Properties for special data types.Such as `max_point_number` for `FLOAT` and `DOUBLE`, `max_string_length` for - `TEXT`. Use as string pairs into a map such as ("max_point_number", "3"). - - > **Notice:** Although one measurement name can be used in multiple deltaObjects, the properties cannot be changed. I.e. 
- it's not allowed to add one measurement name for multiple times with different type or encoding. - Here is a bad example: - - ```java - // The measurement "sensor_1" is float type - addMeasurement(new MeasurementSchema("sensor_1", TSDataType.FLOAT, TSEncoding.RLE)); - - // This call will throw a WriteProcessException exception - addMeasurement(new MeasurementSchema("sensor_1", TSDataType.INT32, TSEncoding.RLE)); - ``` - ``` - - ``` - -3. insert and write data continually. - - Use this interface to create a new `TSRecord`(a timestamp and device pair). - - ```java - public TSRecord(long timestamp, String deviceId) - ``` - ``` - Then create a `DataPoint`(a measurement and value pair), and use the addTuple method to add the DataPoint to the correct - TsRecord. - - Use this method to write - - ```java - public void write(TSRecord record) throws IOException, WriteProcessException - ``` - -4. call `close` to finish this writing process. - - ```java - public void close() throws IOException - ``` - -We are also able to write data into a closed TsFile. - -1. Use `ForceAppendTsFileWriter` to open a closed file. - - ```java - public ForceAppendTsFileWriter(File file) throws IOException - ``` - -2. call `doTruncate` truncate the part of Metadata - -3. Then use `ForceAppendTsFileWriter` to construct a new `TsFileWriter` - -```java -public TsFileWriter(TsFileIOWriter fileWriter) throws IOException -``` -Please note, we should redo the step of adding measurements before writing new data to the TsFile. - -### Example - -You could write a TsFile by constructing **TSRecord** if you have the **non-aligned** (e.g. not all sensors contain values) time series data. - -A more thorough example can be found at `java/examples/src/main/java/org/apache/tsfile/tsfile/TsFileWriteWithTSRecord.java` - -You could write a TsFile by constructing **Tablet** if you have the **aligned** time series data. 
- -A more thorough example can be found at `java/examples/src/main/java/org/apache/tsfile/tsfile/TsFileWriteWithTablet.java` - -You could write data into a closed TsFile by using **ForceAppendTsFileWriter**. - -A more thorough example can be found at `java/examples/src/main/java/org/apache/tsfile/tsfile/TsFileForceAppendWrite.java` - -## Interface for Reading TsFile - -* Definition of Path - -A path is a dot-separated string which uniquely identifies a time-series in TsFile, e.g., "root.area_1.device_1.sensor_1". -The last section "sensor_1" is called "measurementId" while the remaining parts "root.area_1.device_1" is called deviceId. -As mentioned above, the same measurement in different devices has the same data type and encoding, and devices are also unique. - -In read interfaces, The parameter `paths` indicates the measurements to be selected. - -Path instance can be easily constructed through the class `Path`. For example: - -```java -Path p = new Path("device_1.sensor_1"); -``` - -We will pass an ArrayList of paths for final query call to support multiple paths. - -```java -List<Path> paths = new ArrayList<Path>(); -paths.add(new Path("device_1.sensor_1")); -paths.add(new Path("device_1.sensor_3")); -``` - -> **Notice:** When constructing a Path, the format of the parameter should be a dot-separated string, the last part will - be recognized as measurementId while the remaining parts will be recognized as deviceId. - - -* Definition of Filter - - * Usage Scenario -Filter is used in TsFile reading process to select data satisfying one or more given condition(s). - - * IExpression -The `IExpression` is a filter expression interface and it will be passed to our final query call. -We create one or more filter expressions and may use binary filter operators to link them to our final expression. - -* **Create a Filter Expression** - - There are two types of filters. - - * TimeFilter: A filter for `time` in time-series data. 
- ``` - IExpression timeFilterExpr = new GlobalTimeExpression(TimeFilter); - ``` - Use the following relationships to get a `TimeFilter` object (value is a long int variable). - - |Relationship|Description| - |---|---| - |TimeFilter.eq(value)|Choose the time equal to the value| - |TimeFilter.lt(value)|Choose the time less than the value| - |TimeFilter.gt(value)|Choose the time greater than the value| - |TimeFilter.ltEq(value)|Choose the time less than or equal to the value| - |TimeFilter.gtEq(value)|Choose the time greater than or equal to the value| - |TimeFilter.notEq(value)|Choose the time not equal to the value| - |TimeFilter.not(TimeFilter)|Choose the time not satisfy another TimeFilter| - - * ValueFilter: A filter for `value` in time-series data. - - ``` - IExpression valueFilterExpr = new SingleSeriesExpression(Path, ValueFilter); - ``` - The usage of `ValueFilter` is the same as using `TimeFilter`, just to make sure that the type of the value - equal to the measurement's(defined in the path). - -* **Binary Filter Operators** +[Java](./java/tsfile/README.md#building-with-java) - Binary filter operators can be used to link two single expressions. - - * BinaryExpression.and(Expression, Expression): Choose the value satisfy for both expressions. - * BinaryExpression.or(Expression, Expression): Choose the value satisfy for at least one expression. 
- - -Filter Expression Examples - -* **TimeFilterExpression Examples** - - ```java - IExpression timeFilterExpr = new GlobalTimeExpression(TimeFilter.eq(15)); // series time = 15 - ``` -``` - ```java - IExpression timeFilterExpr = new GlobalTimeExpression(TimeFilter.ltEq(15)); // series time <= 15 -``` -```java - IExpression timeFilterExpr = new GlobalTimeExpression(TimeFilter.lt(15)); // series time < 15 -``` - ```java -IExpression timeFilterExpr = new GlobalTimeExpression(TimeFilter.gtEq(15)); // series time >= 15 - ``` - ```java - IExpression timeFilterExpr = new GlobalTimeExpression(TimeFilter.notEq(15)); // series time != 15 -``` - ```java - IExpression timeFilterExpr = BinaryExpression.and( - new GlobalTimeExpression(TimeFilter.gtEq(15L)), - new GlobalTimeExpression(TimeFilter.lt(25L))); // 15 <= series time < 25 -``` - ```java - IExpression timeFilterExpr = BinaryExpression.or( - new GlobalTimeExpression(TimeFilter.gtEq(15L)), - new GlobalTimeExpression(TimeFilter.lt(25L))); // series time >= 15 or series time < 25 - ``` -* Read Interface - -First, we open the TsFile and get a `ReadOnlyTsFile` instance from a file path string `path`. - -```java -TsFileSequenceReader reader = new TsFileSequenceReader(path); - -ReadOnlyTsFile readTsFile = new ReadOnlyTsFile(reader); -``` -Next, we prepare the path array and query expression, then get final `QueryExpression` object by this interface: - -```java -QueryExpression queryExpression = QueryExpression.create(paths, statement); -``` - -The ReadOnlyTsFile class has two `query` method to perform a query. -* **Method 1** - - ```java - public QueryDataSet query(QueryExpression queryExpression) throws IOException - ``` - -* **Method 2** - - ```java - public QueryDataSet query(QueryExpression queryExpression, long partitionStartOffset, long partitionEndOffset) throws IOException - ``` - - This method is designed for advanced applications such as the TsFile-Spark Connector. 
- - * **params** : For method 2, two additional parameters are added to support partial query: - * ```partitionStartOffset```: start offset for a TsFile - * ```partitionEndOffset```: end offset for a TsFile - - > **What is Partial Query ?** - > - > In some distributed file systems(e.g. HDFS), a file is split into severval parts which are called "Blocks" and stored in different nodes. Executing a query paralleled in each nodes involved makes better efficiency. Thus Partial Query is needed. Paritial Query only selects the results stored in the part split by ```QueryConstant.PARTITION_START_OFFSET``` and ```QueryConstant.PARTITION_END_OFFSET``` for a TsFile. - -* QueryDataset Interface - -The query performed above will return a `QueryDataset` object. - -Here's the useful interfaces for user. - - * `bool hasNext();` - - Return true if this dataset still has elements. - * `List<Path> getPaths()` - - Get the paths in this data set. - * `List<TSDataType> getDataTypes();` - - Get the data types. The class TSDataType is an enum class, the value will be one of the following: - - BOOLEAN, - INT32, - INT64, - FLOAT, - DOUBLE, - TEXT; - * `RowRecord next() throws IOException;` - - Get the next record. - - The class `RowRecord` consists of a `long` timestamp and a `List<Field>` for data in different sensors, - we can use two getter methods to get them. - - ```java - long getTimestamp(); - List<Field> getFields(); - ``` - - To get data from one Field, use these methods: - - ```java - TSDataType getDataType(); - Object getObjectValue(); - ``` - - - -### Example +[C++](./cpp/tsfile/README.md#build) -You should install TsFile to your local maven repository. 
+## Use TsFile +[Java](./java/tsfile/README.md#use-tsfile) -A more thorough example with query statement can be found at -`java/examples/src/main/java/org/apache/tsfile/TsFileRead.java` -`java/examples/src/main/java/org/apache/tsfile/TsFileSequenceRead.java` \ No newline at end of file +[C++](./cpp/tsfile/README.md#use-tsfile) diff --git a/cpp/tsfile/README-zh.md b/cpp/tsfile/README-zh.md new file mode 100644 index 00000000..0878f43a --- /dev/null +++ b/cpp/tsfile/README-zh.md @@ -0,0 +1,34 @@ +<!-- + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +--> + +# TsFile C++ Document +<pre> +___________ ___________.__.__ +\__ ___/____\_ _____/|__| | ____ + | | / ___/| __) | | | _/ __ \ + | | \___ \ | \ | | |_\ ___/ + |____|/____ >\___ / |__|____/\___ > version 1.0.0 + \/ \/ \/ +</pre> + +## 开发 + +## 使用 \ No newline at end of file diff --git a/cpp/tsfile/README.md b/cpp/tsfile/README.md new file mode 100644 index 00000000..e6db19b1 --- /dev/null +++ b/cpp/tsfile/README.md @@ -0,0 +1,34 @@ +<!-- + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. 
The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +--> + +# TsFile C++ Document +<pre> +___________ ___________.__.__ +\__ ___/____\_ _____/|__| | ____ + | | / ___/| __) | | | _/ __ \ + | | \___ \ | \ | | |_\ ___/ + |____|/____ >\___ / |__|____/\___ > version 1.0.0 + \/ \/ \/ +</pre> + +## Build + +## Use TsFile \ No newline at end of file diff --git a/java/tsfile/README-zh.md b/java/tsfile/README-zh.md new file mode 100644 index 00000000..45820503 --- /dev/null +++ b/java/tsfile/README-zh.md @@ -0,0 +1,198 @@ +<!-- + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. 
+ +--> + +# TsFile Java Document +<pre> +___________ ___________.__.__ +\__ ___/____\_ _____/|__| | ____ + | | / ___/| __) | | | _/ __ \ + | | \___ \ | \ | | |_\ ___/ + |____|/____ >\___ / |__|____/\___ > version 1.0.0 + \/ \/ \/ +</pre> + +## 开发 + +### 前置条件 + +构建 Java 版的 TsFile,必须要安装以下依赖: + +1. Java >= 1.8 (1.8, 11 到 17 都经过验证. 请确保设置了环境变量). +2. Maven >= 3.6 (如果要从源代码编译TsFile). + + +### 使用 maven 构建 + +``` +mvn clean package -P with-java -DskipTests +``` + +### 安装到本地机器 + +``` +mvn install -P with-java -DskipTests +``` + +## 使用 + +### 在 Maven 中添加 TsFile 依赖 + +当前发布版本是 `1.0.0`,可以这样引用 + +```xml +<dependencies> + <dependency> + <groupId>org.apache.tsfile</groupId> + <artifactId>tsfile</artifactId> + <version>1.0.0</version> + </dependency> +<dependencies> +``` + +当前 SNAPSHOT 版本是 `1.0.1-SNAPSHOT`, 可以这样引用 + +```xml +<dependencies> + <dependency> + <groupId>org.apache.tsfile</groupId> + <artifactId>tsfile-java</artifactId> + <version>1.0.1-SNAPSHOT</version> + </dependency> +<dependencies> +``` + +### TsFile Java API + +#### 写入 TsFile +TsFile 可以通过以下三个步骤生成,完整的代码参见"写入 TsFile 示例"章节。 + +1. 注册元数据 (Schema) + + 创建一个`Schema`类的实例。 + + `Schema`类保存的是一个映射关系,key 是一个 measurement 的名字,value 是 measurement schema. + + 下面是一系列接口: + + ```java + + /** + * measurementID: 物理量的名称,通常是传感器的名称 + * type: 数据类型,现在支持六种类型:`BOOLEAN`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `TEXT` + * encoding: 编码类型 + */ + public MeasurementSchema(String measurementId, TSDataType type, TSEncoding encoding) // 默认使用 LZ4 压缩算法 + + // 使用预定义的 measurement 列表初始化 Schema + public Schema(Map<String, MeasurementSchema> measurements) + + /** + * 构造 TsFileWriter 进行数据写入 + * file : 写入 TsFile 数据的文件 + * schema : 文件的 schemas + */ + public TsFileWriter(File file, Schema schema) throws IOException + ``` + +2. 
Use `TsFileWriter` to write data. + + ```java + /** + * Use this constructor to create a new `TSRecord` (a timestamp and device pair) + */ + public TSRecord(long timestamp, String deviceId) + + /** + * Create a `DataPoint` (a measurement and value pair), and use the addTuple method to add the DataPoint to the correct TSRecord. + */ + for (IMeasurementSchema schema : schemas) { + tsRecord.addTuple( + DataPoint.getDataPoint( + schema.getType(), + schema.getMeasurementId(), + Objects.requireNonNull(DataGenerator.generate(schema.getType(), (int) startValue)) + .toString())); + startValue++; + } + /** + * Write the data + */ + public void write(TSRecord record) throws IOException, WriteProcessException + ``` + +3. Call the `close` method to close the file; queries can only be run after the file is closed. + + ```java + public void close() throws IOException + ``` + +Write TsFile Example + +[Write data by constructing TSRecord](../examples/src/main/java/org/apache/tsfile/TsFileWriteAlignedWithTSRecord.java). + +[Write data by constructing Tablet](../examples/src/main/java/org/apache/tsfile/TsFileWriteAlignedWithTablet.java). + + +#### Read TsFile + +* Construct the query expression +```java +/** + * Construct the time series to read + * A time series is written as deviceId.measurementId (the deviceId itself may contain `.`) 
+ */ +List<Path> paths = new ArrayList<Path>(); +paths.add(new Path("device_1.sensor_1")); +paths.add(new Path("device_1.sensor_3")); + +/** + * Construct a time range filter + */ +IExpression timeFilterExpr = BinaryExpression.and( + new GlobalTimeExpression(TimeFilter.gtEq(15L)), + new GlobalTimeExpression(TimeFilter.lt(25L))); // 15 <= time < 25 + +/** + * Construct the complete query expression + */ +QueryExpression queryExpression = QueryExpression.create(paths, timeFilterExpr); +``` + +* Read data + +```java +/** + * Construct a `ReadOnlyTsFile` instance from the file path `filePath`. + */ +TsFileSequenceReader reader = new TsFileSequenceReader(filePath); +ReadOnlyTsFile readTsFile = new ReadOnlyTsFile(reader); + +/** + * Query data + */ +public QueryDataSet query(QueryExpression queryExpression) throws IOException +``` + +Read TsFile Example + +[Query data](../examples/src/main/java/org/apache/tsfile/TsFileRead.java) + +[Read the whole file sequentially](../examples/src/main/java/org/apache/tsfile/TsFileSequenceRead.java) diff --git a/java/tsfile/README.md b/java/tsfile/README.md index 2afa2fb9..3706dff7 100644 --- a/java/tsfile/README.md +++ b/java/tsfile/README.md @@ -19,7 +19,7 @@ --> -# TsFile Document +# TsFile Java Document <pre> ___________ ___________.__.__ \__ ___/____\_ _____/|__| | ____ @@ -28,36 +28,171 @@ ___________ ___________.__.__ |____|/____ >\___ / |__|____/\___ > version 1.0.0 \/ \/ \/ </pre> -## Abstract -TsFile is a columnar storage file format designed for time series data, which supports efficient compression and query. It is easy to integrate TsFile into your IoT big data processing frameworks. +## Building With Java +### Prerequisites -## Motivation +To build TsFile with Java, you need to have: -Nowadays, the implementation of IoT is becoming increasingly popular in areas such as Industry 4.0, Smart Home, wearables and Connected Healthcare. 
Compared with traditional IT infrastructure usage monitoring scenarios, applications like intelligent control and alarm reporting stimulate more advanced analytics requirements on time series data generated by sensors. Especially when IoT dives into the industrial Internet, intelligent equipment produces one to two orders of magnitude of data m [...] +1. Java >= 1.8 (1.8 and 11 through 17 are verified; make sure the environment path has been set accordingly). +2. Maven >= 3.6 (if you want to compile TsFile from source code). -Recent advances in time series data management systems were developed for data center monitoring. Currently there is no file format optimized specifically for time series data in the above scenarios. So TsFile was born. TsFile is a specially designed file format rather than a database. Users can open, write, read, and close a TsFile as easily as operating on a normal file. Besides, more interfaces are available on a TsFile. -The target of the TsFile project is to support: a high ingestion rate of up to tens of millions of data points per second, with rare updates only for the correction of low-quality data; compact data packaging and deep compression for long-lived historical data; traditional sequential and conditional queries, complex exploratory queries, signal processing, data mining and machine learning. +### Build TsFile with Maven -The features of TsFile are as follows: +``` +mvn clean package -P with-java -DskipTests +``` -* **Write** - * Fast data import - * Efficient compression - * Diverse data encoding types -* **Read** - * Efficient query - * Time-sorted query data set -* **Integration** - * HDFS - * Spark and Hive - * etc. 
+### Install to local machine -## Online Documents -* [Installation](https://github.com/thulab/tsfile/wiki/Installation) -* [Get Started](https://github.com/thulab/tsfile/wiki/Get-Started) -* [TsFile-Spark Connector](https://github.com/thulab/tsfile/wiki/TsFile-Spark-Connector) +``` +mvn install -P with-java -DskipTests +``` - +## Use TsFile + +### Add TsFile as a dependency in Maven + +The current release version is `1.0.0`: + +```xml +<dependencies> + <dependency> + <groupId>org.apache.tsfile</groupId> + <artifactId>tsfile</artifactId> + <version>1.0.0</version> + </dependency> +</dependencies> +``` + +The current SNAPSHOT version is `1.0.1-SNAPSHOT`; you can use it after running the Maven install step: + +```xml +<dependencies> + <dependency> + <groupId>org.apache.tsfile</groupId> + <artifactId>tsfile-java</artifactId> + <version>1.0.1-SNAPSHOT</version> + </dependency> +</dependencies> +``` + +### TsFile Java API + +#### Write TsFile +A TsFile can be generated in the following three steps; the complete code can be found in the "Write TsFile Example" section. + +1. Register Schema + + Create an instance of class `Schema` first and pass it to the constructor of class `TsFileWriter`. + + The class `Schema` holds a map whose keys are measurement names and whose values are the corresponding measurement schemas. 
+ + Here are the interfaces: + + ```java + + /** + * measurementId: The name of this measurement, typically the name of the sensor + * type: The data type; six types are currently supported: `BOOLEAN`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `TEXT` + * encoding: The data encoding + */ + public MeasurementSchema(String measurementId, TSDataType type, TSEncoding encoding) // uses LZ4 compression by default + + // Initialize the schema with predefined measurements + public Schema(Map<String, MeasurementSchema> measurements) + + /** + * Construct a TsFileWriter for writing + * file : The TsFile to write + * schema : The file schemas + */ + public TsFileWriter(File file, Schema schema) throws IOException + ``` + +2. Use `TsFileWriter` to write data. + + ```java + /** + * Use this constructor to create a new `TSRecord` (a timestamp and device pair) + */ + public TSRecord(long timestamp, String deviceId) + + /** + * Then create a `DataPoint` (a measurement and value pair), and use the addTuple method to add the DataPoint to the correct TSRecord. + */ + for (IMeasurementSchema schema : schemas) { + tsRecord.addTuple( + DataPoint.getDataPoint( + schema.getType(), + schema.getMeasurementId(), + Objects.requireNonNull(DataGenerator.generate(schema.getType(), (int) startValue)) + .toString())); + startValue++; + } + /** + * Write the data + */ + public void write(TSRecord record) throws IOException, WriteProcessException + ``` + +3. Call `close` to finish the writing process; queries can only be performed after the file is closed. 
+ + ```java + public void close() throws IOException + ``` + +Write TsFile Example + +[Write data by constructing TSRecord](../examples/src/main/java/org/apache/tsfile/TsFileWriteAlignedWithTSRecord.java). + +[Write data by constructing Tablet](../examples/src/main/java/org/apache/tsfile/TsFileWriteAlignedWithTablet.java). + + +#### Read TsFile + +* Construct Query Expression +```java +/** + * Construct the time series to be read + * A time series is written as deviceId.measurementId (the deviceId itself may contain `.`) + */ +List<Path> paths = new ArrayList<Path>(); +paths.add(new Path("device_1.sensor_1")); +paths.add(new Path("device_1.sensor_3")); + +/** + * Construct Time Filter + */ +IExpression timeFilterExpr = BinaryExpression.and( + new GlobalTimeExpression(TimeFilter.gtEq(15L)), + new GlobalTimeExpression(TimeFilter.lt(25L))); // 15 <= time < 25 + +/** + * Construct Full Query Expression + */ +QueryExpression queryExpression = QueryExpression.create(paths, timeFilterExpr); +``` + +* Read Data + +```java +/** + * Construct an instance of `ReadOnlyTsFile` based on the file path `filePath`. + */ +TsFileSequenceReader reader = new TsFileSequenceReader(filePath); +ReadOnlyTsFile readTsFile = new ReadOnlyTsFile(reader); + +/** + * Query Data + */ +public QueryDataSet query(QueryExpression queryExpression) throws IOException +``` + +Read TsFile Example + +[Read Data](../examples/src/main/java/org/apache/tsfile/TsFileRead.java) + +[Sequence Read Data](../examples/src/main/java/org/apache/tsfile/TsFileSequenceRead.java) \ No newline at end of file
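The time filter in the read example combines `TimeFilter.gtEq(15L)` and `TimeFilter.lt(25L)` with a logical AND, giving the half-open range `15 <= time < 25`. As a minimal sketch of that semantics only, the plain-Java model below applies the same two bounds to a handful of timestamps. It does not use the TsFile library: the class `TimeRangeFilterSketch` and its helpers `gtEq`, `lt` and `select` are hypothetical stand-ins introduced for illustration, and the sample timestamps are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

/** Plain-Java model of the README's time filter; none of these names are TsFile APIs. */
final class TimeRangeFilterSketch {

  /** Hypothetical stand-in for TimeFilter.gtEq(bound): accepts timestamps >= bound. */
  static LongPredicate gtEq(long bound) {
    return t -> t >= bound;
  }

  /** Hypothetical stand-in for TimeFilter.lt(bound): accepts timestamps < bound. */
  static LongPredicate lt(long bound) {
    return t -> t < bound;
  }

  /** Returns the timestamps accepted by the filter, in input order. */
  static List<Long> select(long[] timestamps, LongPredicate filter) {
    List<Long> kept = new ArrayList<>();
    for (long t : timestamps) {
      if (filter.test(t)) {
        kept.add(t);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Same shape as BinaryExpression.and(gtEq(15L), lt(25L)) in the README.
    LongPredicate range = gtEq(15L).and(lt(25L));
    long[] timestamps = {10L, 14L, 15L, 20L, 24L, 25L, 30L};
    System.out.println(select(timestamps, range)); // prints [15, 20, 24]
  }
}
```

The lower bound is inclusive and the upper bound is exclusive, so 15 is kept and 25 is dropped. Adjacent ranges such as `[15, 25)` and `[25, 35)` can therefore be queried one after another without reading any point twice.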
