Lei Rui created IOTDB-92:
----------------------------

             Summary: The data locality principle used by Spark loses ground in the face of TsFile.
                 Key: IOTDB-92
                 URL: https://issues.apache.org/jira/browse/IOTDB-92
             Project: Apache IoTDB
          Issue Type: Improvement
            Reporter: Lei Rui
In the development of the TsFile-Spark-Connector, we discovered that the data locality principle used by Spark loses ground in the face of TsFile. We believe the problem is rooted in the storage structure design of TsFile. Our latest implementation of the TsFile-Spark-Connector works around this constraint to guarantee correct query results; a real resolution of the data locality problem is left for future work. The details are below.

h1. 1. Spark Partition

In Apache Spark, data is stored in the form of RDDs and divided into partitions across various nodes. A partition is a logical chunk of a large distributed data set that helps parallelize distributed data processing. Spark works on the data locality principle to minimize the network traffic of sending data between executors.

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
https://techvidvan.com/tutorials/spark-partition/

h1. 2. TsFile Structure

TsFile is a columnar storage file format designed for time series data, which supports efficient compression and query. Data in a TsFile are organized by a device-measurement hierarchy. As the figure below shows, the storage unit of a device is a chunk group and that of a measurement is a chunk; the measurements of a device are grouped together. Under this architecture, different Spark partitions logically contain different sets of chunk groups of a TsFile.

Now consider the query process of this TsFile on Spark. Suppose we run the query "select * from root where d1.s6<2.5 and d2.s1>10". Because the predicate spans two devices whose chunk groups may fall in different partitions, the task scheduled for any single partition has to read the whole data to produce a correct answer. This, however, makes it nearly impossible to apply the data locality principle.

h1. 3. Problem

We can now summarize two problems. The first is how to guarantee the correctness of the query answer integrated from all the partition tasks, without changing the current storage structure of TsFile.
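To make the constraint concrete, here is a minimal sketch of why a cross-device filter defeats a per-chunk-group split. The in-memory layout, device names, and the `evaluate` helper are all hypothetical illustrations, not connector code:

```python
# Hypothetical in-memory model of a TsFile: one chunk group per device,
# each holding (timestamp, value) pairs per measurement.
tsfile = {
    "root.d1": {"s6": [(1, 2.0), (2, 3.0), (3, 1.5)]},
    "root.d2": {"s1": [(1, 20), (2, 5), (3, 30)]},
}

def evaluate(tsfile):
    """Evaluate 'select * from root where d1.s6 < 2.5 and d2.s1 > 10'.

    The predicate joins two devices on timestamp, so it needs the chunk
    groups of BOTH d1 and d2. A Spark partition that holds only one
    device's chunk group cannot evaluate this conjunction on its own.
    """
    s6 = dict(tsfile["root.d1"]["s6"])
    s1 = dict(tsfile["root.d2"]["s1"])
    # Timestamps present for both devices that satisfy both conditions.
    return sorted(t for t in s6.keys() & s1.keys()
                  if s6[t] < 2.5 and s1[t] > 10)

print(evaluate(tsfile))  # → [1, 3]
```

A partition holding only root.d1's chunk group has d1.s6 but not d2.s1, so it can neither accept nor reject any timestamp under this predicate; hence every task needs access to the whole file.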
To solve this problem, we propose converting the space partition constraint into a time partition constraint, while still requiring each task to have access to the whole TsFile data. As shown in the figure below, the task of partition 1 is assigned the yellow-marked time partition constraint; the task of partition 2 is assigned the green-marked time partition constraint; and the task of partition 3 is assigned an empty time partition constraint, because the former two tasks have already completed the query. Since the assigned time ranges are disjoint and together cover the whole time span, the union of the task outputs is exactly the full query answer, with no duplicates.

The second problem is more fundamental: how can we adjust TsFile to enable some degree of data locality when it is queried on Spark?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
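The time-partition assignment described in the issue can be sketched as follows. `assign_time_partitions` is a hypothetical helper, not part of the connector; it only illustrates splitting a time span into disjoint, exhaustive ranges, with surplus partitions receiving an empty constraint:

```python
def assign_time_partitions(min_time, max_time, num_partitions):
    """Split [min_time, max_time] into disjoint time ranges, one per
    Spark partition. Each task still reads the whole TsFile but only
    emits rows whose timestamps fall in its own range, so the union of
    all task outputs is the complete query answer with no duplicates.
    Partitions beyond the data span get None, i.e. an empty constraint.
    """
    span = max_time - min_time + 1
    size = max(1, -(-span // num_partitions))  # ceiling division
    ranges = []
    for i in range(num_partitions):
        lo = min_time + i * size
        hi = min(lo + size - 1, max_time)
        # A range starting past max_time matches nothing: empty constraint.
        ranges.append((lo, hi) if lo <= max_time else None)
    return ranges

# Three tasks over timestamps 1..4: the third task gets an empty
# constraint, mirroring partition 3 in the figure.
print(assign_time_partitions(1, 4, 3))  # → [(1, 2), (3, 4), None]
```

The ranges are pairwise disjoint and cover [min_time, max_time], which is exactly the property that makes the per-task answers safe to union.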