Now Jenkins CI is not stable (from Build #96 to #104). It is caused by one of your previous PRs about the TsFile-Spark-Connector (#108 or #173).
I notice that PR #177 is for fixing that. You can attach it to this JIRA
issue. Looking forward to the fix ASAP.
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

Lei Rui (JIRA) <j...@apache.org> wrote on Fri, May 17, 2019 at 12:21 AM:

> Lei Rui created IOTDB-92:
> ----------------------------
>
> Summary: The data locality principle used by Spark loses
> ground in the face of TsFile.
> Key: IOTDB-92
> URL: https://issues.apache.org/jira/browse/IOTDB-92
> Project: Apache IoTDB
> Issue Type: Improvement
> Reporter: Lei Rui
>
>
> In the development of the TsFile-Spark-Connector, we discovered that
> the data locality principle used by Spark loses ground in the face of
> TsFile. We believe the problem is rooted in the storage structure design
> of TsFile. Our latest implementation of the TsFile-Spark-Connector finds
> a way to guarantee correct functionality despite the constraint. The
> resolution of the data locality problem is left for future work. Below
> are the details.
> h1. 1. Spark Partition
>
> In Apache Spark, data is stored in the form of RDDs and divided into
> partitions across various nodes. A partition is a logical chunk of a
> large distributed data set that helps parallelize distributed data
> processing. Spark works on the data locality principle to minimize the
> network traffic of sending data between executors.
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
> https://techvidvan.com/tutorials/spark-partition/
> h1. 2. TsFile Structure
>
> TsFile is a columnar storage file format designed for time series
> data, which supports efficient compression and query. Data in a TsFile
> are organized by a device-measurement hierarchy. As the figure below
> shows, the storage unit of a device is a chunk group and that of a
> measurement is a chunk. Measurements of a device are grouped together.
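To make the chunk-group/chunk layout in the quoted section 2 concrete, here is a toy Python sketch. The data values, device/measurement names, and the helper function are all hypothetical illustrations, not the real TsFile format or API:

```python
# Toy model of the layout described above: a chunk holds one
# measurement's (time, value) points, and a chunk group holds all
# chunks of one device. All names and values here are made up.
tsfile = [
    {"device": "d1", "chunks": {"s1": [(1, 1.0), (2, 2.0)],
                                "s6": [(1, 0.5), (2, 3.0)]}},  # chunk group of d1
    {"device": "d2", "chunks": {"s1": [(1, 12.0), (2, 8.0)]}},  # chunk group of d2
]

def measurements(tsfile, device):
    """List the measurements stored in one device's chunk group."""
    for group in tsfile:
        if group["device"] == device:
            return sorted(group["chunks"])
    return []

print(measurements(tsfile, "d1"))  # ['s1', 's6']
```

The point of the sketch is that data for one device is physically grouped, so a query touching several devices must read several chunk groups, which may live in different file regions.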
>
> Under this architecture, different Spark partitions logically contain
> different sets of chunk groups of a TsFile. Now consider the query
> process of this TsFile on Spark. Suppose we query "select * from root
> where d1.s6 < 2.5 and d2.s1 > 10"; a scheduled partition task then has
> to read the whole data set to produce a correct answer. However, this
> also means that it is nearly impossible to apply the data locality
> principle.
> h1. 3. Problem
>
> Now we can summarize two problems.
>
> The first problem is how to guarantee the correctness of the query
> answer integrated from all the partition tasks without changing the
> current storage structure of TsFile. To solve this problem, we propose a
> solution that converts the space partition constraint into a time
> partition constraint, while still requiring a single task to have access
> to the whole TsFile data. As shown in the figure below, the task of
> partition 1 is assigned the yellow-marked time partition constraint; the
> task of partition 2 is assigned the green-marked time partition
> constraint; the task of partition 3 is assigned an empty time partition
> constraint because the former two tasks have already completed the query.
>
> The second problem is more fundamental: how can we adjust TsFile to
> enable some degree of data locality when it is queried on Spark?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
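The space-to-time conversion proposed in the quoted issue can be sketched as a toy Python example. This is only an illustration of the idea under assumed data; the row layout, time ranges, and `task` function are hypothetical and not the connector's real implementation:

```python
# Sketch of the "space partition -> time partition" idea: every task
# scans the whole (toy) data set, but only emits rows whose timestamp
# falls in its assigned time range. Because the time ranges are
# disjoint and cover the whole time axis, the union of the task
# outputs is the full answer with no duplicates and no misses.
rows = [  # (time, d1.s6, d2.s1) -- toy wide table for "select * from root"
    (1, 0.5, 12.0),
    (2, 3.0, 8.0),
    (3, 1.0, 20.0),
    (4, 2.0, 15.0),
]

def task(rows, time_range):
    """One partition task: full scan, filtered by its time constraint."""
    lo, hi = time_range
    return [r for r in rows
            if lo <= r[0] < hi              # time partition constraint
            and r[1] < 2.5 and r[2] > 10]   # predicate: d1.s6 < 2.5 and d2.s1 > 10

# Three disjoint half-open time ranges, one per partition task.
ranges = [(0, 2), (2, 4), (4, 5)]
answer = [r for tr in ranges for r in task(rows, tr)]
print(answer)  # [(1, 0.5, 12.0), (3, 1.0, 20.0), (4, 2.0, 15.0)]
```

Note that every task still reads all rows, which is exactly why this workaround preserves correctness but not data locality.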