Now Jenkins CI is not stable (from Build #96 to #104). It is caused by
one of your previous PRs about the TsFile-Spark-Connector (#108 or #173).
I notice that PR #177 is meant to fix that; you can attach it to this JIRA
issue.
Looking forward to the fix ASAP.
---
Xiangdong Huang
School of Software, Tsinghua University
Lei Rui (JIRA) wrote on Fri, May 17, 2019 at 12:21 AM:
> Lei Rui created IOTDB-92:
>
>
> Summary: The data locality principle used by Spark loses
> ground in the face of TsFile.
> Key: IOTDB-92
> URL: https://issues.apache.org/jira/browse/IOTDB-92
> Project: Apache IoTDB
> Issue Type: Improvement
> Reporter: Lei Rui
>
>
> In the development of the TsFile-Spark-Connector, we discovered that
> the data locality principle used by Spark loses ground in the face of
> TsFile. We believe the problem is rooted in the storage structure design of
> TsFile. Our latest implementation of the TsFile-Spark-Connector finds a way
> to guarantee correct functionality despite this constraint. The resolution
> of the data locality problem is left for future work. Below are the details.
> h1. 1. Spark Partition
>
> In Apache Spark, data is stored in the form of RDDs and divided
> into partitions across various nodes. A partition is a logical chunk of a
> large distributed data set that helps parallelize distributed data
> processing. Spark relies on the data locality principle to minimize the
> network traffic incurred by sending data between executors.
>
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
>
> https://techvidvan.com/tutorials/spark-partition/
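The partition-then-process model described above can be sketched in plain Python (not using Spark itself; the record layout and partition count are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (timestamp, value) pairs for a single series.
records = [(t, t * 0.5) for t in range(100)]

# Split the data into fixed-size partitions, as Spark splits an RDD.
def make_partitions(data, num_partitions):
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

partitions = make_partitions(records, 4)

# Each task processes only its own partition; ideally it runs on the
# node that stores that partition, which is what the data locality
# principle targets.
def task(partition):
    return sum(v for _, v in partition)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(task, partitions))

total = sum(partial_sums)
```

The point of the sketch is that each task touches only its own slice of the data; the sections below explain why TsFile breaks this assumption.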
> h1. 2. TsFile Structure
>
> TsFile is a columnar storage file format designed for time series
> data that supports efficient compression and querying. Data in a TsFile
> are organized in a device-measurement hierarchy. As the figure below shows,
> the storage unit of a device is a chunk group and that of a measurement is
> a chunk. The measurements of a device are grouped together.
>
> Under this architecture, different Spark partitions logically
> contain different sets of chunk groups of a TsFile. Now consider the query
> process of this TsFile on Spark. Suppose we run “select * from
> root where d1.s6<2.5 and d2.s1>10”: the task scheduled for each partition
> has to read the whole file to produce the correct answer. However, this also
> means that it is nearly impossible to apply the data locality principle.
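Why a single chunk group is not enough for such a query can be shown with a small Python model of the layout (the devices, measurements, and values here are hypothetical, chosen to match the example predicate):

```python
# Hypothetical in-memory model of the TsFile layout described above:
# one chunk group holds all the measurement chunks of one device.
tsfile = [
    # chunk group for device d1
    {"device": "d1",
     "chunks": {"s1": [(1, 12.0), (2, 9.5)],   # measurement s1 chunk
                "s6": [(1, 2.1), (2, 3.0)]}},  # measurement s6 chunk
    # chunk group for device d2
    {"device": "d2",
     "chunks": {"s1": [(1, 15.0), (2, 8.0)]}},
]

# "select * from root where d1.s6 < 2.5 and d2.s1 > 10":
# the predicate spans devices stored in *different* chunk groups,
# so a task holding only one chunk group cannot evaluate it alone.
def matching_timestamps(file):
    d1_s6 = {t for cg in file if cg["device"] == "d1"
             for t, v in cg["chunks"]["s6"] if v < 2.5}
    d2_s1 = {t for cg in file if cg["device"] == "d2"
             for t, v in cg["chunks"]["s1"] if v > 10}
    return sorted(d1_s6 & d2_s1)
```

Because the conjunction intersects timestamp sets from two devices, any task evaluating it needs both chunk groups, i.e. effectively the whole file.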
> h1. 3. Problem
>
> Now we can summarize two problems.
>
> The first problem is how to guarantee the correctness of the answer
> integrated from all the partition tasks without changing the current
> storage structure of TsFile. To solve this problem, we propose a solution
> that converts the space partition constraint into a time partition
> constraint, while still requiring a single task to have access to the whole
> TsFile data. As shown in the figure below, the task of partition 1 is
> assigned the yellow-marked time partition constraint; the task of partition
> 2 is assigned the green-marked time partition constraint; the task of
> partition 3 is assigned an empty time partition constraint because the
> former two tasks have already covered the query.
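The space-to-time conversion described above can be sketched in Python (the row set, the number of tasks, and the concrete time ranges are hypothetical, stand-ins for the yellow/green/empty assignment in the figure):

```python
# Hypothetical full answer: rows keyed by timestamps 0..99.
all_rows = list(range(100))

# Disjoint time ranges assigned to the three partition tasks;
# None models the empty constraint given to partition 3.
time_constraints = [(0, 50), (50, 100), None]

def run_task(constraint):
    # Every task has access to the whole file, but it only emits
    # rows whose timestamps fall inside its assigned time range.
    if constraint is None:
        return []
    lo, hi = constraint
    return [t for t in all_rows if lo <= t < hi]

results = [run_task(c) for c in time_constraints]

# Because the time ranges are disjoint and together cover the whole
# timestamp domain, the merged result equals the full answer with no
# duplicates, regardless of how chunk groups were split across nodes.
merged = sorted(row for part in results for row in part)
```

Correctness thus comes from the disjoint, covering time ranges rather than from how the file bytes are partitioned, which is exactly the trade-off: every task still reads the whole file, so locality is not restored.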
>
> The second problem is more fundamental: how to adapt the design so
> that some degree of data locality can be exploited when a TsFile is
> queried on Spark.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>