Now Jenkins CI is not stable (from Build #96 to #104). It is caused by one of your previous PRs about the TsFile-Spark-Connector (#108 or #173).
I notice that PR #177 is for fixing that. You can attach it to this JIRA
issue. Looking forward to the fix ASAP.
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

Lei Rui (JIRA) <j...@apache.org> wrote on Fri, May 17, 2019 at 12:21 AM:

> Lei Rui created IOTDB-92:
> ----------------------------
>
> Summary: The data locality principle used by Spark loses
> ground in the face of TsFile.
> Key: IOTDB-92
> URL: https://issues.apache.org/jira/browse/IOTDB-92
> Project: Apache IoTDB
> Issue Type: Improvement
> Reporter: Lei Rui
>
>
> In the development of the TsFile-Spark-Connector, we discovered that
> the data locality principle used by Spark loses ground in the face of
> TsFile. We believe the problem is rooted in the storage structure design
> of TsFile. Our latest implementation of the TsFile-Spark-Connector finds
> a way to guarantee correct functionality despite the constraint. The
> resolution of the data locality problem is left for future work. Below
> are the details.
> h1. 1. Spark Partition
>
> In Apache Spark, data is stored in the form of RDDs and divided into
> partitions across various nodes. A partition is a logical chunk of a
> large distributed data set that helps parallelize distributed data
> processing. Spark works on the data locality principle to minimize the
> network traffic of sending data between executors.
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
> https://techvidvan.com/tutorials/spark-partition/
> h1. 2. TsFile Structure
>
> TsFile is a columnar storage file format designed for time series
> data, which supports efficient compression and query. Data in a TsFile
> are organized by a device-measurement hierarchy. As the figure below
> shows, the storage unit of a device is a chunk group and that of a
> measurement is a chunk. Measurements of a device are grouped together.
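To make the chunk-group/chunk layout in the quoted section 2 concrete, here is a toy Python sketch. The data values, device/measurement names, and the helper function are all hypothetical illustrations, not the real TsFile format or API:

```python
# Toy model of the layout described above: a chunk holds one
# measurement's (time, value) points, and a chunk group holds all
# chunks of one device. All names and values here are made up.
tsfile = [
    {"device": "d1", "chunks": {"s1": [(1, 1.0), (2, 2.0)],
                                "s6": [(1, 0.5), (2, 3.0)]}},  # chunk group of d1
    {"device": "d2", "chunks": {"s1": [(1, 12.0), (2, 8.0)]}},  # chunk group of d2
]

def measurements(tsfile, device):
    """List the measurements stored in one device's chunk group."""
    for group in tsfile:
        if group["device"] == device:
            return sorted(group["chunks"])
    return []

print(measurements(tsfile, "d1"))  # ['s1', 's6']
```

The point of the sketch is that data for one device is physically grouped, so a query touching several devices must read several chunk groups, which may live in different file regions.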
>
> Under this architecture, different Spark partitions logically contain
> different sets of chunk groups of a TsFile. Now consider the query
> process of this TsFile on Spark. Suppose we query "select * from root
> where d1.s6 < 2.5 and d2.s1 > 10"; a scheduled partition task then has
> to read the whole data set to produce a correct answer. However, this
> also means that it is nearly impossible to apply the data locality
> principle.
> h1. 3. Problem
>
> Now we can summarize two problems.
>
> The first problem is how to guarantee the correctness of the query
> answer integrated from all the partition tasks without changing the
> current storage structure of TsFile. To solve this problem, we propose a
> solution that converts the space partition constraint into a time
> partition constraint, while still requiring a single task to have access
> to the whole TsFile data. As shown in the figure below, the task of
> partition 1 is assigned the yellow-marked time partition constraint; the
> task of partition 2 is assigned the green-marked time partition
> constraint; the task of partition 3 is assigned an empty time partition
> constraint because the former two tasks have already completed the query.
>
> The second problem is more fundamental: how can we adjust TsFile to
> enable some degree of data locality when it is queried on Spark?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
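The space-to-time conversion proposed in the quoted issue can be sketched as a toy Python example. This is only an illustration of the idea under assumed data; the row layout, time ranges, and `task` function are hypothetical and not the connector's real implementation:

```python
# Sketch of the "space partition -> time partition" idea: every task
# scans the whole (toy) data set, but only emits rows whose timestamp
# falls in its assigned time range. Because the time ranges are
# disjoint and cover the whole time axis, the union of the task
# outputs is the full answer with no duplicates and no misses.
rows = [  # (time, d1.s6, d2.s1) -- toy wide table for "select * from root"
    (1, 0.5, 12.0),
    (2, 3.0, 8.0),
    (3, 1.0, 20.0),
    (4, 2.0, 15.0),
]

def task(rows, time_range):
    """One partition task: full scan, filtered by its time constraint."""
    lo, hi = time_range
    return [r for r in rows
            if lo <= r[0] < hi              # time partition constraint
            and r[1] < 2.5 and r[2] > 10]   # predicate: d1.s6 < 2.5 and d2.s1 > 10

# Three disjoint half-open time ranges, one per partition task.
ranges = [(0, 2), (2, 4), (4, 5)]
answer = [r for tr in ranges for r in task(rows, tr)]
print(answer)  # [(1, 0.5, 12.0), (3, 1.0, 20.0), (4, 2.0, 15.0)]
```

Note that every task still reads all rows, which is exactly why this workaround preserves correctness but not data locality.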