Lei Rui created IOTDB-92:
----------------------------

             Summary: The data locality principle used by Spark loses ground in the face of TsFile.
                 Key: IOTDB-92
                 URL: https://issues.apache.org/jira/browse/IOTDB-92
             Project: Apache IoTDB
          Issue Type: Improvement
            Reporter: Lei Rui
In the development of the TsFile-Spark-Connector, we discovered that the data locality principle used by Spark loses ground in the face of TsFile. We believe the problem is rooted in the storage structure design of TsFile. Our latest implementation of the TsFile-Spark-Connector works around this constraint to guarantee correct query results; a real resolution of the data locality problem is left for future work. The details are below.

h1. 1. Spark Partition

In Apache Spark, data is stored in the form of RDDs and divided into partitions across various nodes. A partition is a logical chunk of a large distributed data set that helps parallelize distributed data processing. Spark works on the data locality principle to minimize the network traffic of sending data between executors.

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
https://techvidvan.com/tutorials/spark-partition/

h1. 2. TsFile Structure

TsFile is a columnar storage file format designed for time series data, which supports efficient compression and query. Data in a TsFile are organized by a device-measurement hierarchy. As the figure below shows, the storage unit of a device is a chunk group and that of a measurement is a chunk; the measurements of a device are grouped together. Under this architecture, different Spark partitions logically contain different sets of chunk groups of a TsFile.

Now consider the query process of this TsFile on Spark. Suppose we run the query "select * from root where d1.s6<2.5 and d2.s1>10". Because the predicate spans two devices whose chunk groups may fall in different partitions, the task scheduled for any single partition has to read the whole data to produce a correct answer. This, however, makes it nearly impossible to apply the data locality principle.

h1. 3. Problem

We can now summarize two problems. The first is how to guarantee the correctness of the query answer integrated from all the partition tasks, without changing the current storage structure of TsFile.
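To make the constraint concrete, here is a minimal sketch of why a cross-device filter defeats a per-chunk-group split. The in-memory layout, device names, and the `evaluate` helper are all hypothetical illustrations, not connector code:

```python
# Hypothetical in-memory model of a TsFile: one chunk group per device,
# each holding (timestamp, value) pairs per measurement.
tsfile = {
    "root.d1": {"s6": [(1, 2.0), (2, 3.0), (3, 1.5)]},
    "root.d2": {"s1": [(1, 20), (2, 5), (3, 30)]},
}

def evaluate(tsfile):
    """Evaluate 'select * from root where d1.s6 < 2.5 and d2.s1 > 10'.

    The predicate joins two devices on timestamp, so it needs the chunk
    groups of BOTH d1 and d2. A Spark partition that holds only one
    device's chunk group cannot evaluate this conjunction on its own.
    """
    s6 = dict(tsfile["root.d1"]["s6"])
    s1 = dict(tsfile["root.d2"]["s1"])
    # Timestamps present for both devices that satisfy both conditions.
    return sorted(t for t in s6.keys() & s1.keys()
                  if s6[t] < 2.5 and s1[t] > 10)

print(evaluate(tsfile))  # → [1, 3]
```

A partition holding only root.d1's chunk group has d1.s6 but not d2.s1, so it can neither accept nor reject any timestamp under this predicate; hence every task needs access to the whole file.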
To solve this problem, we propose converting the space partition constraint into a time partition constraint, while still requiring each task to have access to the whole TsFile data. As shown in the figure below, the task of partition 1 is assigned the yellow-marked time partition constraint; the task of partition 2 is assigned the green-marked time partition constraint; and the task of partition 3 is assigned an empty time partition constraint, because the former two tasks have already completed the query. Since the assigned time ranges are disjoint and together cover the whole time span, the union of the task outputs is exactly the full query answer, with no duplicates.

The second problem is more fundamental: how can we adjust TsFile to enable some degree of data locality when it is queried on Spark?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
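The time-partition assignment described in the issue can be sketched as follows. `assign_time_partitions` is a hypothetical helper, not part of the connector; it only illustrates splitting a time span into disjoint, exhaustive ranges, with surplus partitions receiving an empty constraint:

```python
def assign_time_partitions(min_time, max_time, num_partitions):
    """Split [min_time, max_time] into disjoint time ranges, one per
    Spark partition. Each task still reads the whole TsFile but only
    emits rows whose timestamps fall in its own range, so the union of
    all task outputs is the complete query answer with no duplicates.
    Partitions beyond the data span get None, i.e. an empty constraint.
    """
    span = max_time - min_time + 1
    size = max(1, -(-span // num_partitions))  # ceiling division
    ranges = []
    for i in range(num_partitions):
        lo = min_time + i * size
        hi = min(lo + size - 1, max_time)
        # A range starting past max_time matches nothing: empty constraint.
        ranges.append((lo, hi) if lo <= max_time else None)
    return ranges

# Three tasks over timestamps 1..4: the third task gets an empty
# constraint, mirroring partition 3 in the figure.
print(assign_time_partitions(1, 4, 3))  # → [(1, 2), (3, 4), None]
```

The ranges are pairwise disjoint and cover [min_time, max_time], which is exactly the property that makes the per-task answers safe to union.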