[jira] [Created] (IOTDB-92) The data locality principle used by Spark loses ground in the face of TsFile.

2019-05-16 Thread Lei Rui (JIRA)
Lei Rui created IOTDB-92:


 Summary: The data locality principle used by Spark loses ground in 
the face of TsFile.
 Key: IOTDB-92
 URL: https://issues.apache.org/jira/browse/IOTDB-92
 Project: Apache IoTDB
  Issue Type: Improvement
Reporter: Lei Rui


    In the development of the TsFile-Spark-Connector, we discovered that the
data locality principle used by Spark loses ground in the face of TsFile. We
believe the problem is rooted in the storage structure design of TsFile. Our
latest implementation of the TsFile-Spark-Connector finds a way to guarantee
correct query results despite this constraint; the resolution of the data
locality problem is left for future work. Below are the details.
h1. 1. Spark Partition

    In Apache Spark, data are stored in the form of RDDs and divided into
partitions across various nodes. A partition is a logical chunk of a large
distributed data set that helps parallelize distributed data processing. Spark
applies the data locality principle to minimize the network traffic incurred by
shipping data between executors.

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html

https://techvidvan.com/tutorials/spark-partition/
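
    To make this concrete, below is a minimal Spark sketch (the input path and
the local master setting are placeholders, not anything from this issue) that
prints how many partitions an input is split into and how many records each
partition holds. Each file split typically becomes one partition, and Spark
prefers to schedule the corresponding task on a node that already stores that
split.

{code:scala}
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[4]") // placeholder; on a cluster the master is set by spark-submit
      .getOrCreate()

    // Each HDFS block (or local file split) typically becomes one partition,
    // and Spark tries to run a partition's task on a node holding its data.
    val lines = spark.sparkContext.textFile("hdfs:///data/example.txt") // placeholder path

    println(s"number of partitions: ${lines.getNumPartitions}")

    // Show how many records each partition holds.
    lines
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
      .collect()
      .foreach { case (idx, count) => println(s"partition $idx holds $count records") }

    spark.stop()
  }
}
{code}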
h1. 2. TsFile Structure

    TsFile is a columnar storage file format designed for time series data that
supports efficient compression and querying. Data in a TsFile are organized in
a device-measurement hierarchy. As the figure below shows, the storage unit of
a device is a chunk group and that of a measurement is a chunk; all
measurements of a device are grouped together.
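
    For illustration, the layout described above can be modeled roughly as
follows. The class names are hypothetical and are not the real TsFile classes;
the sketch only mirrors the hierarchy: a chunk holds one measurement's data, a
chunk group bundles all chunks of one device, and a file is a sequence of chunk
groups.

{code:scala}
// Toy model of the TsFile layout (hypothetical names, values simplified to Double).
final case class Chunk(measurementId: String, timestamps: Array[Long], values: Array[Double])
final case class ChunkGroup(deviceId: String, chunks: Seq[Chunk])
final case class TsFileLayout(chunkGroups: Seq[ChunkGroup])
{code}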

    Under this architecture, different Spark partitions logically contain
different sets of chunk groups of a TsFile. Now consider querying this TsFile
on Spark. Suppose we issue “select * from root where d1.s6<2.5 and d2.s1>10”:
because the predicates touch measurements of two different devices, whose chunk
groups may sit in different partitions, the task scheduled for any single
partition has to read the whole file's data to produce a correct answer. This,
however, makes it nearly impossible to apply the data locality principle.
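
    For reference, below is a hedged sketch of issuing the example query
through the connector. The data source name, the file path, and the exact
column names are assumptions made only for illustration; the connector's
documentation gives the real identifiers.

{code:scala}
import org.apache.spark.sql.SparkSession

object TsFileQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tsfile-query-sketch")
      .master("local[*]") // placeholder
      .getOrCreate()

    val df = spark.read
      .format("org.apache.iotdb.tsfile") // assumed data source name
      .load("hdfs:///data/test.tsfile")  // placeholder path

    df.createOrReplaceTempView("tsfile_table")

    // The two predicates touch measurements of different devices (d1 and d2),
    // so a task cannot answer the query from one device's chunk groups alone.
    spark.sql(
      """SELECT *
        |FROM tsfile_table
        |WHERE `d1.s6` < 2.5 AND `d2.s1` > 10""".stripMargin
    ).show()

    spark.stop()
  }
}
{code}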
h1. 3. Problem

Now we can summarize two problems. 

The first problem is how to guarantee the correctness of the query answer
integrated from all the partition tasks without changing the current storage
structure of TsFile. To solve it, we propose converting the space partition
constraint into a time partition constraint, while still requiring a single
task to have access to the whole TsFile data (a sketch of the idea follows
below). As shown in the figure below, the task of partition 1 is assigned the
yellow-marked time partition constraint; the task of partition 2 is assigned
the green-marked time partition constraint; the task of partition 3 is assigned
an empty time partition constraint because the former two tasks have already
completed the query.
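
A minimal sketch of this idea is given below, with all names hypothetical. It
only illustrates the disjointness requirement by splitting the file's global
time span evenly; in the connector the per-task constraints are assigned as in
the figure above (partition 3, for instance, receives an empty constraint), not
necessarily by an even split.

{code:scala}
// Hypothetical helper: split the file's global time span into disjoint,
// half-open ranges, one per Spark partition.
final case class TimeRange(startInclusive: Long, endExclusive: Long)

object TimePartitioning {
  def assignTimeRanges(globalStart: Long, globalEnd: Long, numPartitions: Int): Seq[TimeRange] = {
    require(numPartitions > 0 && globalEnd >= globalStart)
    val span = globalEnd - globalStart
    (0 until numPartitions).map { i =>
      val lo = globalStart + span * i / numPartitions
      val hi = globalStart + span * (i + 1) / numPartitions
      TimeRange(lo, hi)
    }
  }
}
{code}

Each task then evaluates the original predicate over the whole file but emits
only the rows whose timestamps fall inside its own range, so the union of the
per-task results is the complete answer without duplicates.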

The second problem is more fundamental: how can we adjust the design so that
querying a TsFile on Spark can exploit some degree of data locality?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [jira] [Created] (IOTDB-92) The data locality principle used by Spark loses ground in the face of TsFile.

2019-05-16 Thread Xiangdong Huang
Jenkins CI is not stable now (from Build #96 to #104)... It is caused by one of
your previous PRs about the TsFile-Spark-Connector (#108 or #173).

I notice that PR #177 is meant to fix that. You can attach it to this JIRA
issue.

Looking forward to fixing it ASAP.

---
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院

