Re: [Discuss] How to delivery the device concept to users

Xiangdong Huang Tue, 21 Jul 2020 05:55:01 -0700

Ah, don't we have that already? We have had the time partition folder since v0.10.

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University


 黄向东
清华大学 软件学院


Jialin Qiao <[email protected]> wrote on Tue, Jul 21, 2020 at 8:37 PM:

> Hi,
>
> This is not the current implementation... We do not have a partition
> folder on disk now.
> With a partition folder, there would be no need to keep all
> TsFileResources in memory, and the device index would not hurt us.
>
> Thanks,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
> 乔嘉林
> 清华大学 软件学院
>
> > -----Original Message-----
> > From: "Xiangdong Huang" <[email protected]>
> > Sent: 2020-07-21 18:46:31 (Tuesday)
> > To: dev <[email protected]>
> > Cc:
> > Subject: Re: [Discuss] How to delivery the device concept to users
> >
> > Hi Jialin,
> >
> > Yes, that is the current logic. But I do not see how what you said
> > relates to this discussion...
> >
> > Best,
> > -----------------------------------
> > Xiangdong Huang
> > School of Software, Tsinghua University
> >
> >  黄向东
> > 清华大学 软件学院
> >
> >
> > Jialin Qiao <[email protected]> wrote on Tue, Jul 21, 2020 at 4:47 PM:
> >
> > > Hi,
> > >
> > > I would like to share a vision for managing the data files according to
> > > time partition.
> > >
> > > After we introduced the time partition (data is partitioned by time
> > > interval), we do split the data in memory and into different TsFiles. But
> > > we may lack a partition folder layer on top of the TsFiles.
> > >
> > > Maybe it should work as follows:
> > >
> > > E.g., we insert data into storage group root.sg from 2020-07-19 to
> > > 2020-07-21 and the partition interval is 1 day.
> > > First, we create three folders (2020-07-19, 2020-07-20, 2020-07-21) under
> > > root.sg, one for each partition.
> > > Then, we store each TsFile in its related partition folder.
> > >
> > > An example of TsFiles on disk is as follows:
> > >
> > > sequence
> > > ├── root.sg
> > > │   ├── 2020-07-19
> > > │   │   └── timestamp1-version1-merge.tsfile
> > > │   │   └── timestamp1-version1-merge.tsfile.resource
> > > │   │   └── ...
> > > │   ├── 2020-07-20
> > > │   ├── 2020-07-21
> > >
> > >
> > > unsequence (similar to the sequence folder)
> > > ├── root.sg
> > > │   ├── 2020-07-19
> > > │   ├── 2020-07-20
> > > │   ├── 2020-07-21
> > >
> > > We only need to keep the full list of partition folders in memory as a
> > > List<String>; this memory consumption is negligible.
> > >
> > > For the hot partitions, e.g., the most recent 10 days' partitions, we could
> > > cache their TsFileResources in memory to accelerate queries.
> > >
> > > Then, how to do a query?
> > >
> > > Suppose we receive a query: select * from root.sg where time >=
> > > 2020-07-20 and time <= 2020-07-21
> > >
> > > - We could locate two partitions under root.sg that may contain the
> > > results: 2020-07-20, 2020-07-21
> > > - Then we traverse each partition folder to get all TsFileResources in
> > > that partition.
> > > - Finally, we run the query.
> > >
> > > Is this feasible?
> > >
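The lookup steps above could be sketched roughly as follows. This is only a minimal illustration, not IoTDB code: the class `PartitionPruning` and the method `selectPartitions` are hypothetical names, and the sketch assumes one folder per day named by its ISO date, as in the example layout.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not part of IoTDB.
class PartitionPruning {

    // Assumes each partition folder is named by its ISO date (one folder per day).
    // Returns the folders whose day falls inside [queryStart, queryEnd], inclusive.
    static List<String> selectPartitions(
            List<String> partitionFolders, LocalDate queryStart, LocalDate queryEnd) {
        List<String> selected = new ArrayList<>();
        for (String folder : partitionFolders) {
            LocalDate day = LocalDate.parse(folder);
            if (!day.isBefore(queryStart) && !day.isAfter(queryEnd)) {
                selected.add(folder);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // The folder list is the only per-storage-group state held in memory here.
        List<String> folders = Arrays.asList("2020-07-19", "2020-07-20", "2020-07-21");
        List<String> hits = selectPartitions(
                folders, LocalDate.parse("2020-07-20"), LocalDate.parse("2020-07-21"));
        System.out.println(hits); // prints [2020-07-20, 2020-07-21]
    }
}
```

Only the selected folders would then be traversed to load their TsFileResources; a cache for hot partitions could sit in front of that traversal.
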
> > > Thanks,
> > > --
> > > Jialin Qiao
> > > School of Software, Tsinghua University
> > >
> > > 乔嘉林
> > > 清华大学 软件学院
> > >
> > > > -----Original Message-----
> > > > From: "Xiangdong Huang" <[email protected]>
> > > > Sent: 2020-07-20 20:03:34 (Monday)
> > > > To: dev <[email protected]>
> > > > Cc:
> > > > Subject: Re: Re: [Discuss] How to delivery the device concept to users
> > > >
> > > > Hi,
> > > >
> > > > > I wonder whether we could index the file by its name. (naming the
> > > > > tsfile by date)
> > > >
> > > > I think it is a good idea, but maybe not very easy to implement. If we
> > > > can organize the data like this, then it is very regular, and it is very
> > > > easy to access or delete expired data...
> > > >
> > > > > we would need is a tree structure where each node has start time /
> > > > > end time for "everything" in the file.
> > > >
> > > > This is also a good idea.
> > > >
> > > > When we discuss the granularity of "device", what we are actually
> > > > worrying about is the size of the index.
> > > > So, we do not care whether there is a so-called "sub device"; we just
> > > > care how many entities will be indexed.
> > > >
> > > > Suppose an IoTDB instance can bear 1 million index entries <some_id ->
> > > > (start time, end time)>, and given a tree schema, if there are about 1
> > > > million nodes from level 0 to level 3, then we can index the nodes on
> > > > level 3 (so level 3 is the so-called "device" in the current version).
> > > >
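One way to read the level-selection idea above, as a rough sketch: add up the node counts level by level and stop at the deepest level that still fits the entry budget. The class `IndexLevelChooser`, the method `deepestIndexableLevel`, and the per-level node counts are all hypothetical, made up for illustration.

```java
// Hypothetical sketch: not part of IoTDB.
class IndexLevelChooser {

    // Returns the deepest level L such that the total number of nodes on
    // levels 0..L does not exceed the budget, or -1 if even level 0 is too large.
    static int deepestIndexableLevel(long[] nodesPerLevel, long budget) {
        long total = 0;
        int chosen = -1;
        for (int level = 0; level < nodesPerLevel.length; level++) {
            total += nodesPerLevel[level];
            if (total > budget) {
                break;
            }
            chosen = level;
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Made-up schema: levels 0..3 hold about 0.9 million nodes, level 4 far more.
        long[] nodesPerLevel = {1, 10, 1_000, 900_000, 50_000_000};
        System.out.println(deepestIndexableLevel(nodesPerLevel, 1_000_000)); // prints 3
    }
}
```
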
> > > > Meanwhile, indexing the nodes from level 0 to level 2, as Julian
> > > > proposed, is also beneficial.
> > > >
> > > > The nature of the above idea is letting IoTDB decide which nodes are
> > > > "devices" automatically.
> > > >
> > > > At the beginning of this discussion, I just wanted to let the user claim
> > > > which nodes are "devices" (or, which prefixes of Paths have time
> > > > indexes... but this kind of description may not be user friendly). That
> > > > is easier, but may carry risk if the user sets too many devices.
> > > >
> > > > Best,
> > > > -----------------------------------
> > > > Xiangdong Huang
> > > > School of Software, Tsinghua University
> > > >
> > > >  黄向东
> > > > 清华大学 软件学院
> > > >
> > > >
> > > > [email protected] <[email protected]> wrote on Mon, Jul 20, 2020 at 7:47 PM:
> > > >
> > > > > Hi,
> > > > >
> > > > > > I wonder whether we could index the file by its name. (naming the
> > > > > > tsfile by date) E.g., we store each day's data in one file and name
> > > > > > it as sg-2020-07-20.TsFile. Then, we do not need to maintain the
> > > > > > index in memory, we just need to check whether the file exists in
> > > > > > the queried interval.
> > > > >
> > > > > So, how do we deal with out-of-order data? Could you give more
> > > > > details?
> > > > >
> > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > [email protected]
> > > > >
> > > > >
> > > > > From: Jialin Qiao
> > > > > Date: 2020-07-20 18:21
> > > > > To: dev
> > > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > > > Hi,
> > > > >
> > > > > > The question I would ask is why "devices" hurt us.
> > > > >
> > > > > I'd like to introduce this a bit. For each storage group, we flush the
> > > > > memtable into TsFiles one by one. For each TsFile, we maintain a
> > > > > temporal index at the device level in memory. Suppose there are 3
> > > > > devices in one TsFile; the index looks like this:
> > > > >
> > > > > start time array: long[3] = {1, 1, 2}
> > > > > end time array: long[3] = {5, 6, 10}
> > > > > devicesToIndexInArray: Map<String, Integer> = {"root.sg.d1" -> 0,
> > > > > "root.sg.d2" -> 1, "root.sg.d3" -> 2}
> > > > >
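To make the index above concrete, here is a minimal sketch of how such a per-TsFile structure could answer "may this file contain data of device X in this time range?". The class `DeviceTimeIndex` and its methods are hypothetical names for illustration; only the three fields mirror the arrays and map in the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-TsFile, device-level time index; not IoTDB code.
class DeviceTimeIndex {
    private final long[] startTimes;
    private final long[] endTimes;
    private final Map<String, Integer> devicesToIndexInArray;

    DeviceTimeIndex(
            long[] startTimes, long[] endTimes, Map<String, Integer> devicesToIndexInArray) {
        this.startTimes = startTimes;
        this.endTimes = endTimes;
        this.devicesToIndexInArray = devicesToIndexInArray;
    }

    // True if this file may hold data of the device within [queryStart, queryEnd], inclusive.
    boolean mayContain(String device, long queryStart, long queryEnd) {
        Integer slot = devicesToIndexInArray.get(device);
        if (slot == null) {
            return false; // the device has no data in this file
        }
        return startTimes[slot] <= queryEnd && endTimes[slot] >= queryStart;
    }

    // Builds the three-device example from the message.
    static DeviceTimeIndex example() {
        Map<String, Integer> slots = new HashMap<>();
        slots.put("root.sg.d1", 0);
        slots.put("root.sg.d2", 1);
        slots.put("root.sg.d3", 2);
        return new DeviceTimeIndex(new long[] {1, 1, 2}, new long[] {5, 6, 10}, slots);
    }

    public static void main(String[] args) {
        DeviceTimeIndex index = example();
        System.out.println(index.mayContain("root.sg.d1", 6, 8)); // false: d1 ends at 5
        System.out.println(index.mayContain("root.sg.d3", 6, 8)); // true: d3 spans 2..10
    }
}
```

Each device costs two long values (16 bytes) in the arrays plus a map entry with its String key, which is why the index grows with the number of devices in every TsFile.
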
> > > > > If we have millions of devices, for each TsFile, this index will reach
> > > > > dozens of MB in memory. Although we could persist the index, it is
> > > > > still recommended to decrease the number of devices.
> > > > >
> > > > > I wonder whether we could index the file by its name (naming the
> > > > > tsfile by date). E.g., we store each day's data in one file and name it
> > > > > sg-2020-07-20.TsFile. Then, we do not need to maintain the index in
> > > > > memory; we just need to check whether the file exists in the queried
> > > > > interval.
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > Jialin Qiao
> > > > > School of Software, Tsinghua University
> > > > >
> > > > > 乔嘉林
> > > > > 清华大学 软件学院
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: "Julian Feinauer" <[email protected]>
> > > > > > Sent: 2020-07-20 17:34:40 (Monday)
> > > > > > To: "[email protected]" <[email protected]>
> > > > > > Cc:
> > > > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > > > >
> > > > > > Hey Jialin, Xiangdong,
> > > > > >
> > > > > > very good question!
> > > > > >
> > > > > > And I tend to agree with Xiangdong.
> > > > > > If the users do it that way it probably makes most sense for them.
> > > > > > The question I would ask is why "devices" hurt us (I know a bit about
> > > > > > the implementation of course but probably we have to adapt our data
> > > > > > model also a bit in the future).
> > > > > >
> > > > > > Generally speaking, for me it also makes sense to be allowed to have
> > > > > > "subcategories" below my devices as my devices usually are "big".
> > > > > > And technically speaking, in the current version it is totally
> > > > > > possible to have nested structures below devices or measurements (but
> > > > > > these will then again be devices).
> > > > > >
> > > > > > So my question is:
> > > > > > - Do we really need the static construct of a "device" or can we
> > > > > > probably use a different data structure where I "select" my device
> > > > > > only at query time and we just select everything under that tree as
> > > > > > its measurements or "sub-measurements" in cases of nesting?
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > Julian
> > > > > >
> > > > > > On 20.07.20, 09:34, "Xiangdong Huang" <[email protected]> wrote:
> > > > > >
> > > > > >     Hi,
> > > > > >
> > > > > >     This is a quite good topic!
> > > > > >
> > > > > >     1. maybe we should hear more users' opinions.
> > > > > >
> > > > > >     For me, I think emphasizing the concept of "device" is good. We
> > > > > >     can even expose the concept in our APIs.
> > > > > >
> > > > > >     2.
> > > > > >
> > > > > >     > A more efficient way is
> > > > > >     > root.sg.device1.measurement1_int0
> > > > > >     > root.sg.device1.measurement1_int1
> > > > > >     >  root.sg.device1.measurement1_int2
> > > > > >     > root.sg.device1.measurement2_long
> > > > > >
> > > > > >     I think the more efficient way is:
> > > > > >
> > > > > >     root.sg.device1.measurement1.0
> > > > > >     root.sg.device1.measurement1.1
> > > > > >     root.sg.device1.measurement1.2
> > > > > >     root.sg.device1.measurement2
> > > > > >
> > > > > >     And, as you said "a device has a sensor that collects some data
> > > > > >     in array format (int[3]) and some in long type",
> > > > > >     will the user query just one element from the int[3]? If not, a
> > > > > >     better schema is:
> > > > > >
> > > > > >     root.sg.device1.measurement1 (the dataType is int[])
> > > > > >     root.sg.device1.measurement2 (the dataType is long)
> > > > > >
> > > > > >     Best,
> > > > > >     -----------------------------------
> > > > > >     Xiangdong Huang
> > > > > >     School of Software, Tsinghua University
> > > > > >
> > > > > >      黄向东
> > > > > >     清华大学 软件学院
> > > > > >
> > > > > >
> > > > > >     Jialin Qiao <[email protected]> wrote on Mon, Jul 20, 2020 at 3:28 PM:
> > > > > >
> > > > > >     > Hi
> > > > > >     >
> > > > > >     > Recently, I found that some users create timeseries that do not
> > > > > >     > follow the real-world semantics of a device.
> > > > > >     >
> > > > > >     >
> > > > > >     > E.g., a device has a sensor that collects some data in array
> > > > > >     > format (int[3]) and some in long type.
> > > > > >     >
> > > > > >     >
> > > > > >     > Many users will create timeseries like this:
> > > > > >     >
> > > > > >     >
> > > > > >     > root.sg.device1.measurement1.int0
> > > > > >     > root.sg.device1.measurement1.int1
> > > > > >     > root.sg.device1.measurement1.int2
> > > > > >     > root.sg.device1.measurement2.long
> > > > > >     >
> > > > > >     >
> > > > > >     > As a consequence, there will be two devices instead of one
> > > > > >     > device. This makes the actual number of devices much larger than
> > > > > >     > the number of real devices the users have in mind. The drawback
> > > > > >     > is: more devices lead to more memory consumption.
> > > > > >     >
> > > > > >     >
> > > > > >     > A more efficient way is
> > > > > >     >
> > > > > >     >
> > > > > >     > root.sg.device1.measurement1_int0
> > > > > >     > root.sg.device1.measurement1_int1
> > > > > >     > root.sg.device1.measurement1_int2
> > > > > >     > root.sg.device1.measurement2_long
> > > > > >     >
> > > > > >     >
> > > > > >     > In this schema, there will be only one device and 4
> > > > > >     > measurements.
> > > > > >     >
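The difference between the two schemas can be checked with a small sketch. It assumes, as the two examples in this message imply, that the device id is the path with its last node (the measurement) removed; the class `DeviceExtraction` and its methods are hypothetical names, not IoTDB's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: not part of IoTDB.
class DeviceExtraction {

    // Assumption: the device id is everything before the last '.' of the path.
    static String deviceOf(String path) {
        int lastDot = path.lastIndexOf('.');
        return lastDot < 0 ? path : path.substring(0, lastDot);
    }

    // Collects the distinct device ids implied by a list of timeseries paths.
    static Set<String> devicesOf(List<String> paths) {
        Set<String> devices = new TreeSet<>();
        for (String path : paths) {
            devices.add(deviceOf(path));
        }
        return devices;
    }

    public static void main(String[] args) {
        List<String> nested = Arrays.asList(
                "root.sg.device1.measurement1.int0",
                "root.sg.device1.measurement1.int1",
                "root.sg.device1.measurement1.int2",
                "root.sg.device1.measurement2.long");
        List<String> flat = Arrays.asList(
                "root.sg.device1.measurement1_int0",
                "root.sg.device1.measurement1_int1",
                "root.sg.device1.measurement1_int2",
                "root.sg.device1.measurement2_long");
        // prints [root.sg.device1.measurement1, root.sg.device1.measurement2]
        System.out.println(devicesOf(nested));
        // prints [root.sg.device1]
        System.out.println(devicesOf(flat));
    }
}
```
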
> > > > > >     >
> > > > > >     > The problem is that we extract the device id automatically.
> > > > > >     > Users usually do not have a clear concept of "device". Should we
> > > > > >     > emphasize the concept of device by letting users create devices
> > > > > >     > manually?
> > > > > >     >
> > > > > >     >
> > > > > >     > What do you think?
> > > > > >     >
> > > > > >     > Thanks,
> > > > > >     > --
> > > > > >     > Jialin Qiao
> > > > > >     > School of Software, Tsinghua University
> > > > > >     >
> > > > > >     > 乔嘉林
> > > > > >     > 清华大学 软件学院
> > > > > >
> > > > >
> > >
>
