Re: Operation and robustness of iotDB

Xiangdong Huang Thu, 07 Mar 2019 16:42:59 -0800

 > (We need to have a merge sort when querying more than one measurement)


Users do not need to care about that, because the IoTDB/TsFile APIs already
merge-sort the data for them.
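
To illustrate the idea only: the sketch below is not the IoTDB/TsFile
implementation. It assumes each measurement is a list of (timestamp, value)
pairs already sorted by time, and merges several of them into time-ordered
rows. The function name merge_by_timestamp and the sample data are
hypothetical.

```python
import heapq


def merge_by_timestamp(series):
    """Merge per-measurement (timestamp, value) lists into time-ordered rows.

    `series` maps a measurement name to a list of (timestamp, value) pairs
    that is already sorted by timestamp. A measurement without a value at a
    given timestamp gets None in that row.
    """
    names = list(series)
    # Tag each point with its column index so we know where it belongs.
    streams = [
        [(ts, idx, value) for ts, value in series[name]]
        for idx, name in enumerate(names)
    ]
    rows = []
    # heapq.merge performs a k-way merge of already sorted inputs.
    for ts, idx, value in heapq.merge(*streams, key=lambda point: point[0]):
        if not rows or rows[-1][0] != ts:
            rows.append([ts] + [None] * len(names))
        rows[-1][idx + 1] = value
    return names, [tuple(row) for row in rows]


# Hypothetical data: two sensors with different timestamps.
speed = [(1, 100.0), (2, 101.0), (3, 99.5)]
gps = [(2, "48.1N"), (4, "48.2N")]
names, rows = merge_by_timestamp({"speed": speed, "gps": gps})
for row in rows:
    print(row)
```

Each output row holds one timestamp followed by one value per measurement,
with None wherever a sensor has no point at that time.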

-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Xiangdong Huang <[email protected]> wrote on Fri, Mar 8, 2019 at 8:40 AM:

> Hi,
>
> Yes, every Chunk has its own timestamp column. We designed it like this
> because different sensors may have different frequencies. For example, the
> rotation speed of an engine may be collected 100 times per second, while
> the GPS info is collected once per second. Also, their start times may not
> be aligned.
>
> So, if you consider the data as a table (timestamp, device name, sensor1,
> sensor2, ...), then it is quite a sparse table. Parquet introduces
> definition and repetition levels to read data row by row, but we think that
> is not so natural for time series data.
>
> As a result, we store timestamp data with each column (I mean, Chunk).
> Experience shows that the disk overhead is small, and there are many
> advantages for queries. (We need to have a merge sort when querying more
> than one measurement).
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
>  黄向东
> 清华大学 软件学院
>
>
> Julian Feinauer <[email protected]> wrote on Fri, Mar 8, 2019 at 2:08 AM:
>
>> Hey Xu Yi,
>>
>> thanks for the information.
>> I checked the code and indeed I was wrong.
>> Every Chunk also stores its timestamp.
>>
>> So when I read values through a Query, all timestamps are "interpolated"
>> or merged together from all sensors, right?
>>
>> Julian
>>
>> On 07.03.19, 18:48, "Xu yi" <[email protected]> wrote:
>>
>>     Hi,
>>
>>     In my opinion, different measurements use their own timestamps even
>> though they are grouped into one chunk group. They don't share timestamps
>> with each other.
>>
>>     What do you think of this, @xiangdong?
>>
>>     Thanks
>>     XuYi
>>
>>     Sent from iPhone
>>
>>     On 2019/03/08 1:41, Julian Feinauer <[email protected]> wrote:
>>
>>     > Hi,
>>     >
>>     > Yes this is what I meant.
>>     >
>>     > Julian
>>     >
>>     > Sent from my mobile phone
>>     >
>>     >
>>     > -------- Original Message --------
>>     > Subject: Re: Operation and robustness of iotDB
>>     > From: 徐毅
>>     > To: [email protected]
>>     > Cc:
>>     >
>>     > Hi,
>>     > In the definition of ChunkGroup, what is the meaning of 'share one
>> time signal'? Do these measurements share the same timestamps?
>>     >
>>     >
>>     > Thanks
>>     > XuYi
>>     > On 3/8/2019 01:11, Julian Feinauer <[email protected]>
>> wrote:
>>     > Hey Xiangdong,
>>     > hey all,
>>     >
>>     > I like the documentation very much.
>>     > The only thing I'm a bit unsure about is the names (as there is no
>> clarification).
>>     > So, before I update it with any wrong information, I would like to
>> ensure that I have the correct understanding.
>>     >
>>     > I assume that most naming is similar to Parquet.
>>     >
>>     > Page - Contains one measurement, smallest unit of compression
>>     > Chunk - Collection of multiple Pages, still one measurement
>>     > ChunkGroup - Collection of Chunks which share one time signal
>> (one Chunk for each measurement)
>>     >
>>     > Is this correct?
>>     >
>>     > Julian
>>     >
>>     > On 05.03.19, 12:26, "Xiangdong Huang" <[email protected]> wrote:
>>     >
>>     > Hi,
>>     >
>>     > 1. We have a document to introduce that:
>>     > https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
>>     >
>>     > 2. The new API for recovering data is almost done. I am writing the
>>     > unit tests now. Maybe I can submit a PR tonight (if everything is
>>     > fine...)
>>     >
>>     > Best,
>>     > -----------------------------------
>>     > Xiangdong Huang
>>     > School of Software, Tsinghua University
>>     >
>>     > 黄向东
>>     > 清华大学 软件学院
>>     >
>>     >
>>     > Julian Feinauer <[email protected]> wrote on Tue, Mar 5, 2019 at 6:00 PM:
>>     >
>>     > Hi Xiangdong,
>>     >
>>     > that sounds excellent.
>>     > Do you have a short overview of how the file format is laid out on
>>     > disk?
>>     > I know that it's somewhat similar to Parquet, but I did not find more
>>     > details.
>>     > Basically, what would suffice for us would be something like skipping
>>     > an invalid column group (or whatever you call it) and going on with
>>     > the next one.
>>     >
>>     > Julian
>>     >
>>     > On 04.03.19, 13:21, "Xiangdong Huang" <[email protected]> wrote:
>>     >
>>     > Hi,
>>     >
>>     > If so, I think I need to add a new API that allows you to continue
>>     > writing data to an existing TsFile that was not closed correctly.
>>     > Then everything is fine for you :D
>>     >
>>     > Best,
>>     > -----------------------------------
>>     > Xiangdong Huang
>>     > School of Software, Tsinghua University
>>     >
>>     > 黄向东
>>     > 清华大学 软件学院
>>     >
>>     >
>>     > Julian Feinauer <[email protected]> wrote on Mon, Mar 4, 2019 at 8:08 PM:
>>     >
>>     > Hey Xiangdong,
>>     >
>>     > thanks for the great explanation.
>>     > And in fact, I agree with you that it would be best if we start to
>>     > play around with it and report all our findings or wishes back to
>>     > this list (in fact, that proved to be beneficial in plc4x as well).
>>     >
>>     > You confirm my thoughts about the two "levels" of APIs (DB and file),
>>     > and the file API is exactly what we were looking for for our use case,
>>     > as we do not care much about data loss (when an edge device fails,
>>     > it's... gone).
>>     > The crucial point for us is that no corrupt files can be generated.
>>     > This means I'm fine when the last data submitted is lost, but I'm not
>>     > fine if we can get into a situation where the last data file is
>>     > completely lost (well, perhaps this could be acceptable).
>>     >
>>     > @tim: Perhaps it's best if you give Xiangdong some more information
>>     > about our idea, and we can also point to our current code on GitHub.
>>     >
>>     > Julian
>>     >
>>     > On 04.03.19, 13:03, "Xiangdong Huang" <[email protected]> wrote:
>>     >
>>     > Hi,
>>     >
>>     > The TsFile API is not deprecated. In fact, it is designed for this
>>     > scenario and for MapReduce/Spark computing.
>>     >
>>     > If you just use the Reader and Writer API, there is something you
>>     > need to know.
>>     >
>>     > Let's suppose your block size is x bytes (tsfile-format.properties:
>>     > group_size_in_byte).
>>     >
>>     > 1. If you write data and a shutdown occurs, then all data that has
>>     > been flushed to disk is OK, and you can read it (the class
>>     > org.apache.iotdb.tsfile.TsFileSequenceRead is an example, but you
>>     > need to change it a little; I think I can write an example).
>>     >
>>     > 2. Actually, TsFile has the ability to let you continue writing data
>>     > at the end of the incomplete file. However, we do not provide this
>>     > API now... If needed, I can add the API.
>>     >
>>     > 3. In this scenario, you will lose at most x bytes of data. If you
>>     > cannot accept that, something like a WAL (write-ahead log) is needed.
>>     > (It is not very complex, but I am not sure whether it should be an
>>     > embedded function of TsFile.)
>>     >
>>     > So far, we can consider the TsFile API suitable for your scenario
>>     > (even though we need to add a little more API if you desire). You
>>     > also get the ability to compress data, and to query data from the
>>     > TsFile rather than scanning it from head to tail.
>>     >
>>     > However, TsFile has one constraint: you cannot write out-of-order
>>     > data into a TsFile, otherwise the query API may return incomplete
>>     > results. But I think that is OK for real applications, because I do
>>     > not think that a device can generate out-of-order data....
>>     >
>>     > For example, if you write two devices' data into one TsFile, it is
>>     > OK if you write data like:
>>     > - d1.t1, d1.t2, d2.t1, d2.t2, d2.t3, d1.t4, d1.t5 ....
>>     > or:
>>     > - d1.m1.t1, d1.m1.t2, d1.m2.t1, d1.m2.t2, d2.m1.t1 ...
>>     >
>>     > But you cannot write data like:
>>     > - d1.m1.t2, d1.m1.t1 ...
>>     >
>>     > I think it is a good chance to improve TsFile to make it more
>>     > suitable for real applications, so please do not hesitate to tell me
>>     > more about what you think TsFile should have.
>>     >
>>     > Best,
>>     > -----------------------------------
>>     > Xiangdong Huang
>>     > School of Software, Tsinghua University
>>     >
>>     > 黄向东
>>     > 清华大学 软件学院
>>     >
>>     >
>>     > Julian Feinauer <[email protected]> wrote on Mon, Mar 4, 2019 at 7:17 PM:
>>     >
>>     > Hi Xiangdong,
>>     >
>>     > thanks for the info.
>>     > How is it when you use the Reader / Writer API for the TsFiles
>>     > directly (or should this be considered "deprecated")?
>>     > Can these files end up in a corrupted state?
>>     >
>>     > One place where we have to deal with such situations is "at the
>>     > edge", when we have devices inside large machines.
>>     > Usually at the end of the shift these machines (and therefore our
>>     > devices) are powered off hard, so no shutdown or de-initialization is
>>     > possible.
>>     >
>>     > Best
>>     > Julian
>>     >
>>     > On 04.03.19, 12:14, "Xiangdong Huang" <[email protected]> wrote:
>>     >
>>     > Hi,
>>     >
>>     > IoTDB can run either on a server operating 24/7 or on a Raspberry Pi.
>>     > We have tested both scenarios.
>>     >
>>     > When you shut down an IoTDB instance by force (e.g., power off) and
>>     > restart it, no data is lost (if you enable the WAL).
>>     >
>>     > However, we have not yet optimized the time cost of the restart
>>     > process. It is an important feature that we need to work on, because
>>     > we hope IoTDB can support data management both on edge devices and in
>>     > the data center.
>>     >
>>     > Also, the default configuration is not well suited for running on an
>>     > edge device (e.g., the block size is 128MB, which is too large for a
>>     > Raspberry Pi and will slow down the restart process because there is
>>     > too much WAL data on disk).
>>     >
>>     > Best,
>>     > -----------------------------------
>>     > Xiangdong Huang
>>     > School of Software, Tsinghua University
>>     >
>>     > 黄向东
>>     > 清华大学 软件学院
>>     >
>>     >
>>     > Tim Mitsch <[email protected]> wrote on Mon, Mar 4, 2019 at 6:53 PM:
>>     >
>>     > Hello development-team
>>     >
>>     > First of all, thanks for developing this interesting project and
>>     > bringing it into the Apache Incubator.
>>     >
>>     > I have a question regarding the place of operation and robustness:
>>     >
>>     > *   Is iotDB designed as an application on a server which is running
>>     > 24/7, or
>>     > *   is it also possible to run it on a device like a Raspberry Pi or
>>     > an IPC, where operation can be interrupted?
>>     > I'm asking because I'm searching for a temporary storage solution
>>     > that is robust against spontaneous interruption, e.g. switching off
>>     > electricity without a regular shutdown of the OS. Have you tested
>>     > something like this yet?
>>     >
>>     > Best regards
>>     > Tim
>>     >
