Re: Introducing a memory control mechanism during the query planning stage #12573

2024-05-22 Thread William Song
Hi Lanyu,

Great work! I learned a lot from your PR.

Best
Ziyang

> On May 23, 2024, at 11:08, Liao Lanyu <1435078...@qq.com.INVALID> wrote:
> 
> Hi,
> Currently, the IoTDB query engine does not implement memory control at the FE 
> (Frontend) stage. In scenarios with massive series queries (e.g., select * 
> from root.**), the query plans generated at the FE stage can become 
> excessively large. A rough estimate puts the size of a single SeriesScanNode 
> at about 0.5 KB, so two million series, corresponding to two million 
> SeriesScanNodes, would occupy about 1 GB, posing a potential risk of 
> Out-Of-Memory (OOM) errors. In high-concurrency scenarios, even if no single 
> query plan is large, the total memory occupied by many concurrent query plans 
> can still lead to OOM.
> Therefore, we would like to introduce memory size control for FE query plans 
> within the query engine.
> The PR is: https://github.com/apache/iotdb/pull/12573
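A minimal sketch of the kind of FE-stage accounting being proposed, under the 0.5 KB-per-node estimate above. All class and method names here are hypothetical; the actual mechanism is in the PR.

```java
// Sketch of FE-stage plan memory accounting, assuming ~0.5 KB per
// SeriesScanNode as estimated above. Names are hypothetical, not IoTDB APIs.
public class PlanMemoryBudgetSketch {
    static final long BYTES_PER_SERIES_SCAN_NODE = 512; // rough estimate: 0.5 KB

    private final long budgetBytes;
    private long reservedBytes;

    PlanMemoryBudgetSketch(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    // Reserve memory for a plan before materializing its scan nodes,
    // rejecting the query instead of risking an OOM.
    synchronized boolean tryReserve(long seriesCount) {
        long needed = seriesCount * BYTES_PER_SERIES_SCAN_NODE;
        if (reservedBytes + needed > budgetBytes) {
            return false; // over budget: caller should fail the query gracefully
        }
        reservedBytes += needed;
        return true;
    }

    // Release the reservation once the plan is dispatched or finished.
    synchronized void release(long seriesCount) {
        reservedBytes -= seriesCount * BYTES_PER_SERIES_SCAN_NODE;
    }

    public static void main(String[] args) {
        // 1 GiB budget: two million series (~1 GB of SeriesScanNodes) fits
        // once, so a second identical concurrent plan must be rejected.
        PlanMemoryBudgetSketch budget = new PlanMemoryBudgetSketch(1L << 30);
        System.out.println(budget.tryReserve(2_000_000)); // fits
        System.out.println(budget.tryReserve(2_000_000)); // rejected
    }
}
```

The same budget covers the high-concurrency case: many small plans exhaust the reservation just as one huge plan does.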



[DISCUSS] Enable auto balance for schemaregion

2024-01-02 Thread William Song
Hi Dev Team,

I'm writing to discuss the auto-balancing feature in IoTDB, which is currently 
operational for data regions. This feature is designed to optimize resource use, 
maximize throughput, minimize response time, and prevent overloading of any 
individual resource. 

However, it is presently inactive by default for the schema region. The initial 
rationale for this decision was the instability observed in the underlying 
consensus layer (Ratis) during leader transitions. In scenarios where a leader 
election failed, the previous leader was forced to step down while no new 
leader was elected, which could lead to temporary unavailability of the schema 
region.

Encouragingly, with the upgrade to Ratis 3.0.0, the Ratis community introduced 
a notable enhancement to the leader transition process. Version 3.0.0 
facilitates smoother transitions between nodes, and we have successfully 
incorporated it, as detailed in our recent PR: 

https://github.com/apache/iotdb/pull/11785
Given these advancements, I propose we revisit our current settings for the 
schema region. Specifically, I recommend enabling the auto balance feature by 
default for the schema region. What do you think?
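For reference, the proposed change would presumably be a one-line default flip in the common configuration. The parameter name below is my recollection and should be checked against the current iotdb-common.properties before relying on it:

```properties
# Hypothetical default change (parameter name to be verified): let the
# ConfigNode rebalance schema-region leaders automatically, as it already
# does for data regions.
enable_auto_leader_balance_for_ratis_consensus=true
```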

Best regards,

Ziyang

Celebrating the Release of Ratis 3.0.0 and Our Community's Contributions

2023-12-26 Thread William Song
Dear Community Members,

I am excited to announce the release of Ratis version 3.0.0, marking the first 
major version update since IoTDB began using Ratis for consensus services. 
This release encompasses a multitude of new features, enhancements, and fixes, 
all of which reflect the dedicated efforts of our IoTDB community. Reflecting 
on the past year, it is remarkable to look back on the journey that unfolded 
after several key members (Mr. Huang, Mr. Qiao, and Xinyu Tan) made the 
pivotal decision to adopt Ratis for data replication.

I wish to extend my gratitude to each one of you for your contributions, 
whether direct or indirect. Our collective success is a testament to the 
strength of our community spirit, resonating perfectly with the Apache Software 
Foundation's ethos: “Community Over Code.” 

In the upcoming period, we plan to integrate these features and improvements 
into the master branch. Each addition will undergo backtesting and validation 
to ensure optimal performance and reliability. Your feedback and insights are 
invaluable to us, and I eagerly look forward to your thoughts and suggestions.

Thank you once again for your support and dedication. Together, we are shaping 
a brighter future for our project.



Warm regards,
William

Re: Ratis SNAPSHOT versions in our latest release ...

2023-09-13 Thread William Song
Hi Chris,

Thanks very much for pointing out this problem!

> So it’s still not ideal, as the referenced artifacts will never go to Maven 
> Central and could cause problems with the one or the other user

I agree it’s not ideal. We’ll push the Ratis community to release an official 
version before our next 1.2.x release. 

Initially, the intention in employing Ratis snapshot versions on the master 
branch was to enable our dev/test teams to swiftly validate each Ratis issue 
that we encountered, reported, and fixed. That’s why we would periodically 
cherry-pick the patches and release a temporary snapshot version. 

However, I am unsure of the rationale behind the subsequent decision to rely on 
Ratis snapshot versions in our release versions. It appears that this approach 
may not be appropriate.

In conclusion, I think it’s OK to use a snapshot version on the master branch, 
but we should use an official, stable version in our releases.
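Concretely, a release branch would pin an official Ratis release in the pom rather than a commit-hash SNAPSHOT. The fragment below is a sketch; the version number is illustrative, not a recommendation:

```xml
<!-- Sketch: release branches should reference an official Ratis release,
     never a commit-hash SNAPSHOT that Nexus will eventually delete.
     The version shown is illustrative. -->
<dependency>
  <groupId>org.apache.ratis</groupId>
  <artifactId>ratis-server</artifactId>
  <version>2.5.1</version>
</dependency>
```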

Best,
William


> On September 13, 2023, at 22:31, Christofer Dutz wrote:
> 
> Hi all,
> 
> after some discussions with colleagues, it turns out that it’s not quite as 
> dramatic as I first thought. At first I thought the commit hash was some way 
> to address one fixed SNAPSHOT version via some mechanism I just didn’t know 
> yet, but it turns out to be a lot simpler: it produces a SNAPSHOT for 
> version “2.5.2-a4398bf”, an artificial version for which, again, only 
> 3-5 SNAPSHOTs will be kept.
> 
> It seems to be a shorthand way of unofficially releasing things without 
> actually releasing them.
> 
> So it’s still not ideal, as the referenced artifacts will never go to Maven 
> Central and could cause problems with the one or the other user, but I don’t 
> see it as an immediate threat.
> 
> Chris
> 
> From: Christofer Dutz 
> Date: Wednesday, September 13, 2023, 11:01
> To: dev@iotdb.apache.org 
> Subject: Ratis SNAPSHOT versions in our latest release ...
> Hi,
> 
> I’m currently working on resolving some of the dependency version issues we 
> are having.
> Most people will not have noticed, but currently we’re pulling in up to 4 
> different versions of a jar in our build. This can cause many extremely 
> hard-to-spot problems.
> 
> While trying to fix a problem with metrics-core version 4.2.7, which was 
> pulling in an older version via Ratis, I noticed us using:
> 
> 2.5.2-a4398bf-SNAPSHOT
> 
> This is extremely problematic. Currently the Apache Nexus server only keeps 5 
> SNAPSHOT versions and then deletes old ones. This means that we regularly 
> have to bump the SNAPSHOT version of Ratis.
> 
> This got me thinking and I checked the release branch for the 1.2.x branch. 
> Here we’re using the same.
> 
> The problem with using SNAPSHOTs on master is not that severe, but using them 
> in releases is very problematic. I guess we’ll only be able to build our last 
> release for a few more days/weeks, and then it will no longer be buildable.
> 
> Are we relying on things in Ratis, that are not yet released?
> 
> We should probably encourage the Ratis folks to head for a new release 
> (Ideally with my latest Ratis-PR merged).
> 
> Chris



Re: Fixing flaky tests?

2023-08-06 Thread William Song
Sure, will take a look.
William

> On August 7, 2023, at 10:47, Xinyu Tan wrote:
> 
> Hi William,
> 
> In my PR (https://github.com/apache/iotdb/pull/10789), there was an NPE 
> (NullPointerException) error in the test for 'oneMemberGroupChange' 
> (https://github.com/apache/iotdb/actions/runs/5764037692/job/15640048487?pr=10789).
>  You may want to investigate the cause of this issue.
> 
> Thanks
> --
> Xinyu Tan
> 
> On 2023/08/04 14:59:51 William Song wrote:
>> Hi Chris,
>> 
>> I will take a look at RatisConsensusTest. In case the tests fail next time, 
>> feel free to mention me directly in the PR. This way, I can view the 
>> complete error stack. 
>> 
>> William
>> 
>>> On August 4, 2023, at 17:13, Christofer Dutz wrote:
>>> 
>>> Hi all,
>>> 
>>> So, in the past days I‘ve been building IoTDB on several OSes and have 
>>> noticed some tests repeatedly failing the build, but succeeding as soon 
>>> as I run them again.
>>> To sum it up it’s mostly these tests:
>>> 
>>> — IoTDB: Core: Consensus
>>> 
>>> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
>>> Cann…
>>> 
>>> 
>>> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
>>> in...
>>> 
>>> 
>>> 
>>> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
>>> org.apache.iotdb
>>> 
>>> 
>>> — IoTDB: Core: Node Commons
>>> 
>>> Keeps on failing because of left-over iotdb server instances.
>>> 
>>> I would be happy to tackle the regularly failing Node Commons tests by 
>>> implementing the Test-Runner that I mentioned before, which will start and 
>>> run IoTDB inside the VM running the tests, so the instance will be shut 
>>> down as soon as the test is finished. This should eliminate that problem. 
>>> However, I have no idea if anyone is working on the RatisConsensusTest and 
>>> the ReplicateTest.
>>> 
>>> Chris
>> 
>> 



Re: Fixing flaky tests?

2023-08-04 Thread William Song
Hi Chris,

I will take a look at RatisConsensusTest. In case the tests fail next time, 
feel free to mention me directly in the PR. This way, I can view the complete 
error stack. 

William

> On August 4, 2023, at 17:13, Christofer Dutz wrote:
> 
> Hi all,
> 
> So, in the past days I‘ve been building IoTDB on several OSes and have 
> noticed some tests repeatedly failing the build, but succeeding as soon as 
> I run them again.
> To sum it up it’s mostly these tests:
> 
> — IoTDB: Core: Consensus
> 
> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
> Cann…
> 
> 
> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
> in...
> 
> 
> 
> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
> org.apache.iotdb
> 
> 
> — IoTDB: Core: Node Commons
> 
> Keeps on failing because of left-over iotdb server instances.
> 
> I would be happy to tackle the regularly failing Node Commons tests by 
> implementing the Test-Runner that I mentioned before, which will start and 
> run IoTDB inside the VM running the tests, so the instance will be shut down 
> as soon as the test is finished. This should eliminate that problem. However, 
> I have no idea if anyone is working on the RatisConsensusTest and the 
> ReplicateTest.
> 
> Chris
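The per-VM test-runner idea Chris describes could be sketched roughly as follows. EmbeddedServer here is a hypothetical stand-in for an in-process IoTDB instance, not a real API:

```java
// Sketch of the per-VM test-runner idea: start an embedded server before the
// tests run and guarantee shutdown when the VM running the tests exits, so no
// left-over server instance can break the next build.
public class TestRunnerSketch {
    // Hypothetical stand-in for an in-process IoTDB instance.
    static class EmbeddedServer {
        private boolean running;
        void start() { running = true; }
        void stop() { running = false; }
        boolean isRunning() { return running; }
    }

    public static void main(String[] args) {
        EmbeddedServer server = new EmbeddedServer();
        server.start();
        // A shutdown hook ensures the instance dies with this VM even if a
        // test crashes the runner.
        Runtime.getRuntime().addShutdownHook(new Thread(server::stop));
        try {
            // ... run the test suite against the embedded instance here ...
            if (!server.isRunning()) {
                throw new AssertionError("server should be up during tests");
            }
        } finally {
            server.stop(); // normal path: stop as soon as the tests finish
        }
        System.out.println("server running after tests: " + server.isRunning());
    }
}
```

In a real build this would live in a JUnit lifecycle hook (e.g. a before-all/after-all pair) rather than a main method, but the ownership idea is the same: the VM that runs the tests owns the server's lifetime.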



Re: [PROPOSAL] Enhance Read Consistency Level During Restart in RatisConsensus

2023-07-19 Thread William Song
Hi Chris,

> Trust lost easily, and hard to regain.

Couldn’t agree more. 

Maybe we should consider implementing lease read to balance consistency and 
latency later, after pull/10597 
(https://github.com/apache/iotdb/pull/10597). CC Xinyu.

William

> On July 19, 2023, at 14:46, Christofer Dutz wrote:
> 
> Hi,
> 
> I agree that it’s better to have the defaults produce safer (more consistent) 
> results and document optimization options for users that want/need them and 
> know about the potential drawbacks.
> Admittedly I’m not yet too deep in the internals of IoTDB, but at least this 
> would be my expectation on a user-level.
> 
> I’m currently reviewing our “competitor” solutions, and inconsistencies were 
> what made me instantly dislike the one or the other solution. Trust lost 
> easily, and hard to regain.
> 
> Chris
> 
> 
> From: William Song 
> Date: Wednesday, July 19, 2023, 04:14
> To: dev@iotdb.apache.org 
> Subject: [PROPOSAL] Enhance Read Consistency Level During Restart in 
> RatisConsensus
> Hi dev,
> 
> I'd like to draw your attention to an existing issue in our current read 
> consistency level within the RatisConsensus module. As it stands, the default 
> level is set to “query statemachine directly”, which, while latency-friendly, 
> has led to user-reported bugs. Specifically, these bugs relate to the 
> production of inconsistent results in subsequent SQL queries during a 
> restart, creating a phantom read problem that may be confusing for our users.
> 
> To address this issue, I propose that we temporarily increase the read 
> consistency level to linearizable read during restarts. This will ensure that 
> we maintain data consistency during the critical recovery period. Once the 
> cluster has successfully finished recovering from previous logs, we can then 
> revert to the default consistency level.
> 
> You can find more details about this proposed solution in the linked pull 
> request: https://github.com/apache/iotdb/pull/10597.
> 
> **Please note** that this change may affect modules (including CQ, schema 
> region, and data region) that call RatisConsensus.read during the restart 
> process. In such cases, a RatisUnderRecoveryException may be returned, 
> indicating that RatisConsensus cannot serve read requests while it is 
> replaying the RaftLog. Therefore, we strongly encourage the affected modules 
> to handle this situation appropriately, for example by implementing a retry 
> mechanism.
> 
> I look forward to hearing your thoughts on this proposal. Your feedback and 
> suggestions will be appreciated.
> 
> Regards
> William Song



[PROPOSAL] Enhance Read Consistency Level During Restart in RatisConsensus

2023-07-18 Thread William Song
Hi dev,

I'd like to draw your attention to an existing issue in our current read 
consistency level within the RatisConsensus module. As it stands, the default 
level is set to “query statemachine directly”, which, while latency-friendly, 
has led to user-reported bugs. Specifically, these bugs relate to the 
production of inconsistent results in subsequent SQL queries during a restart, 
creating a phantom read problem that may be confusing for our users.

To address this issue, I propose that we temporarily increase the read 
consistency level to linearizable read during restarts. This will ensure that 
we maintain data consistency during the critical recovery period. Once the 
cluster has successfully finished recovering from previous logs, we can then 
revert to the default consistency level.

You can find more details about this proposed solution in the linked pull 
request: https://github.com/apache/iotdb/pull/10597.

**Please note** that this change may affect modules (including CQ, schema 
region, and data region) that call RatisConsensus.read during the restart 
process. In such cases, a RatisUnderRecoveryException may be returned, 
indicating that RatisConsensus cannot serve read requests while it is replaying 
the RaftLog. Therefore, we strongly encourage the affected modules to handle 
this situation appropriately, for example by implementing a retry mechanism.
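The suggested retry handling could be sketched as below. RatisUnderRecoveryException here is a local stand-in for the real exception, and the helper names are hypothetical:

```java
// Sketch of retry handling for callers of RatisConsensus.read during restart.
// RatisUnderRecoveryException is a stand-in for the real exception class;
// readWithRetry and Read are hypothetical helper names.
public class RetryReadSketch {
    static class RatisUnderRecoveryException extends Exception {}

    interface Read<T> {
        T get() throws RatisUnderRecoveryException;
    }

    // Retry the read with a fixed backoff until the consensus layer has
    // finished replaying the RaftLog, or attempts run out.
    static <T> T readWithRetry(Read<T> read, int maxAttempts, long backoffMs)
            throws Exception {
        RatisUnderRecoveryException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return read.get();
            } catch (RatisUnderRecoveryException e) {
                last = e; // still recovering: back off and try again
                Thread.sleep(backoffMs);
            }
        }
        throw last; // recovery took longer than we were willing to wait
    }

    public static void main(String[] args) {
        try {
            // Simulated read: fails twice while "recovering", then succeeds.
            int[] calls = {0};
            String result = readWithRetry(() -> {
                if (calls[0]++ < 2) throw new RatisUnderRecoveryException();
                return "ok";
            }, 5, 1L);
            System.out.println(result + " after " + calls[0] + " attempts");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Whether a module should retry, fail fast, or queue the request depends on its semantics; CQ scheduling, for instance, may prefer to skip a cycle rather than block.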

I look forward to hearing your thoughts on this proposal. Your feedback and 
suggestions will be appreciated.

Regards
William Song



Re: [VOTE] Apache IoTDB 1.0.0 RC5 release

2022-12-04 Thread William Song
+1 (non-binding)

* Verified git hash and checksum
* Built IoTDB locally from source
* Ran IoTDB tests locally

Best Regards,
Song Ziyang



Re: Change the name of StandAloneConsensus

2022-10-31 Thread William Song
Since there is only one replica and no consensus process is involved, how about 
we remove the ‘Consensus’ suffix and call it ‘SingleReplica’?

Regards,
Song

> On October 31, 2022, at 12:39, Yuan Tian wrote:
> 
> Hi, all
> 
> Now, we name the consensus that is optimized for only a single
> replica (actually, this consensus can only support one replica)
> StandAloneConsensus. However, `StandAlone` is ambiguous; it was used
> for the IoTDB standalone version (vs. the distributed version).
> 
> So, we decided to change its name, here are some candidates:
> 
> 1. OneCopyConsensus
> 2. NoCopyConsensus
> 3. SimpleConsensus
> 4. ZeroCostConsensus
> 
> Do you guys have any suggestions? Or which one of the above names would you 
> vote for?
> 
> 
> Best,
> --
> Yuan Tian