Re: Introducing a memory control mechanism during the query planning stage #12573
Hi Lanyu,

Bravo! I learned a lot from your PR.

Best,
Ziyang

> On May 23, 2024, at 11:08, Liao Lanyu <1435078...@qq.com.INVALID> wrote:
>
> Hi,
>
> Currently, the IoTDB query engine does not implement memory control at the FE (Frontend) stage. In scenarios with massive series queries (e.g., select * from root.**), the query plans generated at the FE stage can become excessively large. As a rough estimate, a single SeriesScanNode is about 0.5 KB, so two million series correspond to two million SeriesScanNodes occupying roughly 1 GB, posing a risk of Out-Of-Memory (OOM). In high-concurrency scenarios, even if no single query plan is large, the total memory occupied by multiple concurrent plans can still lead to OOM.
>
> Therefore, we would like to introduce memory size control for FE query plans within the query engine.
>
> The PR is: https://github.com/apache/iotdb/pull/12573
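The back-of-the-envelope estimate above (0.5 KB per SeriesScanNode, so ~1 GB for two million series) suggests what a frontend budget check could look like. Below is a minimal, hypothetical Python sketch of such a guard; the names (`PlanBudgetExceededError`, `check_plan_budget`) and the 256 MB budget are illustrative assumptions, not IoTDB's actual API:

```python
# Hypothetical sketch of an FE plan-memory guard.
# Sizes, budget, and class names are illustrative, not IoTDB's real API.

SERIES_SCAN_NODE_BYTES = 512             # ~0.5 KB per SeriesScanNode (rough estimate)
PLAN_MEMORY_BUDGET = 256 * 1024 * 1024   # assume 256 MB reserved for FE query plans


class PlanBudgetExceededError(Exception):
    """Raised when an estimated plan would not fit the FE memory budget."""


def estimate_plan_bytes(num_series: int) -> int:
    """Roughly estimate plan size from the number of matched series."""
    return num_series * SERIES_SCAN_NODE_BYTES


def check_plan_budget(num_series: int, reserved: int = 0) -> int:
    """Reject a query early if its plan, plus plans already reserved by
    concurrent queries, would exceed the budget."""
    needed = estimate_plan_bytes(num_series)
    if reserved + needed > PLAN_MEMORY_BUDGET:
        raise PlanBudgetExceededError(
            f"plan needs ~{needed} B, only "
            f"{PLAN_MEMORY_BUDGET - reserved} B left")
    return needed
```

Note that the `reserved` parameter captures the high-concurrency case from the email: many individually small plans can still exhaust the budget together.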
[DISCUSS] Enable auto balance for schemaregion
Hi Dev Team,

I'm writing to discuss the auto balancing feature in IoTDB, which is currently operational for data regions. This feature is designed to optimize resource use, maximize throughput, minimize response time, and prevent overloading of any individual resource. However, it is presently inactive by default for the schema region.

The initial rationale for this decision was the instability observed in the underlying consensus layer (Ratis) during leader transitions. When a leader election failed, the previous leader was forced to step down but no new leader was elected, which could lead to temporary unavailability of the schema region.

Encouragingly, with the upgrade to Ratis 3.0.0, the Ratis community introduced a notable enhancement to the leader transition process: 3.0.0 facilitates smoother transitions between nodes, and we have successfully incorporated it, as detailed in our recent PR: https://github.com/apache/iotdb/pull/11785

Given these advancements, I propose we revisit our current settings for the schema region. Specifically, I recommend enabling the auto balance feature by default for the schema region. What do you think?

Best regards,
Ziyang
Celebrating the Release of Ratis 3.0.0 on Our Community's Contributions
Dear Community Members,

I am excited to announce the release of Ratis version 3.0.0, the first major version update since IoTDB began using Ratis for consensus services. This release encompasses a multitude of new features, enhancements, and fixes, many of which are the result of the dedicated efforts of our IoTDB community.

Reflecting on the past year, it is remarkable to look back on the journey that unfolded after several key members (Mr. Huang, Mr. Qiao, and Xinyu Tan) made the pivotal decision to use Ratis for data replication. I wish to extend my gratitude to each one of you for your contributions, whether direct or indirect. Our collective success is a testament to the strength of our community spirit, resonating perfectly with the Apache Software Foundation's ethos: "Community Over Code."

In the upcoming period, we plan to integrate these features and improvements into the master branch. Each addition will undergo backtesting and validation to ensure optimal performance and reliability. Your feedback and insights are invaluable to us, and I eagerly look forward to your thoughts and suggestions.

Thank you once again for your support and dedication. Together, we are shaping a brighter future for our project.

Warm regards,
William
Re: Ratis SNAPSHOT versions in our latest release ...
Hi Chris,

Thanks very much for pointing out this problem!

> So it's still not ideal, as the referenced artifacts will never go to Maven Central and could cause problems with one or the other user

I agree it's not ideal. We'll push the Ratis community to release an official version before our next 1.2.x release.

Initially, the intention in employing Ratis snapshot versions on the master branch was to enable our dev/test teams to swiftly validate each Ratis issue that we encountered, reported, and fixed. That's why we would periodically cherry-pick the patches and release a temporary snapshot version. However, I am unsure of the rationale behind the subsequent decision to rely on Ratis snapshot versions in our release versions; that approach does not appear to be appropriate.

In conclusion, I think it's OK to use a snapshot version on the master branch, but we should use an official, stable version in our releases.

Best,
William

> On September 13, 2023, at 22:31, Christofer Dutz wrote:
>
> Hi all,
>
> After some discussions with colleagues, it turns out that it's not quite as dramatic as I first thought. At first I assumed the commit hash was a way to pin one fixed SNAPSHOT version via some mechanism I just didn't know yet, but it turns out to be a lot simpler: it produces a SNAPSHOT for version "2.5.2-a4398bf", an artificial version for which, again, 3-5 SNAPSHOTs will be kept.
>
> It seems to be a shorthand way of unofficially releasing things without actually releasing them.
>
> So it's still not ideal, as the referenced artifacts will never go to Maven Central and could cause problems with one or the other user, but I don't see it as an immediate threat.
>
> Chris
>
> From: Christofer Dutz
> Date: Wednesday, September 13, 2023, 11:01
> To: dev@iotdb.apache.org
> Subject: Ratis SNAPSHOT versions in our latest release ...
>
> Hi,
>
> I'm currently working on resolving some of the dependency version issues we are having. Mostly people will not have noticed, but currently we're pulling in up to 4 different versions of a jar in our build. This can cause many extremely hard-to-spot problems.
>
> While trying to fix a problem with metrics-core in version 4.2.7, which pulled in an older version via Ratis, I noticed us using:
>
> 2.5.2-a4398bf-SNAPSHOT
>
> This is extremely problematic. Currently the Apache Nexus server only keeps 5 SNAPSHOT versions and then deletes old ones. This means that we regularly have to bump the SNAPSHOT version of Ratis.
>
> This got me thinking, so I checked the release branch for 1.2.x. There we're using the same version.
>
> The problem with using SNAPSHOTs on master is not that severe, but using them in releases is very problematic. I guess we'll only be able to build our last release for a few more days/weeks, and then it will no longer be buildable.
>
> Are we relying on things in Ratis that are not yet released?
>
> We should probably encourage the Ratis folks to head for a new release (ideally with my latest Ratis PR merged).
>
> Chris
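The core problem in this thread is releasable artifacts depending on versions like `2.5.2-a4398bf-SNAPSHOT`, which Nexus eventually purges. A simple automated guard could flag such versions before a release is cut; here is a generic, illustrative Python sketch (not an actual IoTDB build script, and the dependency names are just examples from the thread):

```python
# Illustrative CI-style check that flags non-releasable dependency versions,
# e.g. Ratis "2.5.2-a4398bf-SNAPSHOT". Not part of IoTDB's real build.
import re

SNAPSHOT_RE = re.compile(r"-SNAPSHOT$", re.IGNORECASE)


def is_snapshot(version: str) -> bool:
    """True for Maven SNAPSHOT versions, which repositories like Nexus
    retain only for a limited number of builds."""
    return bool(SNAPSHOT_RE.search(version))


def check_release_deps(deps):
    """Return the artifacts that must be bumped to an official release
    before the build can be reproducibly released."""
    return sorted(name for name, v in deps.items() if is_snapshot(v))


# Example inputs taken from this thread.
deps = {
    "org.apache.ratis:ratis-server": "2.5.2-a4398bf-SNAPSHOT",
    "io.dropwizard.metrics:metrics-core": "4.2.7",
}
```

Running such a check as a release-branch gate would have caught the 1.2.x situation Chris describes: the build passes today but becomes unbuildable once the snapshot is purged.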
Re: Fixing flaky tests?
Sure, will take a look.

William

> On August 7, 2023, at 10:47, Xinyu Tan wrote:
>
> Hi William,
>
> In my PR (https://github.com/apache/iotdb/pull/10789), there was an NPE (NullPointerException) in the test for 'oneMemberGroupChange' (https://github.com/apache/iotdb/actions/runs/5764037692/job/15640048487?pr=10789). You may want to investigate the cause of this issue.
>
> Thanks
> --
> Xinyu Tan
>
> On 2023/08/04 14:59:51 William Song wrote:
>> Hi Chris,
>>
>> I will take a look at RatisConsensusTest. If the tests fail next time, feel free to mention me directly in the PR so that I can see the complete error stack.
>>
>> William
>>
>>> On August 4, 2023, at 17:13, Christofer Dutz wrote:
>>>
>>> Hi all,
>>>
>>> Over the past days I've been building IoTDB on several OSes and have noticed some tests repeatedly failing the build, but succeeding as soon as I run them again. To sum it up, it's mostly these tests:
>>>
>>> — IoTDB: Core: Consensus
>>>
>>> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer Cann…
>>> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot in...
>>> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO org.apache.iotdb
>>>
>>> — IoTDB: Core: Node Commons
>>>
>>> Keeps failing because of left-over IoTDB server instances.
>>>
>>> I would be happy to tackle the regularly failing Node Commons tests by implementing the test runner I mentioned before, which will start and run IoTDB inside the VM that runs the tests, so the instance is shut down as soon as the test finishes. That should eliminate the problem. However, I have no idea whether anyone is working on RatisConsensusTest and ReplicateTest.
>>>
>>> Chris
Re: Fixing flaky tests?
Hi Chris,

I will take a look at RatisConsensusTest. If the tests fail next time, feel free to mention me directly in the PR so that I can see the complete error stack.

William

> On August 4, 2023, at 17:13, Christofer Dutz wrote:
>
> Hi all,
>
> Over the past days I've been building IoTDB on several OSes and have noticed some tests repeatedly failing the build, but succeeding as soon as I run them again. To sum it up, it's mostly these tests:
>
> — IoTDB: Core: Consensus
>
> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer Cann…
> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot in...
> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO org.apache.iotdb
>
> — IoTDB: Core: Node Commons
>
> Keeps failing because of left-over IoTDB server instances.
>
> I would be happy to tackle the regularly failing Node Commons tests by implementing the test runner I mentioned before, which will start and run IoTDB inside the VM that runs the tests, so the instance is shut down as soon as the test finishes. That should eliminate the problem. However, I have no idea whether anyone is working on RatisConsensusTest and ReplicateTest.
>
> Chris
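While the root causes (NPEs, leftover server instances) are being fixed, a common stopgap for flaky tests like these is a bounded automatic re-run. A generic Python sketch of that pattern follows; it is purely illustrative and not IoTDB's actual CI configuration, and the real fix remains removing the nondeterminism:

```python
# Generic flaky-test mitigation: re-run a failing test a bounded number of
# times before declaring it failed. Illustrative only; the proper fix is to
# eliminate the nondeterminism itself.
import functools


def retry_flaky(times=3, exceptions=(AssertionError,)):
    """Decorator: re-run the wrapped test up to `times` attempts,
    re-raising the last failure if none succeed."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last = None
            for _ in range(times):
                try:
                    return test_fn(*args, **kwargs)
                except exceptions as exc:
                    last = exc
            raise last
        return wrapper
    return decorator


attempts = {"n": 0}


@retry_flaky(times=3)
def flaky_test():
    attempts["n"] += 1
    if attempts["n"] < 3:          # simulated flakiness: fails twice, then passes
        raise AssertionError("transient failure")
    return "passed"
```

Java test frameworks offer equivalents (e.g. rerun-on-failure options in Surefire), so the idea transfers directly to a Maven build; the caveat is that retries hide genuine bugs, which is why Chris's in-VM test runner is the better long-term answer for the leftover-instance failures.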
Re: [PROPOSAL] Enhance Read Consistency Level During Restart in RatisConsensus
Hi Chris,

> Trust lost easily, and hard to regain.

Couldn't agree more. Perhaps we should consider implementing lease reads later, after pull/10597 <https://github.com/apache/iotdb/pull/10597>, to balance consistency and latency. CC Xinyu.

William

> On July 19, 2023, at 14:46, Christofer Dutz wrote:
>
> Hi,
>
> I agree that it's better to have the defaults produce safer (more consistent) results and document optimization options for users who want/need them and know about the potential drawbacks. Admittedly I'm not yet too deep into the internals of IoTDB, but at least this would be my expectation on a user level.
>
> I'm currently reviewing our "competitor" solutions, and inconsistencies were what made me dislike one or the other solution instantly. Trust lost easily, and hard to regain.
>
> Chris
>
> From: William Song
> Date: Wednesday, July 19, 2023, 04:14
> To: dev@iotdb.apache.org
> Subject: [PROPOSAL] Enhance Read Consistency Level During Restart in RatisConsensus
>
> Hi dev,
>
> I'd like to draw your attention to an existing issue with the current read consistency level in the RatisConsensus module. As it stands, the default level is "query the state machine directly", which, while latency-friendly, has led to user-reported bugs. Specifically, these bugs relate to inconsistent results in subsequent SQL queries during a restart, creating a phantom-read problem that may confuse our users.
>
> To address this issue, I propose that we temporarily raise the read consistency level to linearizable read during restarts. This ensures that we maintain data consistency during the critical recovery period. Once the cluster has successfully finished recovering from previous logs, we can revert to the default consistency level.
>
> You can find more details about this proposed solution in the linked pull request: https://github.com/apache/iotdb/pull/10597
>
> **Please note** that this change may affect modules (including CQ, the schema region, and the data region) that call RatisConsensus.read during the restart process. In such cases, a RatisUnderRecoveryException may be returned, indicating that RatisConsensus cannot serve read requests while it is replaying the RaftLog. We therefore strongly encourage the affected modules to handle this situation appropriately, for example by implementing a retry mechanism.
>
> I look forward to hearing your thoughts on this proposal. Your feedback and suggestions will be appreciated.
>
> Regards
> William Song
[PROPOSAL] Enhance Read Consistency Level During Restart in RatisConsensus
Hi dev,

I'd like to draw your attention to an existing issue with the current read consistency level in the RatisConsensus module. As it stands, the default level is "query the state machine directly", which, while latency-friendly, has led to user-reported bugs. Specifically, these bugs relate to inconsistent results in subsequent SQL queries during a restart, creating a phantom-read problem that may confuse our users.

To address this issue, I propose that we temporarily raise the read consistency level to linearizable read during restarts. This ensures that we maintain data consistency during the critical recovery period. Once the cluster has successfully finished recovering from previous logs, we can revert to the default consistency level. You can find more details about this proposed solution in the linked pull request: https://github.com/apache/iotdb/pull/10597

**Please note** that this change may affect modules (including CQ, the schema region, and the data region) that call RatisConsensus.read during the restart process. In such cases, a RatisUnderRecoveryException may be returned, indicating that RatisConsensus cannot serve read requests while it is replaying the RaftLog. We therefore strongly encourage the affected modules to handle this situation appropriately, for example by implementing a retry mechanism.

I look forward to hearing your thoughts on this proposal. Your feedback and suggestions will be appreciated.

Regards
William Song
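The retry mechanism the proposal recommends for callers hitting RatisUnderRecoveryException could look like the sketch below. The exception name comes from the proposal itself; the helper function, backoff parameters, and the fake read used for demonstration are illustrative assumptions, not IoTDB's real API:

```python
# Sketch of the retry mechanism recommended for modules that call
# RatisConsensus.read during restart. Only the exception name is taken
# from the proposal; the rest is an illustrative assumption.
import time


class RatisUnderRecoveryException(Exception):
    """Raised while the consensus layer is still replaying the RaftLog."""


def read_with_retry(read_fn, max_attempts=5, backoff_s=0.01):
    """Retry a consensus read with exponential backoff until recovery
    finishes or the attempt budget is exhausted."""
    delay = backoff_s
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except RatisUnderRecoveryException:
            if attempt == max_attempts:
                raise            # give up: recovery is taking too long
            time.sleep(delay)
            delay *= 2           # back off to avoid hammering recovery


# Demonstration: a fake read that succeeds on the third attempt,
# simulating a node that finishes replaying its log mid-retry.
state = {"recovered_after": 3, "calls": 0}


def fake_read():
    state["calls"] += 1
    if state["calls"] < state["recovered_after"]:
        raise RatisUnderRecoveryException()
    return "linearizable-result"
```

Bounding the attempts matters: a module such as CQ should eventually surface the error rather than block indefinitely if recovery stalls.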
Re: [VOTE] Apache IoTDB 1.0.0 RC5 release
+1 (non-binding)

* Verified git hash and checksum
* Built IoTDB locally from source
* Ran IoTDB tests locally

Best Regards,
Song Ziyang
Re: Change the name of StandAloneConsensus
Since there is only one replica and no consensus process is involved, how about we remove the 'Consensus' suffix and call it 'SingleReplica'?

Regards,
Song

> On October 31, 2022, at 12:39, Yuan Tian wrote:
>
> Hi all,
>
> Currently, we call the consensus implementation that is optimized for a single replica (in fact, it can only support one replica) StandAloneConsensus. However, `StandAlone` is ambiguous: it was also used for the IoTDB standalone version (vs. the distributed version).
>
> So we have decided to change its name; here are some candidates:
>
> 1. OneCopyConsensus
> 2. NoCopyConsensus
> 3. SimpleConsensus
> 4. ZeroCostConsensus
>
> Do you have any suggestions? Or which of the above names would you vote for?
>
> Best,
> --
> Yuan Tian