[BUILD-FAILURE]: Job 'IoTDB/IoTDB-pip-new/master [master] [310]'

2023-08-04 Thread Apache Jenkins Server
BUILD-FAILURE: Job 'IoTDB/IoTDB-pip-new/master [master] [310]':

Check console output at 
https://ci-builds.apache.org/job/IoTDB/job/IoTDB-pip-new/job/master/310/ 
(IoTDB/IoTDB-pip-new/master [master] [310])

Re: Fixing flaky tests?

2023-08-04 Thread 谭新宇
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The test 
failures are caused by the frequent starting and stopping of the thrift server 
in the consensus module tests, which can leave some tests unable to bind to 
the socket during startup.

Regarding the root cause of this issue, we suspect that TCP connections, when 
disconnected, remain in a TIME_WAIT state for about 4 minutes before the 
corresponding port becomes available for reuse. Although we have confirmed that 
the thrift server sets the socket as reusable during startup 
(https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
 it seems that this setting does not work in some CI environments.

As a result, we added logic in https://github.com/apache/iotdb/pull/10530 to 
block and wait for 60 seconds if the socket cannot be bound. However, test 
failures may still occur, and we suspect that waiting for more than 4 minutes 
might be necessary. Consequently, in 
https://github.com/apache/iotdb/pull/10540, we increased the timeout waiting 
period to 300 seconds. Regrettably, test failures still occasionally happen. As 
a result, in https://github.com/apache/iotdb/pull/10723, we introduced logic to 
dynamically detect available ports, hoping that switching to different ports 
could reduce the probability of failure. However, the current situation is that 
even after confirming the port's availability, failures occur during the actual 
startup of the thrift server. Now, I have started another attempt 
(https://github.com/apache/iotdb/pull/10789), but I am uncertain whether it 
will be effective.
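
One possible reading of the last failure mode above is a check-then-bind race: 
between the availability check and the thrift server's own bind, something 
else can take the port. Whether that is the actual cause in CI is an 
assumption. A minimal Java sketch of the alternative, binding directly and 
falling back to the next candidate instead of probing first, could look like 
this. The class and method names are hypothetical and are not taken from any 
of the linked PRs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper; not part of IoTDB or Thrift. */
public final class BindWithFallback {
  private BindWithFallback() {}

  /**
   * Tries each candidate port in order and returns the first socket that binds.
   * The socket is returned still bound, so there is no gap between
   * "port looked free" and "port is in use by us".
   */
  public static ServerSocket bindFirstAvailable(int... candidatePorts) throws IOException {
    IOException last = null;
    for (int port : candidatePorts) {
      ServerSocket socket = new ServerSocket(); // created unbound
      try {
        socket.setReuseAddress(true); // has to be set before bind() to take effect
        socket.bind(new InetSocketAddress("127.0.0.1", port));
        return socket;
      } catch (IOException e) {
        last = e; // remember the failure and try the next candidate
        socket.close();
      }
    }
    throw new IOException("no candidate port could be bound", last);
  }
}
```

Passing 0 as the last candidate lets the operating system pick any free port 
as a final fallback. How the bound socket would then be handed to the thrift 
server is left open here.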

Through this series of efforts, we have managed to significantly reduce the 
probability of encountering issues in the CI, but unfortunately, the problem 
still occasionally reoccurs. This issue is truly frustrating and disheartening. 
I wonder if the community has any better solutions that could help me…

Thanks
—
Xinyu Tan


> On 4 August 2023, at 22:59, William Song wrote:
> 
> Hi Chris,
> 
> I will take a look at RatisConsensusTest. In case the tests fail next time, 
> feel free to mention me directly in the PR. This way, I can view the complete 
> error stack. 
> 
> William
> 
>> On 4 August 2023, at 17:13, Christofer Dutz wrote:
>> 
>> Hi all,
>> 
>> So, in the past days I’ve been building IoTDB on several OSes and have 
>> noticed some tests repeatedly failing the build, but succeeding as soon 
>> as I run them again.
>> To sum it up it’s mostly these tests:
>> 
>> — IoTDB: Core: Consensus
>> 
>> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
>> Cann…
>> 
>> 
>> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
>> in...
>> 
>> 
>> 
>> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
>> org.apache.iotdb
>> 
>> 
>> — IoTDB: Core: Node Commons
>> 
>> Keeps on failing because of left-over iotdb server instances.
>> 
>> I would be happy to tackle the Node Commons tests regularly failing by 
>> implementing the Test-Runner, that I mentioned before, which will start and 
>> run IoTDB inside the VM running the tests, so the instance will be shut down 
>> as soon as the test is finished. This should eliminate that problem. However 
>> I have no idea if anyone is working on the RatisConsensusTest and the 
>> ReplicateTest.
>> 
>> Chris
> 



AW: Fixing flaky tests?

2023-08-04 Thread Christofer Dutz
Hi Xinyu,

No need to apologize … I’m happy that you have an idea on what’s going wrong.

I don’t know if you saw it, but I proposed a test-server module, which starts 
the two parts on random free ports and reports them back to the test starting 
it … this way we’d simply use free ports every time a server is started.
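
As a rough illustration of the "random free port, reported back" idea, here 
is a minimal Java sketch. It is only a stand-in: the class name 
`EphemeralPortServer` is hypothetical, it opens a plain listening socket 
rather than an IoTDB server, and it is not the proposed test-server module.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical stand-in for a test server; not the proposed test-server module. */
public final class EphemeralPortServer implements AutoCloseable {
  private final ServerSocket socket;

  public EphemeralPortServer() throws IOException {
    // Port 0 asks the operating system to pick any currently free port.
    this.socket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
  }

  /** The port the OS actually assigned; the test uses this value to connect. */
  public int port() {
    return socket.getLocalPort();
  }

  @Override
  public void close() throws IOException {
    socket.close();
  }
}
```

A test would create the server in a try-with-resources block, read `port()`, 
and connect to that port instead of a hard-coded one.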

Could this help?

Chris




From: Xinyu Tan 
Date: Friday, 4 August 2023 at 17:56
To: dev@iotdb.apache.org 
Subject: Re: Fixing flaky tests?
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The test 
failures are caused by the frequent starting and stopping of the thrift server 
in the consensus module tests, which can leave some tests unable to bind to 
the socket during startup.

Regarding the root cause of this issue, we suspect that TCP connections, when 
disconnected, remain in a TIME_WAIT state for about 4 minutes before the 
corresponding port becomes available for reuse. Although we have confirmed that 
the thrift server sets the socket as reusable during startup 
(https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
 it seems that this setting does not work in some CI environments.

As a result, we added logic in https://github.com/apache/iotdb/pull/10530 to 
block and wait for 60 seconds if the socket cannot be bound. However, test 
failures may still occur, and we suspect that waiting for more than 4 minutes 
might be necessary. Consequently, in 
https://github.com/apache/iotdb/pull/10540, we increased the timeout waiting 
period to 300 seconds. Regrettably, test failures still occasionally happen. As 
a result, in https://github.com/apache/iotdb/pull/10723, we introduced logic to 
dynamically detect available ports, hoping that switching to different ports 
could reduce the probability of failure. However, the current situation is that 
even after confirming the port's availability, failures occur during the actual 
startup of the thrift server. Now, I have started another attempt 
(https://github.com/apache/iotdb/pull/10789), but I am uncertain whether it 
will be effective.

Through this series of efforts, we have managed to significantly reduce the 
probability of encountering issues in the CI, but unfortunately, the problem 
still occasionally reoccurs. This issue is truly frustrating and disheartening. 
I wonder if the community has any better solutions that could help me…

Thanks
—
Xinyu Tan

On 2023/08/04 14:13:38 Christofer Dutz wrote:
> Hi all,
>
> So, in the past days I’ve been building IoTDB on several OSes and have 
> noticed some tests repeatedly failing the build, but succeeding as soon as 
> I run them again.
> To sum it up it’s mostly these tests:
>
> — IoTDB: Core: Consensus
>
> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
> Cann…
>
>
> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
> in...
>
>
>
> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
> org.apache.iotdb
>
>
> — IoTDB: Core: Node Commons
>
> Keeps on failing because of left-over iotdb server instances.
>
> I would be happy to tackle the Node Commons tests regularly failing by 
> implementing the Test-Runner, that I mentioned before, which will start and 
> run IoTDB inside the VM running the tests, so the instance will be shut down 
> as soon as the test is finished. This should eliminate that problem. However 
> I have no idea if anyone is working on the RatisConsensusTest and the 
> ReplicateTest.
>
> Chris
>


Re: Fixing flaky tests?

2023-08-04 Thread Xinyu Tan
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The test 
failures are caused by the frequent starting and stopping of the thrift server 
in the consensus module tests, which can leave some tests unable to bind to 
the socket during startup.

Regarding the root cause of this issue, we suspect that TCP connections, when 
disconnected, remain in a TIME_WAIT state for about 4 minutes before the 
corresponding port becomes available for reuse. Although we have confirmed that 
the thrift server sets the socket as reusable during startup 
(https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
 it seems that this setting does not work in some CI environments.

As a result, we added logic in https://github.com/apache/iotdb/pull/10530 to 
block and wait for 60 seconds if the socket cannot be bound. However, test 
failures may still occur, and we suspect that waiting for more than 4 minutes 
might be necessary. Consequently, in 
https://github.com/apache/iotdb/pull/10540, we increased the timeout waiting 
period to 300 seconds. Regrettably, test failures still occasionally happen. As 
a result, in https://github.com/apache/iotdb/pull/10723, we introduced logic to 
dynamically detect available ports, hoping that switching to different ports 
could reduce the probability of failure. However, the current situation is that 
even after confirming the port's availability, failures occur during the actual 
startup of the thrift server. Now, I have started another attempt 
(https://github.com/apache/iotdb/pull/10789), but I am uncertain whether it 
will be effective.

Through this series of efforts, we have managed to significantly reduce the 
probability of encountering issues in the CI, but unfortunately, the problem 
still occasionally reoccurs. This issue is truly frustrating and disheartening. 
I wonder if the community has any better solutions that could help me…

Thanks
—
Xinyu Tan

On 2023/08/04 14:13:38 Christofer Dutz wrote:
> Hi all,
> 
> So, in the past days I’ve been building IoTDB on several OSes and have 
> noticed some tests repeatedly failing the build, but succeeding as soon as 
> I run them again.
> To sum it up it’s mostly these tests:
> 
> — IoTDB: Core: Consensus
> 
> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
> Cann…
> 
> 
> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
> in...
> 
> 
> 
> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
> org.apache.iotdb
> 
> 
> — IoTDB: Core: Node Commons
> 
> Keeps on failing because of left-over iotdb server instances.
> 
> I would be happy to tackle the Node Commons tests regularly failing by 
> implementing the Test-Runner, that I mentioned before, which will start and 
> run IoTDB inside the VM running the tests, so the instance will be shut down 
> as soon as the test is finished. This should eliminate that problem. However 
> I have no idea if anyone is working on the RatisConsensusTest and the 
> ReplicateTest.
> 
> Chris
> 


Re: Fixing flaky tests?

2023-08-04 Thread William Song
Hi Chris,

I will take a look at RatisConsensusTest. In case the tests fail next time, 
feel free to mention me directly in the PR. This way, I can view the complete 
error stack. 

William

> On 4 August 2023, at 17:13, Christofer Dutz wrote:
> 
> Hi all,
> 
> So, in the past days I’ve been building IoTDB on several OSes and have 
> noticed some tests repeatedly failing the build, but succeeding as soon as 
> I run them again.
> To sum it up it’s mostly these tests:
> 
> — IoTDB: Core: Consensus
> 
> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
> Cann…
> 
> 
> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
> in...
> 
> 
> 
> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
> org.apache.iotdb
> 
> 
> — IoTDB: Core: Node Commons
> 
> Keeps on failing because of left-over iotdb server instances.
> 
> I would be happy to tackle the Node Commons tests regularly failing by 
> implementing the Test-Runner, that I mentioned before, which will start and 
> run IoTDB inside the VM running the tests, so the instance will be shut down 
> as soon as the test is finished. This should eliminate that problem. However 
> I have no idea if anyone is working on the RatisConsensusTest and the 
> ReplicateTest.
> 
> Chris



Fixing flaky tests?

2023-08-04 Thread Christofer Dutz
Hi all,

So, in the past days I’ve been building IoTDB on several OSes and have noticed 
some tests repeatedly failing the build, but succeeding as soon as I run 
them again.
To sum it up it’s mostly these tests:

— IoTDB: Core: Consensus

RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer Cann…


RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
in...



ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
org.apache.iotdb


— IoTDB: Core: Node Commons

Keeps on failing because of left-over iotdb server instances.

I would be happy to tackle the Node Commons tests regularly failing by 
implementing the Test-Runner, that I mentioned before, which will start and run 
IoTDB inside the VM running the tests, so the instance will be shut down as 
soon as the test is finished. This should eliminate that problem. However I 
have no idea if anyone is working on the RatisConsensusTest and the 
ReplicateTest.
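
The general pattern behind running the server inside the test JVM can be 
sketched in Java as follows. This is only an illustration under assumptions: 
`FakeServer` and `runWithServer` are hypothetical names, no IoTDB classes are 
used, and the real Test-Runner may look quite different.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public final class InProcessServerDemo {
  private InProcessServerDemo() {}

  /** Hypothetical stand-in for a server started inside the test JVM. */
  public static final class FakeServer implements AutoCloseable {
    private final AtomicBoolean running = new AtomicBoolean(false);

    public void start() {
      running.set(true);
    }

    public boolean isRunning() {
      return running.get();
    }

    @Override
    public void close() {
      running.set(false);
    }
  }

  /** Starts the server, runs the test body, and always stops the server, even if the body throws. */
  public static void runWithServer(FakeServer server, Consumer<FakeServer> testBody) {
    try (FakeServer s = server) {
      s.start();
      testBody.accept(s);
    }
  }
}
```

Because the server lives in the same JVM and is closed by try-with-resources, 
a failing test body cannot leave an instance behind for the next test.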

Chris


AW: [DISCUSS] Refactor the way we're building/using Thrift

2023-08-04 Thread Christofer Dutz
So,

I’ve just merged the PR after updating it with the latest changes from master 
and the CI/CD gave its green light.
So now it should be a lot simpler to build thrift and to manage new 
os/architecture/thrift-versions as we no longer have to manually copy 
hand-built executables to our second git-repo.

Right now, all that is needed is to run this command:

./mvnw clean deploy -P with-tools -pl :iotdb-tools-thrift

It would build the assembly for the system the command is run on and deploy 
that in the Nexus snapshot repo, from where everyone can fetch it, without 
building it locally.

As the tools are currently part of the build, we would need to deploy these 
convenience assemblies for each supported platform. I would therefore propose 
to move that part of the build out of the main project.
We could simply add it to a separate git repository. There we would simply 
release that whenever we want to switch to a new Thrift version (which is 
probably going to happen very infrequently).

This would simplify releasing even more.


Chris




From: Christofer Dutz 
Date: Tuesday, 1 August 2023 at 16:00
To: dev@iotdb.apache.org 
Subject: AW: [DISCUSS] Refactor the way we're building/using Thrift
Ok …

so after quite a while of refactoring and cleaning up and tweaking, I think my 
PR is ready for a review.

Chris

From: Christofer Dutz 
Date: Monday, 31 July 2023 at 10:00
To: dev@iotdb.apache.org 
Subject: AW: [DISCUSS] Refactor the way we're building/using Thrift
And again sorry,

turns out I selected the branches the wrong way around …
this is the right PR:
https://github.com/apache/iotdb/pull/10742

Chris

From: Christofer Dutz 
Date: Monday, 31 July 2023 at 09:33
To: dev@iotdb.apache.org 
Subject: AW: [DISCUSS] Refactor the way we're building/using Thrift
Hi Folks,

So, I’ve literally spent all weekend working on my refactoring and must admit 
I’m quite happy with the results.

However, I decided to take the PR out of this repo and work on a fork in my 
GitHub account as I didn’t want to swamp you folks with build failure emails.
So, I closed the old PR and opened a new one:
https://github.com/apache/iotdb/pull/10741

Right now, I still have two TODOs that I would like to work on:

  *   Make the cpp-client build again on Windows (AARCH64 and x86_64). 
      (Currently struggling with it not detecting that it needs a different 
      socket implementation.)
  *   Introduce an “iotdb-test-server” module that integration tests use for 
      running tests against a server:
      *   The current usage of calling scripts in distribution introduces a 
          logical cycle in Maven (Example: cpp-client needs distribution to run 
          tests, distribution bundles cpp-client).
      *   I had many build failures, as the IoTDB server instances are not shut 
          down cleanly.

I have noticed that we generally seem to have a number of flaky tests, that 
randomly fail … as I’ve been running dozens of full builds with integration 
tests on all supported platforms, I’ve had to re-run many builds to continue 
from the last failure.
I’ve started compiling a list of these tests and I think we should fix them, as 
the CI should generally pass and not require an arbitrary number of re-runs.

Chris


From: Christofer Dutz 
Date: Saturday, 29 July 2023 at 16:22
To: dev@iotdb.apache.org 
Subject: AW: [DISCUSS] Refactor the way we're building/using Thrift
Well it seems it’s not quite that simple.

Because, for example, the CMakeLists.txt references directories outside of the 
module itself.
https://github.com/apache/iotdb/blob/master/iotdb-client/client-cpp/src/main/CMakeLists.txt

If you wanted to build the client without having built thrift, this will not 
work.

I’m currently working on making the thrift module sort of independent from the 
rest.
The idea is to build the artifact once on every target platform and to make the 
produced artifacts available via Maven.
Instead of just having one executable downloaded, I’d propose to have an 
archive containing the:

  *   Thrift executable
  *   Thrift runtime shared libraries
  *   Thrift header files

With this, every module would just unpack that archive and use that locally.
This removes any circular dependencies.

However, this module sort of doesn’t fit nicely into the build and it will 
cause problems.
It will always just build thrift for the one platform the RM is currently using.

So, I would propose to move this tiny project into the other git repository, 
that we currently use to serve the pre-compiled binaries.
There, whenever we need a new OS, CPU architecture or Thrift version, we update 
that module and release it (and stage new thrift assemblies in Nexus).
This would eliminate all complexities in the build.

Chris





From: Xiangdong Huang 
Date: Saturday, 29 July 2023 at 08:37
To: dev@iotdb.apache.org 
Subject: Re: [DISCUSS] Refactor the way we're building/using Thrift
Hi  Chris,

> I would like to move the compile-tools directory into the root of the project 
> and detach it from 

AW: [DISCUSS] Updating to a newer Thrift version?

2023-08-04 Thread Christofer Dutz
Ok …

So, it turns out that the Thrift folks intentionally updated to Java 11 but 
took that back and the next release should be based on Java 8 again.
So hopefully that will come soon, and then we could look into the 
compatibility of the generated code. I guess if the wire protocol didn’t change, 
there should generally be no reason for it not to be compatible, and if they 
don’t release a major version, this compatibility should usually stay intact 
(if they more or less follow SemVer).

I guess the best course of action would be to do nothing right now … as an 
alternative, we could of course turn on the Java compilation when building 
thrift and bundle the jar built by that in the assembly.
If we then replace the thrift-lib dependency with a “system” scope dependency, 
we could already do that now. But admittedly I would not be in favor of doing 
that. I would recommend for us to wait for the next official release.

Chris

From: Christofer Dutz 
Date: Tuesday, 1 August 2023 at 08:29
To: dev@iotdb.apache.org 
Subject: Re: [DISCUSS] Updating to a newer Thrift version?
I opened a ticket in their jira. Might even prepare a pr... Shouldn't be too 
difficult.

Chris

Sent from Outlook for Android

From: 谭新宇 <1025599...@qq.com.INVALID>
Sent: Monday, July 31, 2023 1:57:28 PM
To: dev@iotdb.apache.org 
Subject: Re: [DISCUSS] Updating to a newer Thrift version?

Hi, Chris

In the latest version of thrift, there are some improvements we'd like to have. 
For example, https://issues.apache.org/jira/browse/THRIFT-5502 will tone down 
the "connection reset" warn logs.

+1 for upgrading thrift.


Thanks

Xinyu Tan

> On 31 July 2023, at 19:50, Christofer Dutz wrote:
>
> Hi all,
>
> While working on the cleanup of the build, I noticed we’re working with 
> Thrift in version 0.14.1 however the latest version is 0.18.1
>
> Is there a reason we’re sticking to a version two years older than the newest?
>
> If not: with the pom-cleanup refactoring it should be a thing of minutes to 
> update this.
>
> Chris
>


AW: [DISCUSS] Adding the generation of sboms to our build?

2023-08-04 Thread Christofer Dutz
Ok …

so after merging my branch with the pom refactoring, the SBOM generation is 
now also part of an apache-release.

Chris

From: Christofer Dutz 
Date: Tuesday, 1 August 2023 at 17:00
To: dev@iotdb.apache.org 
Subject: AW: [DISCUSS] Adding the generation of sboms to our build?
However,

this includes a LOT more than that change, so I guess a bit more review would 
be needed, right? ;-)

Chris

From: Xiangdong Huang 
Date: Tuesday, 1 August 2023 at 16:02
To: dev@iotdb.apache.org 
Subject: Re: [DISCUSS] Adding the generation of sboms to our build?
+1 for moving to the master branch.

---
Xiangdong Huang
School of Software, Tsinghua University

On Tuesday, 1 August 2023 at 22:00, Christofer Dutz wrote:
>
> I added the config to my pr here:
> https://github.com/apache/iotdb/pull/10742/commits/c4f4d2e874fd7c1ae4332062e29770925dce7024
>
> Chris
>
>
> From: Xiangdong Huang 
> Date: Saturday, 29 July 2023 at 08:48
> To: dev@iotdb.apache.org 
> Subject: Re: [DISCUSS] Adding the generation of sboms to our build?
> Cool, CycloneDX is famous. Looking forward to it!
> ---
> Xiangdong Huang
>
>
> On Saturday, 15 July 2023 at 22:59, Christofer Dutz wrote:
> >
> > Well in PLC4X the plugin generates an XML version of the SBOM.
> > We’re using this plugin:
> > https://github.com/CycloneDX/cyclonedx-maven-plugin
> >
> > Chris
> >
> > From: Xiangdong Huang 
> > Date: Saturday, 15 July 2023 at 07:58
> > To: dev@iotdb.apache.org 
> > Subject: Re: [DISCUSS] Adding the generation of sboms to our build?
> > Hi Chris,
> >
> > Looking forward to it! SBOM has also received a lot of attention in China.
> > Which format/standard will it follow?
> >
> > Best,
> > ---
> > Xiangdong Huang
> >
> > On Friday, 14 July 2023 at 21:28, Christofer Dutz wrote:
> > >
> > > Hi all,
> > >
> > > here in Europe we’re currently preparing for quite a bit of an earthquake 
> > > caused by the Cyber Resilience Act. In some projects I’m involved in 
> > > (mainly PLC4X) I’ve started initiating small changes which could help us 
> > > come out without too many problems.
> > >
> > > One thing that seems to be coming up in both the EU and the US 
> > > acts is the requirement to publish SBOM information (Software Bill of 
> > > Materials). As we are also using Maven as a build tool, I’ve got a 
> > > configuration in our poms that ensures an Apache release also produces an 
> > > SBOM that we will be able to deploy.
> > >
> > > Are we interested in adding that to the IoTDB build?
> > >
> > > Chris