Hi,

Since Monday afternoon, Policeman Jenkins has been running on newer hardware (AMD Ryzen 7 3700X octa-core, DDR4 ECC, 2× 1 TB NVMe SSD in RAID), again hosted at Hetzner. The reason for the change was the aging Intel Xeon E3-1275 v5 quad-core Skylake and the NVMe SSDs, which were reaching the end of their lifetime (they had already been replaced once by the data center staff a while ago, after the first SMART errors). So Lucene and Solr tests are eating your SSDs, be aware of that!!! 😊
One reason for the high SSD wear rate was also a "bug" in the original Ubuntu install image of the older server. The RAID had been created back then on Linux using the mdadm framework. The partitions themselves were aligned to 2048 sectors (the default on many operating systems), but the RAID device inside the partitions used a poor alignment for the first data sector behind the RAID header. As a result, the filesystem device actually in use did not start at a multiple of 2048 sectors, which causes additional reads and writes, because SSDs internally work with large blocks. On the new server (which was simply copied over by dd'ing the filesystem devices across the network), I corrected the RAID alignment.

Jenkins has also been using libeatmydata.so for a few weeks now, via LD_PRELOAD in all jobs, so fsync() is swallowed by libc. I'd recommend that to others, because data safety is not an issue on a CI box: `apt install eatmydata`, then in the Jenkins node config add an environment variable through the GUI that is applied to all jobs: LD_PRELOAD=libeatmydata.so.

With the new CPU, builds seem to be up to two times faster. FYI, the Linux builds run with tests.multiplicator=3 to better trigger JVM failures. You have to keep that in mind when you reproduce failures; it looks like this setting is not printed in the reproduce lines.

There were some false failures during the night from Monday to Tuesday, caused by the usual gettimeofday() issue on macOS. This is a bug in the macOS kernel that affects Java and all other software calling nanoTime/currentTimeMillis very often (causing a SIGFPE on "time jumps"). Recent Darwin kernels replaced the assembly code with a safer C version (both the old and the new code read the time volatile from a special "commpage" that is mapped into every process, so no syscall is needed). So I updated the macOS machine from El Capitan to Mojave (not Catalina yet, as 32-bit support was dropped there). Since then, the gettimeofday errors seem to be fixed.
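For those who want to check their own mdadm setups for the alignment problem described above, here is a minimal sketch. The partition start comes from `fdisk -l`, and the data offset from the "Data Offset" line of `mdadm --examine <member device>`; the numeric values below are just examples, not taken from the actual server.

```shell
# Sketch: check whether the filesystem inside an mdadm RAID member starts on a
# 2048-sector boundary. Example values; substitute the real ones from
# `fdisk -l` and `mdadm --examine /dev/nvme0n1pX` on your machine.
PART_START=2048       # first sector of the partition (512-byte sectors)
DATA_OFFSET=262144    # sectors reserved for the md superblock/header
FS_START=$((PART_START + DATA_OFFSET))
if [ $((FS_START % 2048)) -eq 0 ]; then
  echo "aligned at sector $FS_START"
else
  echo "misaligned at sector $FS_START"
fi
```

If the result is misaligned, every filesystem block straddles two of the SSD's internal blocks, doubling the physical reads and writes.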
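The eatmydata setup is just two steps; here it is spelled out as a config sketch (Debian/Ubuntu package name, library name as shipped by that package):

```shell
# Sketch of the eatmydata setup described above (Debian/Ubuntu):
#
#   apt install eatmydata
#
# Then, in the Jenkins node configuration (GUI), add an environment variable
# that is applied to every job on that node:
#
#   LD_PRELOAD=libeatmydata.so
#
# With the library preloaded, fsync()/fdatasync()/sync() become no-ops for all
# build processes, which greatly reduces SSD wear on a CI box where the data
# is throwaway anyway.
```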
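Because the multiplier is not part of the printed reproduce line, you have to append it yourself when reproducing a Linux Jenkins failure locally. A sketch (the test case, method, and seed here are invented for illustration):

```shell
# Hypothetical reproduce line as printed by the build (test name and seed made up):
REPRO='ant test -Dtestcase=TestFoo -Dtests.method=testBar -Dtests.seed=DEADBEEF'
# Append the multiplier manually so the local run matches the Jenkins configuration:
echo "$REPRO -Dtests.multiplicator=3"
```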
The reason it happened more often now is possibly the larger number of CPU cores and therefore more parallel calls to gettimeofday.

I disabled the Solaris builds for a while; they seem to hang more often (again, more parallelization). If somebody wants to look into this, I triggered a stack trace in the logs: it looks like the BSD networking layer causes those "hangs". [https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Solaris/503/console]. But we could also drop the Solaris builds entirely, as they only work with 8.x: Oracle no longer provides Java 11 for Solaris.

Any comments?

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org