Hi,

Since Monday afternoon, Policeman Jenkins has been running on newer hardware
(AMD Ryzen 7 3700X Octa-Core, DDR4 ECC, 2x1 TB NVMe SSD as RAID), again
located at Hetzner. The reason for the change was the old Intel Xeon E3-1275
v5 Quad-Core (Skylake) and the NVMe SSDs, which were reaching the end of
their lifetime (they had already been replaced once by the data center
service a while ago after the first SMART errors). So be aware: Lucene and
Solr tests are eating your SSDs! 😊

One reason for the higher SSD wear was also a "bug" in the original Ubuntu
install image of the older server. The RAID was created back then with Linux
software RAID using the mdadm framework. The partitions were already aligned
to 2048 sectors (the default on many operating systems), but the RAID device
inside the partitions used a stupid alignment for the first data sector
behind the RAID header. As a result, the filesystem device actually in use
did not start at a multiple of 2048 sectors -> this causes additional reads
and writes, because SSDs work with large blocks. On the new server (which was
populated by simply copying the filesystem devices over the network with dd)
I made the RAID alignment correct, see the sketch below. For a few weeks now,
Jenkins has also been using libeatmydata.so in the LD_PRELOAD of all jobs, so
fsync calls are swallowed in userspace before they ever reach the disk. I'd
recommend that to others, because data safety is not an issue for CI builds
(apt install eatmydata, then in the Jenkins node config add an environment
variable through the GUI that is applied to all jobs:
LD_PRELOAD=libeatmydata.so).
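
As a rough illustration of the alignment check - device names like
/dev/nvme0n1 and /dev/md2 are only placeholders here, not the actual layout
of the server:

  # Partition start should be a multiple of 2048 sectors:
  parted /dev/nvme0n1 unit s print

  # First data sector behind the md RAID header (should also end up at a
  # multiple of 2048 sectors):
  mdadm --examine /dev/nvme0n1p3 | grep -i offset

  # When (re)creating the array, the data offset can be forced explicitly:
  mdadm --create /dev/md2 --metadata=1.2 --level=1 --raid-devices=2 \
    --data-offset=128M /dev/nvme0n1p3 /dev/nvme1n1p3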

With the new CPU, builds seem up to 2 times faster. FYI, the Linux builds are
running with tests.multiplicator=3 to better trigger JVM failures. Keep that
in mind when you reproduce failures - it looks like this parameter is not
printed in the reproduce lines, so you have to add it yourself (see the
example below).
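
For example, a reproduce line from a Linux build would need the multiplicator
appended manually - roughly like this (test class, method and seed are just
placeholders):

  ant test -Dtestcase=TestIndexWriter -Dtests.method=testSomething \
    -Dtests.seed=DEADBEEFCAFE -Dtests.multiplicator=3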

There were some false failures during the night from Monday to Tuesday
(caused by the usual gettimeofday() issue on MacOS). This is a bug in the
MacOS kernel affecting Java and all software that calls
nanoTime/currentTimeMillis very often (causing a SIGFPE on "time jumps").
Recent Darwin kernels replaced the assembly code with a safer C version (both
the old and the new code read the time from the special "commpage" memory
that is mapped into every process, so no syscall is needed). So I updated the
MacOS machine from El Capitan to Mojave (no Catalina yet, as 32-bit support
was dropped there). Since then the gettimeofday errors seem to be gone. The
reason it happened more often lately was possibly the larger number of CPU
cores and therefore more parallel calls to gettimeofday.

I disabled the Solaris builds for a while; they seem to hang more often
(probably due to the higher parallelization). If somebody wants to look into
this, I triggered a stack trace in the logs: it looks like the BSD networking
layer causes those "hangs"
[https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Solaris/503/console]. But we
could also drop the Solaris builds entirely, as they only work with 8.x -
there is no Java 11 support on Solaris anymore by Oracle. Any comments?

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


