So, this morning we have been working on validating this hypothesis and making a meaningful and reproducible test for it. I'm just the messenger here because Pablo who has done most of the job is shy.
## The main story Since we are working with concurrency and non-determinism, we have designed a test that reproduces the case with an accuracy of ~94%. Basically the test renders a Rubric morph in concurrent green threads on a big string, and then we compare that the drawn morphs should be bit identical. We have made some measurements and observed that putting three concurrent processes made the chance of hitting the bug to almost 94%. So running it more than ~5 times makes it highly reproducible. Good thing: this test inside a loop of 10 iterations (to increase the probability of hitting the bug to >~99%) detects the bug in the stock image and not in the one with our patch. Bad thing: this test relies on rubric, we have tried to simplify it (like just doing a collect on the string to fetch the glyph for each character), but simpler code produces a tighter loop which makes it far less reproducible (down from 94 to 33%). Still, there are several questions open: - can we simplify the test? :) - we have to think about a smart way to serialize Freetype access in the image (there are several users) - I'll paste below our math, so if somebody would like to review it would be good too, our probability is a bit rusty. - we have been doing benchmarks with Pablo this morning and adding a mutex does not seem to have any negative impact in rendering. Help is super welcome of course!! The patch that Pablo has done in the PR will just work for the most common cases (at least for the calypso tabs), but it shows that FT* needs some love. Still we would like to be able to backport something like this in Pharo7, but backporting a big Freetype refactor would not be so good. Maybe the mutex is a good enough change for Pharo7? Any more ideas? input? Cheers, Guille and Pablo ## The code The code of the test is the following: ``` | sem text canvases blocky | sem := Semaphore new. text := (String loremIpsum: 25*1024). FreeTypeCache current removeAll. canvases := OrderedCollection new. blocky := [ | canvas | canvas := FormCanvas extent: 1000@1000. canvases add: canvas. (RubScrolledTextMorph new) setText: text; font: StandardFonts codeFont; bounds: (0@0 corner: canvas form extent); fullDrawOn: canvas. sem signal ]. blocky forkAt: 39. blocky forkAt: 39. blocky forkAt: 39. sem wait; wait; wait. self assert: (((canvases at:1) form bits = (canvases at:2) form bits) and: [ ((canvases at:2) form bits = (canvases at:3) form bits) ]) ``` ## The math p := 0.94 asScaledDecimal. hittingTheBug := 0. notHittingTheBug := 1 - hittingTheBug. 5 timesRepeat: [ hittingTheBug := hittingTheBug + (notHittingTheBug * p). notHittingTheBug := 1 - hittingTheBug ]. hittingTheBug. "after 5 iterations" "0.99999969s8" On Wed, Apr 10, 2019 at 8:04 PM Sven Van Caekenberghe <s...@stfx.eu> wrote: > Thanks a lot for actually diving into this, this mysterious issue has been > bugging us for a very long time. I do hope you can finally solve this. I am > sure many people will be grateful. > > Sven > > > On 10 Apr 2019, at 15:38, teso...@gmail.com wrote: > > > > Hello, > > > > After checking the problem with Guille, we have the hypothesis of the > > source of the problem. > > We have seen that accessing Free Type is not thread-safe. > > Basically, the FTFace holds a structure that is filled up with the > > glyph and its information. As this structure is part of the Font Face > > (a font face is a font plus size and attributes), only one request to > > a glyph can be executed at the time. As we are sharing the same font > > in many places, you will be starting to see the problem. > > > > Also, we have seen that there many accesses to the glyphs outside the > > UI process. > > This problem started to appear as we starting to use Calypso (but it > > is not limited to it), as Calypso uses lazy tabs. The creation of > > these tabs is done outside the UI process. > > > > This is only a hack to test our hypothesis. > > To fix it correctly we will have to rewrite the drawing of the glyphs > > and synchronize the access to the glyph data information. Also, we > > want to do it in a way that does not penalize the processing in the UI > > process. > > > > I have only patched the code that is used by the rendering of morphic, > > other renderings like Athens or any other user using FT should be > > correctly rewritten. > > > > Once we are sure that synchronizing the access to FT fixes the > > problem, we will do a real fix not a dirty hack like this. > > > > I will be testing this new Image to see if there is an improvement. > > > > I will really love you try to use this image and tell me if you still > > find the problem. > > > > There is a PR generating a Pharo8 image (it is called wrong, but... it > > is a Pharo8) > > > > 32 bits > > ===== > > > > > https://ci.inria.fr/pharo-ci-jenkins2/job/Test%20pending%20pull%20request%20and%20branch%20Pipeline/job/PR-3157/2/artifact/bootstrap-cache/Pharo7.0-PR-32bit-f8c6957.zip > > > > 64 bits > > ===== > > > > > https://ci.inria.fr/pharo-ci-jenkins2/job/Test%20pending%20pull%20request%20and%20branch%20Pipeline/job/PR-3157/2/artifact/bootstrap-cache/Pharo7.0-PR-64bit-f8c6957.zip > > > > If somebody is willing to use it in Pharo 7, I can create a PR against > > Pharo7 to generate patched P7 images. > > > > Thanks! > > > > -- > > Pablo Tesone. > > teso...@gmail.com > > > > > -- Guille Polito Research Engineer Centre de Recherche en Informatique, Signal et Automatique de Lille CRIStAL - UMR 9189 French National Center for Scientific Research - *http://www.cnrs.fr <http://www.cnrs.fr>* *Web:* *http://guillep.github.io* <http://guillep.github.io> *Phone: *+33 06 52 70 66 13