We instrumented the GHC file-locking mechanism as suggested by Viktor and deployed the instrumented GHC in our CI. We got our first crash after four months!
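For context, the instrumentation is essentially a check, on every close, that the fd being closed no longer holds an RTS file lock (the real table is keyed by (dev, ino), per Viktor's note quoted below). The real check is a C wrapper in the RTS around close(2); what follows is only a rough Haskell model of the invariant it enforces, with illustrative names (lockTable, lockFd, unlockFd, closeFd are not GHC API):

    import Control.Concurrent.MVar
    import qualified Data.Map.Strict as Map
    import System.IO.Unsafe (unsafePerformIO)
    import System.Posix.Types (CDev, CIno, Fd)

    -- One entry per fd holding an RTS file lock, carrying the
    -- (dev, ino) identity of the file it locks.
    {-# NOINLINE lockTable #-}
    lockTable :: MVar (Map.Map Fd (CDev, CIno))
    lockTable = unsafePerformIO (newMVar Map.empty)

    -- Recorded when the lock is taken at open time.
    lockFd :: Fd -> CDev -> CIno -> IO ()
    lockFd fd dev ino = modifyMVar_ lockTable (pure . Map.insert fd (dev, ino))

    -- Removed when GHC.IO.FD.release gives the lock back.
    unlockFd :: Fd -> IO ()
    unlockFd fd = modifyMVar_ lockTable (pure . Map.delete fd)

    -- The instrumented close: an fd that still holds a lock at close
    -- time is the invariant violation behind "close: lock exists".
    closeFd :: Fd -> IO ()
    closeFd fd = withMVar lockTable $ \tbl ->
        case Map.lookup fd tbl of
            Just _  -> error ("close: lock exists: fd " ++ show fd)
            Nothing -> pure ()  -- the real wrapper calls close(2) here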
Summary from the preliminary investigation of the crash:

* The problem is not filesystem dependent or inode-reuse related.
* After understanding it better, I am now able to reproduce it on my local machine.
* The problem may be related to hClose getting interrupted by an async exception.

The GHC panic message from the instrumented code is as follows:

    2025-06-27T09:11:41.8177372Z lockFile: first lock: pid 21804, tid 21806, id 17 dev 2065 ino 262152 write 1
    2025-06-27T09:11:41.8178053Z FileSystem.Event: internal error: close: lock exists: pid 21804, tid 21806, fd 17
    2025-06-27T09:11:41.8178492Z
    2025-06-27T09:11:41.8178604Z     (GHC version 9.2.8 for x86_64_unknown_linux)
    2025-06-27T09:11:41.8179160Z     Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug

The message comes from the "close" system call wrapper that I created: it means we are closing the fd without releasing the lock first. Based on the test logs, I determined that the "close" call in the panic message is happening from this code:

    createFile :: FilePath -> FilePath -> IO ()
    createFile file parent =
        openFile (parent </> file) WriteMode >>= hClose

Now the interesting thing is that this particular test runs two threads. In one thread we watch for file system events related to this file, and in the other thread we run the createFile code snippet given above. As soon as the file gets created, the first thread receives a file CREATED event and uses throwTo to send a ThreadAbort async exception to the createFile thread to kill it.

My theory is that if the exception is delivered at a certain point during hClose, hClose closes the file but returns without releasing the lock. To test that theory I rewrote the createFile thread code as shown below, doing the open and close in a loop to make it more likely that the exception interrupts hClose:

    createFile :: FilePath -> FilePath -> IO ()
    createFile file parent = go
      where
        go = do
            h <- openFile (parent </> file) WriteMode
            hClose h
            go

Voila! With this code I am now able to reproduce the problem locally, though it takes many runs (in a loop) and more than a few minutes to reproduce.
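In case anyone wants to try it, here is a rough, self-contained sketch of the shape of the repro. It is a simplification, not our actual test: the file-watcher thread is replaced by a plain timer, and killThread (i.e. ThreadKilled) stands in for the ThreadAbort exception we actually throw. With the instrumented GHC this should eventually trip the "close: lock exists" panic; on a stock GHC I would expect the leaked lock to surface instead as a later openFile failing with "resource busy (file is locked)":

    import Control.Concurrent (forkIO, killThread, threadDelay)
    import Control.Monad (forever)
    import System.FilePath ((</>))
    import System.IO (IOMode (WriteMode), hClose, openFile)

    -- The looped open/close from above.
    createFile :: FilePath -> FilePath -> IO ()
    createFile file parent = go
      where
        go = do
            h <- openFile (parent </> file) WriteMode
            hClose h
            go

    main :: IO ()
    main = forever $ do
        creator <- forkIO (createFile "repro-file" "/tmp")
        -- Let a few open/close iterations run, then kill the thread so
        -- the async exception lands at a random point, sometimes in the
        -- middle of hClose.
        threadDelay 1000
        killThread creator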
Any ideas whether the problem is in hClose, or whether it may be something else? Next, I will review the hClose code and add more instrumentation to narrow down the problem.

Thanks,
Harendra

On Sat, 16 Nov 2024 at 13:02, Viktor Dukhovni <ietf-d...@dukhovni.org> wrote:
>
> On Fri, Nov 15, 2024 at 06:45:40PM +0530, Harendra Kumar wrote:
>
> > Coming back to this issue after a break. I reviewed the code carefully
> > and I cannot find anything where we are doing something in the
> > application code that affects the RTS locking mechanism. Let me walk
> > through the steps of the test up to failure and what we are doing in
> > the code. The test output is like this:
>
> It is indeed not immediately clear where in your code or in some
> dependency (including base, GHC, ...) a descriptor that contributes to
> the RTS file reader/writer count (indexed by (dev, ino)) might be closed
> without adjusting the count by calling the RTS `unlockFile()` function
> (via GHC.IO.FD.release).
>
> It may be worth noting that GHC does not *reliably* prevent simultaneous
> open handles for the same underlying file, because handles returned by
> hDuplicate do not contribute to the count:
>
> demo.hs:
>
>     import GHC.IO.Handle (hDuplicate)
>     import System.IO
>
>     main :: IO ()
>     main = do
>         fh1 <- dupOpen "/tmp/foo"
>         fh2 <- dupOpen "/tmp/foo"
>         writeNow fh1 "abc\n"
>         writeNow fh2 "def\n"
>         readFile "/tmp/foo" >>= putStr
>         hClose fh1
>         hClose fh2
>       where
>         -- Look Mom, no lock!
>         dupOpen path = do
>             fh <- openFile path WriteMode
>             hDuplicate fh <* hClose fh
>
>         writeNow fh s = hPutStr fh s >> hFlush fh
>
>     $ ghc -O demo.hs
>     [1 of 2] Compiling Main             ( demo.hs, demo.o )
>     [2 of 2] Linking demo
>     $ ./demo
>     def
>
> I am not sure that Haskell really should be holding the application's
> hand in this area; corrupting output files by concurrent writers can
> just as easily happen by running two independent processes. But letting
> go of this guardrail would IIRC be a deviation from the Haskell report,
> and there are likely applications that depend on this (and don't use
> hDuplicate or equivalent to break the reader/writer tracking).
>
> --
>     Viktor.