We instrumented the GHC locking mechanism as suggested by Viktor and
deployed the instrumented GHC in CI. We got our first crash after 4 months!

Summary from the preliminary investigation of the crash:

* The problem is not filesystem dependent or inode reuse related.
* After understanding it better, I am now able to reproduce it on my
local machine.
* The problem may be related to hClose getting interrupted by an async
exception.

The GHC panic message from the instrumented code is as follows:

2025-06-27T09:11:41.8177372Z lockFile: first lock: pid 21804, tid 21806, id 17 dev 2065 ino 262152 write 1
2025-06-27T09:11:41.8178053Z FileSystem.Event: internal error: close: lock exists: pid 21804, tid 21806, fd 17
2025-06-27T09:11:41.8178492Z
2025-06-27T09:11:41.8178604Z     (GHC version 9.2.8 for x86_64_unknown_linux)
2025-06-27T09:11:41.8179160Z     Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug

The internal error is coming from the "close" system call wrapper that I
created. The first log line shows lockFile taking the first write lock
for fd 17 (dev 2065, ino 262152); the wrapper then sees close being
called on the same fd while that lock is still registered. In other
words, we are closing the fd without releasing the lock first.
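As Viktor explains below, closing a file Handle is supposed to drop the
RTS (dev, ino) lock by calling unlockFile() (via GHC.IO.FD.release)
before the descriptor goes away; the panic fires because close(2) was
reached while the lock entry was still present.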

Based on the test logs, I figured out that the "close" call in the panic
message originates from this code:

import System.FilePath ((</>))
import System.IO

-- Create an empty file by opening it for writing and closing it right away.
createFile :: FilePath -> FilePath -> IO ()
createFile file parent = openFile (parent </> file) WriteMode >>= hClose

Now the interesting thing is that this particular test runs two
threads. One thread watches for file system events related to this
file, while the other runs the createFile code snippet given above. As
soon as the file gets created, the first thread receives a CREATED
event and uses throwTo to send a ThreadAbort async exception to the
createFile thread to kill it.
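For concreteness, here is a minimal sketch of that shape. createFile is
the snippet above; waitForCreated is a hypothetical polling stand-in
for the real file-system-event subscription, and killThread
(ThreadKilled) stands in for the ThreadAbort exception the test
actually throws:

import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Monad (unless)
import System.Directory (doesFileExist)
import System.FilePath ((</>))
import System.IO

createFile :: FilePath -> FilePath -> IO ()
createFile file parent = openFile (parent </> file) WriteMode >>= hClose

-- Hypothetical stand-in for the watcher thread: polls for the file
-- instead of subscribing to file system events.
waitForCreated :: FilePath -> IO ()
waitForCreated path = do
    exists <- doesFileExist path
    unless exists $ threadDelay 1000 >> waitForCreated path

main :: IO ()
main = do
    let parent = "."          -- hypothetical test directory
        file = "test-file"    -- hypothetical file name
    tid <- forkIO (createFile file parent)
    waitForCreated (parent </> file)
    -- The real test throws a ThreadAbort exception with throwTo;
    -- killThread sends ThreadKilled, which plays the same role here.
    killThread tid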

My theory is that if the exception is delivered at a certain point
during hClose, the file gets closed but hClose returns without
releasing the lock. To test that theory I rewrote the createFile thread
code as shown below, doing the open and close in a loop to make it more
likely that the exception interrupts hClose:

createFile :: FilePath -> FilePath -> IO ()
createFile file parent = go
  where
    -- Open and close repeatedly to widen the window in which the async
    -- exception can land inside hClose.
    go = do
        h <- openFile (parent </> file) WriteMode
        hClose h
        go

Voila! With this code I can now reproduce the problem locally, though
it takes many runs (in a loop) and more than a few minutes.
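If the theory is right, shielding hClose from async exceptions should
make the crash disappear. A minimal sketch of that experiment follows
(a diagnostic only, not a proposed fix; createFileMasked is just my
name for the variant):

import Control.Exception (uninterruptibleMask_)
import System.FilePath ((</>))
import System.IO

createFileMasked :: FilePath -> FilePath -> IO ()
createFileMasked file parent = go
  where
    go = do
        h <- openFile (parent </> file) WriteMode
        -- uninterruptibleMask_ keeps the async exception from being
        -- delivered inside hClose; if the lock leak is caused by such
        -- an interruption, this variant should not trigger the panic.
        uninterruptibleMask_ (hClose h)
        go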

Any ideas on whether the problem is in hClose or whether it may be
something else? Next, I will review the hClose code and add further
instrumentation to narrow down the problem.

Thanks,
Harendra

On Sat, 16 Nov 2024 at 13:02, Viktor Dukhovni <ietf-d...@dukhovni.org> wrote:
>
> On Fri, Nov 15, 2024 at 06:45:40PM +0530, Harendra Kumar wrote:
>
> > Coming back to this issue after a break. I reviewed the code carefully
> > and I cannot find anything where we are doing something in the
> > application code that affects the RTS locking mechanism. Let me walk
> > through the steps of the test up to failure and what we are doing in
> > the code. The test output is like this:
>
> It is indeed not immediately clear where in your code or in some
> dependency (including base, GHC, ...) a descriptor that contributes to
> the RTS file reader/writer count (indexed by (dev, ino)) might be closed
> without adjusting the count by calling the RTS `unlockFile()` function
> (via GHC.IO.FD.release).
>
> It may be worth noting that GHC does not *reliably* prevent simultaneous
> open handles for the same underlying file, because handles returned by
> hDuplicate do not contribute to the count:
>
>     demo.hs:
>         import GHC.IO.Handle (hDuplicate)
>         import System.IO
>
>         main :: IO ()
>         main = do
>             fh1 <- dupOpen "/tmp/foo"
>             fh2 <- dupOpen "/tmp/foo"
>             writeNow fh1 "abc\n"
>             writeNow fh2 "def\n"
>             readFile "/tmp/foo" >>= putStr
>             hClose fh1
>             hClose fh2
>          where
>             -- Look Mom, no lock!
>             dupOpen path = do
>                 fh <- openFile path WriteMode
>                 hDuplicate fh <* hClose fh
>
>             writeNow fh s = hPutStr fh s >> hFlush fh
>
>     $ ghc -O demo.hs
>     [1 of 2] Compiling Main             ( demo.hs, demo.o )
>     [2 of 2] Linking demo
>     $ ./demo
>     def
>
> I am not sure that Haskell really should be holding the application's
> hand in this area, corrupting output files by concurrent writers can
> just as easily happen by running two independent processes.  But letting
> go of this guardrail would IIRC be a deviation from the Haskell report,
> and there are likely applications that depend on this (and don't use
> hDuplicate or equivalent to break the reader/writer tracking).
>
> --
>     Viktor.