Re: [PATCH 1/3] ext2fs: Add JBD2 on-disk layout headers

gfleury Thu, 19 Mar 2026 07:08:16 -0700

Hi,

I'm attempting to build Hurd with the JBD2 journaling patch on AMD64,but I'm encountering a compilation error that I need help resolving.


Here's what I'm doing:

1. apt update
2. apt source hurd
3. cd hurd-0.9.git20251029

4. patch -p1 <../v5-0001-ext2fs-Add-JBD2-journaling-to-ext2-libdiskfs.patch

5. sudo dpkg-buildpackage -us -uc -b

The build fails with this error:

```
../../ext2fs/journal.c: In function 'flush_to_disk':

../../ext2fs/journal.c:592:17: error: implicit declaration of function'store_sync' [-Wimplicit-function-declaration]

592 |   error_t err = store_sync (store);
    |                 ^~~~~~~~~~
```

Interestingly, the store_sync function doesn't seem to exist inhurd-0.9.git20251029, but it is present in the current master branch.Could this be a version compatibility issue with the patch?


Any guidance would be appreciated. Thanks!

Le 2026-03-18 01:04, Milos Nikic a écrit :

Hello again Samuel,

First of all, I want to apologize again for the patch churn over the
past week. I wanted to put this to rest properly, and I am now sending
my final, stable version.

This is it. I have applied numerous fixes, performance tweaks, and
cleanups. I am happy to report that this now performs on par with
unjournaled ext2 on normal workloads, such as configuring/compiling
the Hurd, installing and reinstalling packages via APT, and untarring
large archives (like the Linux kernel). I have also heavily tested it
against artificial stress conditions (which I am happy to share if
there is interest), and it handles highly concurrent loads beautifully
without deadlocks or memory leaks.

Progressive checkpointing ensures the filesystem runs smoothly, and
the feature remains strictly opt-in (until a partition is tuned with
tune2fs -j, the journal is completely inactive).

The new API in libdiskfs is minimal but expressive enough to wrap all
filesystem operations in transactions and handle strict POSIX sync
barriers.

Since v4, I have made several major architectural improvements:

Smart Auto-Commit: diskfs_journal_stop_transaction now automatically
commits to disk if needs_sync has been flagged anywhere in the nested
RPC chain and the reference count drops to zero.

Cleaned ext2 internal Journal API: I have exposed journal_store_write
and journal_store_read as block-device filter layers. Internal state
checks (journal_has_active_transaction, etc.) are now strictly hidden.
How the journal preserves the WAL property is now very obvious, as it
directly intercepts physical store operations.

The "Lifeboat" Cache: Those store wrappers now utilize a small,
temporary internal cache to handle situations where the Mach VM pager
rushes blocks due to memory pressure. The Lifeboat seamlessly
intercepts and absorbs these hazard blocks without blocking the pager
or emitting warnings, even at peak write throughput.

As before, I have added detailed comments across the patch to explain
the state machine and locking hierarchy. I know this is a complex
subsystem, so I am more than happy to write additional documentation
in whatever form is needed.

Once again, apologies for the rapid iterations. I won't be touching
this code further until I hear your feedback.

Kind regards,
Milos
On Sun, Mar 15, 2026 at 9:01 PM Milos Nikic <[email protected]>wrote:
Hi Samuel,
I am writing to sincerely apologize for the insane amount of patchchurn over the last week. I know the rapid version bumps from v2 up tov4 have been incredibly noisy, and I want to hit the brakes before youspend any more time reviewing the current code.
While running some extreme stress tests on a very small ext2 partitionwith the tiniest (allowed by the tooling) journal, I caught a fewcritical edge cases. While fixing those, I also realized that mylibdiskfs VFS API boundary is clunkier than it needs to be. I amcurrently rewriting it to more closely match Linux's JBD2 semantics,where the VFS simply flags a transaction for sync and calls stop,allowing the journal to auto-commit when the reference count drops tozero.
I'm also adding handling for cases where the Mach VM pager rushesblocks to the disk while they are in the process of committing. Thissafely intercepts them and will remove those warnings and WALviolations in almost all cases.
Please completely disregard v4.
I promise the churn is coming to an end. I am going to take a littletime to finish this API contraction, stress-test it, polish it, andmake sure it is 100% rock-solid. I will be back soon with a finalizedv5.
Thanks for your patience with my crazy iteration process!

Best, Milos
On Thu, Mar 12, 2026 at 8:53 AM Milos Nikic <[email protected]>wrote:
Hi Samuel,
As promised, here is the thoroughly tested and benchmarked V4 revisionof the JBD2 Journal for Hurd.
This revision addresses a major performance bottleneck present in V3under heavy concurrent workloads. The new design restores performanceto match vanilla Hurd's unjournaled ext2fs while preserving full crashconsistency.
Changes since V3:
- Removed eager memcpy() from the journal_dirty_block() hot-path.
- Introduced deferred block copying that triggers only when thetransaction becomes quiescent.
- Added a `needs_copy` flag to prevent redundant memory copies.
- Eliminated the severe lock contention and memory bandwidth pressureobserved in V3.
Why the changes in v4 vs v3?
I have previously identified that the last remaining performancebottleneck is memcpy of 4k byte every time journal_dirty_block iscalled. And i was thinking about how to improve it.
Deferred copy comes to mind, But...
The Hurd VFS locks at the node level rather than the physical blocklevel (as Linux does). Because multiple nodes may share the same 4KBdisk block, naively deferring the journal copy until commit time cancapture torn writes if another thread is actively modifying aneighboring node in the same block.
Precisely because of this V3 performed a 4KB memcpy immediately insidejournal_dirty_block() (copy on write) while the node lock was held.While safe, this placed expensive memory operations and global journallock contention directly in the VFS hot-path, causing severe slowdownsunder heavy parallel workloads.
V4 removes this eager copy entirely by leveraging an existingtransaction invariant:All VFS threads increment and decrement the active transaction's`t_updates` counter via the start/stop transaction functions. Atransaction cannot commit until this counter reaches zero.When `t_updates == 0`, we are mathematically guaranteed that no VFSthreads are mutating blocks belonging to the transaction. At that exactmoment, the memory backing those blocks has fully settled and can besafely copied without risk of torn writes. A perfect place for adeferred write!
journal_dirty_block() now simply records the dirty block id in a hashtable, making the hot-path strictly O(1). (and this is why we have anamazing performance boost between v3 and v4)
But we also need to avoiding redundant copies:
Because transactions remain open for several seconds, `t_updates` maybounce to zero and back up many times during a heavy workload (asmultiple VFS threads start/stop the transaction). To avoid repeatedlycopying the same unchanged blocks every time the counter hits zero,each shadow buffer now contains a `needs_copy` flag.
When a block is dirtied, the flag is set. When `t_updates` reacheszero, only buffers with `needs_copy == 1` are copied to the shadowbuffers, after which the flag is cleared.So two things need to be true in order for a block to be copied: 1)t_updates must just hit 0 and 2) needs_copy needs to be 1
This architecture completely removes the hot-path bottleneck. Journaledext2fs now achieves performance virtually identical to vanilla ext2fs,even under brutal concurrency (e.g. scripts doing heavy writes frommultiple shells at the same time).
I know this is a dense patch with a lot to unpack. I've documented thelocking and Mach VM interactions as thoroughly as possible in the codeitself (roughly 1/3 of the lines are comments in ext2fs/journal.c), butI understand there is only so much nuance that can fit into C comments.If it would be helpful, I would be happy to draft a dedicated documentdetailing the journal's lifecycle, its hooks into libdiskfs/ext2, andthe rationale behind the macro-transaction design, so future developershave a clear reference.
Looking forward to your thoughts.

Best,
Milos
On Tue, Mar 10, 2026 at 9:25 PM Milos Nikic <[email protected]>wrote:
Hi Samuel,

Just a quick heads-up: please hold off on reviewing this V3 series.
While V3 version works fast for simple scenarios in single threadedsituations (like configure or make ext2fs etc) I fund that whilerunning some heavy multi-threaded stress tests on V3, a significantperformance degradation happens due to lock contention bottleneckcaused by the eager VFS memcpy hot-path. (memcpy insidejournal_dirty_block which is called 1000s of time a second reallybecomes a performance problem.)
I have been working on a much cleaner approach that safely defers theblock copying to the quiescent transaction stop state. It completelyeliminates the VFS lock contention and brings the journaled performanceback to vanilla ext2fs levels even with many threads competing atwriting/reading/renaming in the same place.
I am going to test this new architecture thoroughly over the next fewdays and will send it as V4 once I am certain it is rock solid.
Thanks!
On Mon, Mar 9, 2026 at 12:15 PM Milos Nikic <[email protected]>wrote:
Hello Samuel and the Hurd team,
I am sending over v3 of the journaling patch. I know v2 is stillpending review, but while testing and profiling based on previousfeedback, I realized the standard mapping wasn't scaling well formetadata-heavy workloads. I wanted to send this updated architectureyour way to save you from spending time reviewing the obsolete v2 code.
This version keeps the core JBD2 logic from v2 but introduces severalstructural optimizations, bug fixes, and code cleanups:- Robin Hood Hash Map: Replaced ihash with a custom map forsignificantly tighter cache locality and faster lookups.- O(1) Slab Allocator: Added a pre-allocated pool to make transactionbuffers zero-allocation in the hot path.- Unified Buffer Tracking: Eliminated the dual linked-list/mapstructure in favor of just the map, fixing a synchronization bug fromv2 and simplifying the code.
- Few other small bug fixes
- Refactored Dirty Block Hooks: Moved the journal_dirty_block callsfrom inode.c directly into the ext2fs.h low-level block computationfunctions (record_global_poke, sync_global_ptr, record_indir_poke, andalloc_sync). This feels like a more natural fit and makes it mucheasier to ensure we aren't missing any call sites.
Performance Benchmarks:
I ran repeated tests on my machine to measure the overhead, comparingthis v3 journal implementation against Vanilla Hurd.
make ext2fs (CPU/Data bound - 5 runs):
Vanilla Hurd Average: ~2m 40.6s
Journal v3 Average: ~2m 41.3s
Result: Statistical tie. Journal overhead is practically zero.

make clean && ../configure (Metadata bound - 5 runs):
Vanilla Hurd Average: ~3.90s (with latency spikes up to 4.29s)
Journal v3 Average: ~3.72s (rock-solid consistency, never breaking3.9s)Result: Journaled ext2 is actually faster and more predictable here dueto the WAL absorbing random I/O.
Crash Consistency Proof:
Beyond performance, I wanted to demonstrate the actual crash recoveryin action.
Boot Hurd, log in, create a directory (/home/loshmi/test-dir3).
Wait for the 5-second kjournald commit tick.
Hard crash the machine (kill -9 the QEMU process on the host).
Inspecting from the Linux host before recovery shows the inode iscompletely busted (as expected): sudo debugfs -R "stat/home/loshmi/test-dir3" /dev/nbd0 debugfs 1.47.3 (8-Jul-2025)
Inode: 373911   Type: bad type    Mode:  0000   Flags: 0x0
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 0
File ACL: 0 Translator: 0
Links: 0   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x00000000 -- Wed Dec 31 16:00:00 1969
atime: 0x00000000 -- Wed Dec 31 16:00:00 1969
mtime: 0x00000000 -- Wed Dec 31 16:00:00 1969
BLOCKS:
Note: On Vanilla Hurd, running fsck here would permanently lose thedirectory or potentially cause further damage depending on luck.
Triggering the journal replay:
sudo e2fsck -fy /dev/nbd0
Inspecting immediately after recovery: sudo debugfs -R "stat/home/loshmi/test-dir3" /dev/nbd0 debugfs 1.47.3 (8-Jul-2025)
Inode: 373911   Type: directory    Mode:  0775   Flags: 0x0
Generation: 1773077012    Version: 0x00000000
User:  1001   Group:  1001   Size: 4096
File ACL: 0 Translator: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
atime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
mtime: 0x69af0213 -- Mon Mar  9 10:23:31 2026
BLOCKS:
(0):1507738
TOTAL: 1
The journal successfully reconstructed the directory, and logdumpconfirms the transactions were consumed perfectly.
I have run similar hard-crash tests for rename, chmod, and chown etcwith the same successful recovery results.
I've attached the v3 diff. Let me know what you think of the new hashmap and slab allocator approach!
Best,
Milos
On Fri, Mar 6, 2026 at 10:06 PM Milos Nikic <[email protected]>wrote:
And here is the last one...
I hacked up an improvement for journal_dirty_block to try and see if icould speed it up a bit.1) Used specialized robin hood based hash table for speed (notombstones etc) (I took it from one of my personal projects....justspecialized it here a bit)
2) used a small slab allocator to avoid malloc-ing in the hot path
3) liberally sprinkled __rdtsc() to get a sense of cycle time insidejournal_dirty_block
Got to say, just this simple local change managed to shave off 3-5% ofslowness.
So my test is:
- Boot Hurd
- Inside Hurd go to the Hurd build directory
- run:
$ make clean && ../configure
$ time make ext2fs

I do it multiple times for 3 different versions of ext2 libraries
1) Vanilla Hurd (No Journal): ~avg, 151 seconds

2) Enhanced JBD2 (Slab + Custom Hash): ~159 seconds (5% slower!)

3) Baseline JBD2 (malloc + libihash what was sent in V2): ~168 seconds
Of course there is a lot of variability, and my laptop is not a perfectenvironment for these kinds of benchmarks, but this is what i have.
My printouts on the screen show this:
ext2fs: part:5:device:wd0: warning: === JBD2 STATS ===
ext2fs: part:5:device:wd0: warning: Total Dirty Calls:      339105
ext2fs: part:5:device:wd0: warning: Total Function: 217101909cyclesext2fs: part:5:device:wd0: warning: Total Lock Wait: 16741691cyclesext2fs: part:5:device:wd0: warning: Total Alloc: 673363cyclesext2fs: part:5:device:wd0: warning: Total Memcpy: 137938008cyclesext2fs: part:5:device:wd0: warning: Total Hash Add: 258533cyclesext2fs: part:5:device:wd0: warning: Total Hash Find: 29501960cyclesext2fs: part:5:device:wd0: warning: --- AVERAGES (Amortized per call)---
ext2fs: part:5:device:wd0: warning: Avg Function Time: 640 cycles
ext2fs: part:5:device:wd0: warning: Avg Lock Wait:     49 cycles
ext2fs: part:5:device:wd0: warning: Avg Memcpy:        406 cycles
ext2fs: part:5:device:wd0: warning: Avg Malloc 1:      1 cycles
ext2fs: part:5:device:wd0: warning: Avg Hash Add:      0 cycles
ext2fs: part:5:device:wd0: warning: Avg Hash Find:      86 cycles
ext2fs: part:5:device:wd0: warning: ==================
Averages here say a lot...with these improvements we are now down tobasically Memcpy time...and for copying 4096 bytes of ram Im not surewe can make it take less than 400 cycles...so we are hitting hardwarelimitations.It would be great if we could avoid memcpy here altogether or delay ituntil commit or similar, and i have some ideas, but they all requiredrastic changes across libdiskfs and ext2fs, not sure if a fewremaining percentage points of slowdown warrant that.
Also, wow during ext2 compilation...this function (journal_dirty_block)is being called a bit more than 1000 times per second (for each andevery block that is ever being touched by the compiler)
I am attaching here the altered journal.c with these changes if one isinterested in seeing the localized changes.
Regards,
Milos
On Fri, Mar 6, 2026 at 11:09 AM Milos Nikic <[email protected]>wrote:
Hi Samuel,
One quick detail I forgot to mention regarding the performanceanalysis:
The entire ~0.4s performance impact I measured is isolated exclusivelyto journal_dirty_block.
To verify this, I ran an experiment where I stubbed outjournal_dirty_block so it just returned immediately (which obviouslymakes for a very fast, but not very useful, journal!). With that singlefunction bypassed, the filesystem performs identically to vanilla Hurd.
This confirms that the background kjournald flusher, the transactionreference counting, and the checkpointing logic add absolutely nonoticeable latency to the VFS. The overhead is strictly tied to thephysics of the memory copying and hashmap lookups in that one blockwhich we can improve in subsequent patches.
Thanks, Milos
On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]>wrote:
Hi Samuel,
Thanks for reviewing my mental model on V1; I appreciate the detailedfeedback.
Attached is the v2 patch. Here is a breakdown of the architecturalchanges and refactors based on your review:
1. diskfs_node_update and the Pager
Regarding the question, "Do we really want to update the node?": Yes,we must update it with every change. JBD2 works strictly at thephysical block level, not the abstract node cache level. To capture anode change in the journal, the block content must be physicallyserialized to the transaction buffer. Currently, this path isdiskfs_node_update -> diskfs_write_disknode -> journal_dirty_block.When wait is 0, this just copies the node details from the node-cacheto the pager. It is strictly an in-memory serialization and isextremely fast. I have updated the documentation for diskfs_node_updateto explicitly describe this behavior so future maintainers understandit isn't triggering synchronous disk I/O and doesn't measurablyincrease the latency of the file system.journal_dirty_block is not one of the most hammered functions inlibdiskfs/ext2 and more on that below.
2. Synchronous Wait & Factorization
I completely agree with your factorization advice:write_disknode_journaled has been folded directly intodiskfs_write_disknode, making it much cleaner.Regarding the wait flag: we are no longer ignoring it! Instead ofblocking the VFS deeply in the stack, we now set an "IOU" flag on thetransaction. This bubbles the sync requirement up to the outer RPClayer, which is the only place safe enough to actually sleep on thecommit and thus maintain the POSIX sync requirement without deadlockingetc.
3. Multiple Writes to the Same Metadata Block
"Can it happen that we write several times to the same metadata block?"Yes, multiple nodes can live in the same block. However, because theMach pager always flushes the "latest snapshot" of the block, we don'thave an issue with mixed or stale data hitting the disk.If RPCs hit while pager is actively writing that is all captured in the"RUNNING TRANSACTION". If it happens that that RUNNING TRANSACTION hasthe same blocks pager is committing RUNNING TRANSACTION will beforcebliy committed.
4. The New libdiskfs API
I added two new opaque accessors to diskfs.h:

diskfs_journal_set_sync
diskfs_journal_needs_sync
This allows inner nested functions to declare a strict need for a POSIXsync without causing lock inversions. We only commit at the top RPClayer once the operation is fully complete and locks are dropped.
5. Cleanups & Ordering
Removed the redundant record_global_poke calls.
Reordered the pager write notification in journal.c to sit after thecommitting function, as the pager write happens after the journalcommit.Merged the ext2_journal checks inside diskfs_journal_start_transactionto return early.
Reverted the bold unlock moves.
Fixed the information leaks.
Elevated the deadlock/WAL bypass logs to ext2_warning.

Performance:
I investigated the ~0.4s (increase from 4.9s to 5.3s) regression on mySSD during a heavy Hurd ../configure test. By stubbing outjournal_dirty_block, performance returned to vanilla Hurd speeds,isolating the overhead to that specific function.
A nanosecond profile reveals the cost is evenly split across themandatory physics of a block journal:
25%: Lock Contention (Global transaction serialization)

22%: Memcpy (Shadowing the 4KB blocks)

21%: Hash Find (hurd_ihash lookups for block deduplication)
I was surprised to see hurd_ihash taking up nearly a quarter of theoverhead. I added some collision mitigation, but left a furtherimprovements of this patch to keep the scope tight. In the future, wecould drop the malloc entirely using a slab allocator and optimize thehashmap to get this overhead closer to zero (along with introducing a"frozen data" concept like Linux does but that would be a bigger nonlocalized change).
Final Note on Lock Hierarchy
The intended, deadlock-free use of the journal in libdiskfs is bestillustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h
txn = diskfs_journal_start_transaction ();
pthread_mutex_lock (&np->lock);
(OPERATION);
diskfs_node_update (np, diskfs_synchronous);
pthread_mutex_unlock (&np->lock);
if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
diskfs_journal_commit_transaction (txn);
else
diskfs_journal_stop_transaction (txn);
By keeping journal operations strictly outside of the nodelocking/unlocking phases, we treat it as the outermost "lock" on thefile system, mathematically preventing deadlocks.
Kind regards,
Milos
On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault<[email protected]> wrote:
Hello,
Milos Nikic, le jeu. 05 mars 2026 09:31:26 -0800, a ecrit: Hurd VFSworks in 3 layers:
1. Node cache layer: The abstract node lives here and it is the groundtruth
of a running file system. When one does a stat myfile.txt, we get the
information straight from the cache. When we create a new file, it gets
placed in the cache, etc.
2. Pager layer: This is where nodes are serialized into the actualphysical
representation (4KB blocks) that will later be written to disk.
3. Hard drive: The physical storage that receives the bytes from thepager.
During normal operations (not a sync mount, fsync, etc.), the VFSoperatesalmost entirely on Layer 1: The Node cache layer. This is why it'ssuper fast.User changed atime? No problem. It just fetches a node from the nodecache(hash table lookup, amortized to O(1)) and updates the struct inmemory. And
that is it.
Yes, so that we get as efficient as possible.
Only when the sync interval hits (every 30 seconds by default) does theNodecache get iterated and serialized to the pager layer(diskfs_sync_everything ->
write_all_disknodes -> write_node -> pager_sync). So basically, at that
moment, we create a snapshot of the state of the node cache and placeit onto
the pager(s).
It's not exactly a snapshot because the coherency between inodes and
data is not completely enforced (we write all disknodes before asking
the kernel to write back dirty pages, and then poke the writes).
Even then, pager_sync is called with wait = 0. It is handed to thepager, whichsends it to Mach. At some later time (seconds or so later), Mach sendsit backto the ext2 pager, which finally issues store_write to write it toLayer 3 (TheHard drive). And even that depends on how the driver reorders or delaysit.
The effect of this architecture is that when store_write is finallycalled, theabsolute latest version of the node cache snapshot is what gets writtento the
storage. Is this basically correct?
It seems to be so indeed.

Are there any edge cases or mechanics that are wrong in this model
that would make us receive a "stale" node cache snapshot?
Well, it can be "stale" if another RPC hasn't called
diskfs_node_update() yet, but that's what "safe" FS are all about: not
actually provide more than coherency of the content on the disk so fsck
is not suppposed to be needed. Then, if a program really wantscoherency
between some files etc. it has to issue sync calls, dpkg does it for
instance.

Samuel

Re: [PATCH 1/3] ext2fs: Add JBD2 on-disk layout headers

Reply via email to