Since my last email I have made pretty decent progress and am overall pleased with the results: I have achieved a 2-3 fold decrease in boot time, though at the cost of increased memory usage. Please see my detailed results and some comments below. I also think there is still room for improvement, especially regarding memory usage.
But first I wanted to describe some improvements made to the layout of data on disk (I think it deserves to be called mfs2, or at least that is my working name ;-). Here is the new layout, in the order stored on disk:

1. Super block (512 bytes) that contains the magic number and meta information, including the block size and the location and size of the tables containing i-nodes, directory entries and symbolic links
2. File data, where each file is padded to a 512-byte block boundary
3. Table of directory entries, referenced by index from the directory i-node (each entry holds a string with the directory entry name and an i-node number)
4. Table of symlinks, referenced by the symlink i-node (each entry holds the symbolic link path string)
5. Table of i-nodes, where each i-node specifies its type (dir, file, symlink) and a data offset (for files it is a block on disk, for symlinks and directories it is an offset into one of the 2 tables above)

So when OSv mounts mfs2 (mfs2_mount), it reads the super block in the first disk read and then all 3 tables (i-nodes, directory entries and symlinks) at once into memory in the second disk read. That way all structural data stays in memory and can be accessed by the corresponding vnops calls (readdir, lookup, etc.) without having to access the disk. Time-wise it is a small improvement over James Root's initial mfs in terms of file system speed, but it makes many things easier and cleaner.

To speed up mfs2_read I implemented a simple "read-ahead" cache-like layer that loads file data on demand in 16K segments and stores them in a hashmap for any subsequent read. Small files (less than 16K) are loaded in full in one shot. In essence it works somewhat like memory-mapping a file: each file is logically divided into 16K segments, and when the file is accessed at a given offset the corresponding segment key is calculated (file offset / 16K); on a hit the data is copied from memory, and on a miss the whole segment is read from disk and put in the hashmap (see the sketch below). Obviously I had to add a little bit of thread synchronization logic to make it all thread safe. I tried to keep lock contention as low as possible by using mutexes at the segment level when reading data from disk.

For now, file segments loaded into memory stay there forever, but only the parts of a file that are actually needed are loaded from disk, as opposed to the current ramdisk implementation. It should not be difficult to add some cache eviction thread that would evict unused segments after some time, which would be beneficial for any stateless apps that keep running for longer.

I also experimented with trying to detect sequential reads of a file and then loading it in larger segments (64K), hoping that reading more from disk per request (and making fewer exits to the host) would speed things up even more. Unfortunately I did not see much of an improvement in speed, even though I saw a better hit/miss ratio (80% vs 60%) on average. I was also disappointed to observe that bumping the file segment size to 32K or 64K did not speed things up much (maybe 5%), at the cost of even more wasted cache memory. Which is weird, because my experiments with copying files using dd indicated that 64K or even 128K should yield the best read speed on the host. Is there some kind of speed limit on data transfer between host and guest on qemu? Performance tests with ramdisk images show that reading 20MB from disk and uncompressing it in real mode takes on average 100ms, whereas reading 16MB from disk in 16K chunks in my mfs2 implementation takes on average 250ms.

To test how it performs I used two test images: a java 9 hello world app and a simple nodejs app that lists a directory and executes some simple calculations.
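To make the layout description above more concrete, here is a rough sketch of the on-disk metadata as structs. This is not the actual code from my patch: the field names and exact widths are just my shorthand for what the super block and i-node table carry.

    #include <cstdint>

    // Illustrative only: field names and widths are my shorthand,
    // not the real mfs2 structures.
    struct mfs2_super_block {
        uint64_t magic;               // identifies an mfs2 image
        uint64_t block_size;          // 512 in the layout described above
        uint64_t inode_table_block;   // location of the i-node table...
        uint64_t inode_count;         // ...and its size
        uint64_t dentry_table_block;  // location/size of the directory-entry table
        uint64_t dentry_count;
        uint64_t symlink_table_block; // location/size of the symlink table
        uint64_t symlink_count;
    };

    struct mfs2_inode {
        uint64_t type;        // dir, file or symlink
        uint64_t data_offset; // block on disk for files; index into the
                              // dentry/symlink tables for dirs and symlinks
    };

And here is a minimal sketch of the 16K segment cache lookup, written as plain standard C++ (std::mutex/std::lock_guard instead of OSv's locking primitives, and a stand-in read_segment_from_disk() with a fake in-memory "disk" in place of the real device strategy call), just to illustrate the hit/miss path and the per-segment locking. The real mfs2_read() of course fills a uio rather than a raw buffer, but the flow is the same idea.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <mutex>
    #include <unordered_map>
    #include <vector>

    constexpr uint64_t segment_size = 16 * 1024;

    // Stand-in "disk" so the sketch is self-contained; the real code issues
    // a read through the device strategy routine instead.
    std::vector<char> fake_disk(1 << 20, 'x');

    std::vector<char> read_segment_from_disk(uint64_t offset, uint64_t len)
    {
        if (offset >= fake_disk.size()) {
            return {};
        }
        const uint64_t end = std::min<uint64_t>(offset + len, fake_disk.size());
        return std::vector<char>(fake_disk.data() + offset, fake_disk.data() + end);
    }

    struct file_segment {
        std::mutex lock;        // per-segment lock so a miss on one segment
                                // does not block readers of other segments
        bool loaded = false;
        std::vector<char> data; // up to 16K of file data
    };

    // One cache per file (i-node); the key is the segment index (offset / 16K).
    struct segment_cache {
        std::mutex map_lock;
        std::unordered_map<uint64_t, std::shared_ptr<file_segment>> segments;
    };

    // Copy up to "len" bytes of the file starting at "offset" into "buf",
    // reading the containing 16K segment from disk only on a cache miss.
    // For simplicity this sketch assumes the request does not cross a
    // segment boundary; the real read path would loop over segments.
    size_t cached_read(segment_cache& cache, uint64_t offset, char* buf, size_t len)
    {
        const uint64_t key = offset / segment_size;

        std::shared_ptr<file_segment> seg;
        {
            // Short critical section: only the hashmap lookup/insert.
            std::lock_guard<std::mutex> guard(cache.map_lock);
            auto& slot = cache.segments[key];
            if (!slot) {
                slot = std::make_shared<file_segment>();
            }
            seg = slot;
        }

        {
            // The disk read happens under the per-segment lock.
            std::lock_guard<std::mutex> guard(seg->lock);
            if (!seg->loaded) {
                seg->data = read_segment_from_disk(key * segment_size, segment_size);
                seg->loaded = true;
            }
        }

        const uint64_t off_in_seg = offset % segment_size;
        if (off_in_seg >= seg->data.size()) {
            return 0; // reading past the end of the file
        }
        const size_t n = std::min<size_t>(len, seg->data.size() - off_in_seg);
        std::memcpy(buf, seg->data.data() + off_in_seg, n);
        return n;
    }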
Each of the two I tested in 5 filesystem configurations:
- ramfs
- zfs
- mfs - the original James Root implementation
- my original mfs2 implementation without cache (mfs2_read loads as much data from disk as specified in the uio, typically 4K reads)
- mfs2 with cache as described above

Results:

*ramfs java*
---------------------
<..skipped..>
disk read (real mode): 97.03ms, (+97.03ms)
.init functions: 170.98ms, (+73.96ms)
SMP launched: 171.66ms, (+0.67ms)
VFS initialized: 179.72ms, (+8.07ms)
Network initialized: 180.11ms, (+0.38ms)
pvpanic done: 180.37ms, (+0.26ms)
pci enumerated: 184.97ms, (+4.60ms)
drivers probe: 184.97ms, (+0.00ms)
drivers loaded: 233.71ms, (+48.74ms)
ZFS mounted: 237.52ms, (+3.81ms) // Time point captured before file system is mounted
Total time: 242.95ms, (+5.43ms)
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
random: device unblocked.
Hello, World!
java main took 105.96ms
app main took 110.37 ms // Time point is captured before app main
[I/0 dhcp]: Unicasting DHCPRELEASE message with xid: [627516448] from client: 192.168.122.15 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
from pre-VFS mount until post-VFS unmount took *122.76 ms*

*ramfs node*
---------------------
<..skipped..>
Chdir to: '/js'
chdir done
disk read (real mode): 73.28ms, (+73.28ms)
.init functions: 149.10ms, (+75.82ms)
SMP launched: 149.75ms, (+0.65ms)
VFS initialized: 155.50ms, (+5.74ms)
Network initialized: 156.00ms, (+0.51ms)
pvpanic done: 156.14ms, (+0.14ms)
pci enumerated: 160.83ms, (+4.69ms)
drivers probe: 160.83ms, (+0.00ms)
drivers loaded: 219.33ms, (+58.50ms)
ZFS mounted: 223.21ms, (+3.89ms)
Total time: 228.70ms, (+5.48ms)
random: blocking on read.
random: device unblocked.
Adding 5 to 10 gives us 15
Finished
program exited with status 0
app main took 153.25 ms
[I/54 dhcp]: Unicasting DHCPRELEASE message with xid: [313573304] from client: 192.168.122.15 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
from pre-VFS mount until post-VFS unmount took *165.57 ms*

*zfs java*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 212.75ms
app main took 216.78 ms
from pre-VFS mount until post-VFS unmount took *6580.25 ms*

*zfs node*
---------------------
<..skipped>
disk read (real mode): 87.11ms, (+87.11ms)
.init functions: 109.59ms, (+22.48ms)
SMP launched: 110.30ms, (+0.70ms)
VFS initialized: 112.25ms, (+1.95ms)
Network initialized: 112.49ms, (+0.24ms)
pvpanic done: 112.60ms, (+0.12ms)
pci enumerated: 117.00ms, (+4.39ms)
drivers probe: 117.00ms, (+0.00ms)
drivers loaded: 160.96ms, (+43.96ms)
ZFS mounted: 1868.34ms, (+1707.38ms)
Total time: 1873.90ms, (+5.56ms)
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 225.80ms
app main took 233.77 ms
[I/0 dhcp]: Unicasting DHCPRELEASE message with xid: [766543878] from client: 192.168.122.15 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
from pre-VFS mount until post-VFS unmount took *7107.37 ms*

*mfs java*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 3205.35ms
app main took 3240.97 ms
Spent 3590 ms reading blocks
Read 26591 blocks
from pre-VFS mount until post-VFS unmount took *3244.95 ms*

*mfs node*
---------------------
Adding 5 to 10 gives us 15
Finished
app main took 3308.04 ms
Spent 3202 ms reading blocks
Read 25672 blocks
from pre-VFS mount until post-VFS unmount took *3312.35 ms*

*mfs2 java without cache*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 653.19ms
app main took 663.39 ms
Spent 617 ms reading blocks
Read 26316 blocks
Allocated 0 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *667.82 ms*

*mfs2 node without cache*
---------------------
Adding 5 to 10 gives us 15
Finished
app main took 596.32 ms
Spent 515 ms reading blocks
Read 25592 blocks
Allocated 0 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *600.37 ms*

*mfs2 java with cache*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 328.95ms
app main took 333.29 ms
Spent 243 ms reading blocks
Read 32473 blocks
Allocated 32642 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *337.70 ms*

*mfs2 node with cache*
---------------------
Adding 5 to 10 gives us 15
Finished
app main took 284.92 ms
Spent 204 ms reading blocks
Read 30750 blocks
Allocated 30787 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *289.07 ms*

Given that I have a decent but pretty dated laptop (a late 2013 MacBook Pro with an SSD, running Ubuntu; please see the specs below), I am interested to see if others see similar or better mfs2 results.

Model Name: MacBook Pro
Model Identifier: MacBookPro11,3
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB

I can send a working patch if anyone wants to try it.

Here is a list of improvements to mfs I can think of:
- add logic to the cache to cap memory usage (some kind of LRU) and evict unused segments from memory after some time
- add file compression (how would it look on disk?)
- implement vop_cache to eliminate an unnecessary memory copy

Waldek

PS. Lastly, does anyone have a better name than "mfs2"?

On Wednesday, November 22, 2017 at 4:09:20 PM UTC-5, Nadav Har'El wrote:
>
> On Wed, Nov 22, 2017 at 6:43 PM, Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
>
>> I think that rather than implementing your suggestion, which might not be all that trivial, I would rather implement some kind of proper LRU based solution which would work like so:
>
> This can work. Interestingly, in the first patch which James Root sent, he did have such an LRU cache, of 512K.
> It had some problems with too much locking, but should probably work ok in your single-threaded case.
> He explained why he removed the cache in his comments when he sent the second version of his patch.
>
> Looking back at my review of his patch, I noted that in fs/vfs/vfs_fops.cc, vfs_file::mmap(), you have two options:
> 1. implement the vop_cache function, in which case mmap will take pages directly from your cache (not making another copy)
> 2. if you don't implement this, map_file_page_read (core/mmu.cc) is used, and this indeed just calls read() which will not bypass your cache.
> Perhaps with this "vop_cache" thing we can avoid the extra copy your code is doing. But it's probably negligible (copying 17MB should take just milliseconds) so we can ignore this for now.
>
>> - Every time mfs_read() is called it would try to find the corresponding buffer in its LRU cache (the i-node is the key)
>> - If found, it would copy from the memory buffer in the cache
>
> I guess you also mean you need to save and check also the file position of this saved buffer, not just the i-node?
>
>> - Otherwise it would read data from disk like so:
>>   - if the file size in blocks for this i-node is less than or equal to some pre-determined maximum buffer size (64K/128K), it would read only enough blocks to load the entire file into memory and add the buffer to the LRU cache
>>   - otherwise it would load the maximum buffer size of blocks from disk and add that to the LRU cache
>>     - if more blocks are needed to satisfy the same read, it would again load the remaining file blocks (or the maximum buffer size of blocks) from disk and replace the last buffer with the new one
>> - There would be a total maximum on the cache - like 1MB
>>   - Old segments would be dropped if there is not enough room
>> - After some predetermined time each i-node buffer would be freed from the cache, so eventually memory usage would go to 0 if the service is stateless
>>
>> This is just a sketch of a solution as I am not sure of the details. I might also be underestimating how difficult it is to implement. My goal is to improve mfs to make boot time and loading of read-once files as fast as possible. In general it should have good performance for contiguous file reads. I am not too worried about random file access.
>>
>> Please let me know what you think. Also, is there a generic LRU C++ template in the OSv code that I could re-use for something like this?
>
> I don't recall one. You can put your hash-table values (the buffers) in a std::list or boost::intrusive::list and every time you find a buffer, move it to the end of the list.
> James Root used "next" and "prev" pointers and C-style code in his mfs_cache.hh/cc.
>
>> On Wednesday, November 22, 2017 at 1:42:01 AM UTC-5, Nadav Har'El wrote:
>>>
>>> On Wed, Nov 22, 2017 at 8:01 AM, Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
>>>
>>>> I managed to implement my change to make mfs_read() use the device strategy routine directly to bulk-read however many blocks at once the uio parameter to mfs_read() specifies, and as expected the java run now takes ~850 ms (down from 3500-4500ms). Which means it still spends ~600 ms reading ~14MB of data.
>>>>
>>>> So in general my thinking was correct.
>>>
>>> Excellent!
>>>
>>> Another quick hack you could try without diving too deep into the paging code is to have mfs_read() read 4*4096 bytes (or whatever) when asked for just 4096 bytes, and save the unrequested bytes in a memory buffer; the next time mfs_read() is called, if it is asking for the next page, we already have it waiting in the buffer and can copy (unfortunately) from it instead of starting another read.
>>> If it is not the bytes wanted, we wasted this read-ahead and drop the buffer.
>>> Of course it's not a general and good cache like ZFS (it's also not a *cache* at all; if the same page is read multiple times with read() we read it again...), but it might be good enough for improving boot time.
>>>
>>> Nadav.