Since my last email I have made pretty decent progress and am overall pleased with the results: I have achieved a 2-3 fold decrease in boot time, though at the cost of increased memory usage. Please see my detailed results and some comments below. I also think there is still room for improvement, especially regarding memory usage.
But first I wanted to describe some improvements made to the layout of data on disk (I think it deserves to be called mfs2, or at least that is my working name ;-). Here is the new layout, in the order stored on disk:

1. Super block (512 bytes) that contains the magic number and meta information, including the block size and the location and size of the tables containing i-nodes, directory entries and symbolic links
2. File data, where each file is padded to a 512-byte block boundary
3. Table of directory entries, referenced by index from the directory i-node (each entry holds a string with the directory entry name and an i-node number)
4. Table of symlinks, referenced by the symlink i-node (each entry holds the symbolic link path string)
5. Table of i-nodes, where each i-node specifies its type (dir, file, symlink) and a data offset (for files it is a block on disk, for symlinks and directories it is an offset into one of the 2 tables above)

So when OSv mounts mfs2 (mfs2_mount), it reads the super block in the first disk read and then all 3 tables (i-nodes, directory entries and symlinks) at once into memory in the second disk read. That way all structural data stays in memory and can be accessed by the corresponding vnops calls (readdir, lookup, etc.) without having to access the disk. Time-wise it is a small improvement over James Root's initial mfs in terms of file system speed, but it makes many things easier and cleaner.

To speed up mfs2_read I implemented a simple "read-ahead" cache-like layer that loads file data on demand in 16K segments and stores them in a hashmap for any subsequent read. Small files (less than 16K) are loaded in full in one shot. In essence it works somewhat like memory-mapping a file: each file is logically divided into 16K segments, and when the file is accessed at a given offset the corresponding segment key is calculated (file offset / 16K); on a hit the data is copied from memory, and on a miss the whole segment is read from disk and put in the hashmap (see the sketch below). Obviously I had to add a little bit of thread synchronization logic to make it all thread safe. I tried to keep lock contention as low as possible by using mutexes at the segment level when reading data from disk.

For now, file segments loaded into memory stay there forever, but only the parts of a file that are actually needed are loaded from disk, as opposed to the current ramdisk implementation. It should not be difficult to add some cache eviction thread that would evict unused segments after some time, which would be beneficial for any stateless apps that keep running for longer.

I also experimented with trying to detect sequential reads of a file and then loading it in larger segments (64K), hoping that reading more from disk per request (and making fewer exits to the host) would speed things up even more. Unfortunately I did not see much of an improvement in speed, even though I saw a better hit/miss ratio (80% vs 60%) on average. I was also disappointed to observe that bumping the file segment size to 32K or 64K did not speed things up much (maybe 5%), at the cost of even more wasted cache memory. Which is weird, because my experiments with copying files using dd indicated that 64K or even 128K should yield the best read speed on the host. Is there some kind of speed limit on data transfer between host and guest on qemu? Performance tests with ramdisk images show that reading 20MB from disk and uncompressing it in real mode takes on average 100ms, whereas reading 16MB from disk in 16K chunks in my mfs2 implementation takes on average 250ms.

To test how it performs I used two test images: a java 9 hello world app and a simple nodejs app that lists a directory and executes some simple calculations.
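To make the layout description above more concrete, here is a rough sketch of the on-disk metadata as structs. This is not the actual code from my patch: the field names and exact widths are just my shorthand for what the super block and i-node table carry.

    #include <cstdint>

    // Illustrative only: field names and widths are my shorthand,
    // not the real mfs2 structures.
    struct mfs2_super_block {
        uint64_t magic;               // identifies an mfs2 image
        uint64_t block_size;          // 512 in the layout described above
        uint64_t inode_table_block;   // location of the i-node table...
        uint64_t inode_count;         // ...and its size
        uint64_t dentry_table_block;  // location/size of the directory-entry table
        uint64_t dentry_count;
        uint64_t symlink_table_block; // location/size of the symlink table
        uint64_t symlink_count;
    };

    struct mfs2_inode {
        uint64_t type;        // dir, file or symlink
        uint64_t data_offset; // block on disk for files; index into the
                              // dentry/symlink tables for dirs and symlinks
    };

And here is a minimal sketch of the 16K segment cache lookup, written as plain standard C++ (std::mutex/std::lock_guard instead of OSv's locking primitives, and a stand-in read_segment_from_disk() with a fake in-memory "disk" in place of the real device strategy call), just to illustrate the hit/miss path and the per-segment locking. The real mfs2_read() of course fills a uio rather than a raw buffer, but the flow is the same idea.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <mutex>
    #include <unordered_map>
    #include <vector>

    constexpr uint64_t segment_size = 16 * 1024;

    // Stand-in "disk" so the sketch is self-contained; the real code issues
    // a read through the device strategy routine instead.
    std::vector<char> fake_disk(1 << 20, 'x');

    std::vector<char> read_segment_from_disk(uint64_t offset, uint64_t len)
    {
        if (offset >= fake_disk.size()) {
            return {};
        }
        const uint64_t end = std::min<uint64_t>(offset + len, fake_disk.size());
        return std::vector<char>(fake_disk.data() + offset, fake_disk.data() + end);
    }

    struct file_segment {
        std::mutex lock;        // per-segment lock so a miss on one segment
                                // does not block readers of other segments
        bool loaded = false;
        std::vector<char> data; // up to 16K of file data
    };

    // One cache per file (i-node); the key is the segment index (offset / 16K).
    struct segment_cache {
        std::mutex map_lock;
        std::unordered_map<uint64_t, std::shared_ptr<file_segment>> segments;
    };

    // Copy up to "len" bytes of the file starting at "offset" into "buf",
    // reading the containing 16K segment from disk only on a cache miss.
    // For simplicity this sketch assumes the request does not cross a
    // segment boundary; the real read path would loop over segments.
    size_t cached_read(segment_cache& cache, uint64_t offset, char* buf, size_t len)
    {
        const uint64_t key = offset / segment_size;

        std::shared_ptr<file_segment> seg;
        {
            // Short critical section: only the hashmap lookup/insert.
            std::lock_guard<std::mutex> guard(cache.map_lock);
            auto& slot = cache.segments[key];
            if (!slot) {
                slot = std::make_shared<file_segment>();
            }
            seg = slot;
        }

        {
            // The disk read happens under the per-segment lock.
            std::lock_guard<std::mutex> guard(seg->lock);
            if (!seg->loaded) {
                seg->data = read_segment_from_disk(key * segment_size, segment_size);
                seg->loaded = true;
            }
        }

        const uint64_t off_in_seg = offset % segment_size;
        if (off_in_seg >= seg->data.size()) {
            return 0; // reading past the end of the file
        }
        const size_t n = std::min<size_t>(len, seg->data.size() - off_in_seg);
        std::memcpy(buf, seg->data.data() + off_in_seg, n);
        return n;
    }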
Each of the two I tested in 5 filesystem configurations:
- ramfs
- zfs
- mfs - the original James Root implementation
- my original mfs2 implementation without cache (mfs2_read loads as much data from disk as specified in the uio, typically 4K reads)
- mfs2 with cache as described above

Results:

*ramfs java*
---------------------
<..skipped..>
disk read (real mode): 97.03ms, (+97.03ms)
.init functions: 170.98ms, (+73.96ms)
SMP launched: 171.66ms, (+0.67ms)
VFS initialized: 179.72ms, (+8.07ms)
Network initialized: 180.11ms, (+0.38ms)
pvpanic done: 180.37ms, (+0.26ms)
pci enumerated: 184.97ms, (+4.60ms)
drivers probe: 184.97ms, (+0.00ms)
drivers loaded: 233.71ms, (+48.74ms)
ZFS mounted: 237.52ms, (+3.81ms) // Time point captured before file system is mounted
Total time: 242.95ms, (+5.43ms)
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
random: device unblocked.
Hello, World!
java main took 105.96ms
app main took 110.37 ms // Time point is captured before app main
[I/0 dhcp]: Unicasting DHCPRELEASE message with xid: [627516448] from client: 192.168.122.15 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
from pre-VFS mount until post-VFS unmount took *122.76 ms*

*ramfs node*
---------------------
<..skipped..>
Chdir to: '/js'
chdir done
disk read (real mode): 73.28ms, (+73.28ms)
.init functions: 149.10ms, (+75.82ms)
SMP launched: 149.75ms, (+0.65ms)
VFS initialized: 155.50ms, (+5.74ms)
Network initialized: 156.00ms, (+0.51ms)
pvpanic done: 156.14ms, (+0.14ms)
pci enumerated: 160.83ms, (+4.69ms)
drivers probe: 160.83ms, (+0.00ms)
drivers loaded: 219.33ms, (+58.50ms)
ZFS mounted: 223.21ms, (+3.89ms)
Total time: 228.70ms, (+5.48ms)
random: blocking on read.
random: device unblocked.
Adding 5 to 10 gives us 15
Finished
program exited with status 0
app main took 153.25 ms
[I/54 dhcp]: Unicasting DHCPRELEASE message with xid: [313573304] from client: 192.168.122.15 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
from pre-VFS mount until post-VFS unmount took *165.57 ms*

*zfs java*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 212.75ms
app main took 216.78 ms
from pre-VFS mount until post-VFS unmount took *6580.25 ms*

*zfs node*
---------------------
<..skipped>
disk read (real mode): 87.11ms, (+87.11ms)
.init functions: 109.59ms, (+22.48ms)
SMP launched: 110.30ms, (+0.70ms)
VFS initialized: 112.25ms, (+1.95ms)
Network initialized: 112.49ms, (+0.24ms)
pvpanic done: 112.60ms, (+0.12ms)
pci enumerated: 117.00ms, (+4.39ms)
drivers probe: 117.00ms, (+0.00ms)
drivers loaded: 160.96ms, (+43.96ms)
ZFS mounted: 1868.34ms, (+1707.38ms)
Total time: 1873.90ms, (+5.56ms)
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 225.80ms
app main took 233.77 ms
[I/0 dhcp]: Unicasting DHCPRELEASE message with xid: [766543878] from client: 192.168.122.15 to server: 192.168.122.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
from pre-VFS mount until post-VFS unmount took *7107.37 ms*

*mfs java*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 3205.35ms
app main took 3240.97 ms
Spent 3590 ms reading blocks
Read 26591 blocks
from pre-VFS mount until post-VFS unmount took *3244.95 ms*

*mfs node*
---------------------
Adding 5 to 10 gives us 15
Finished
app main took 3308.04 ms
Spent 3202 ms reading blocks
Read 25672 blocks
from pre-VFS mount until post-VFS unmount took *3312.35 ms*

*mfs2 java without cache*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 653.19ms
app main took 663.39 ms
Spent 617 ms reading blocks
Read 26316 blocks
Allocated 0 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *667.82 ms*

*mfs2 node without cache*
---------------------
Adding 5 to 10 gives us 15
Finished
app main took 596.32 ms
Spent 515 ms reading blocks
Read 25592 blocks
Allocated 0 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *600.37 ms*

*mfs2 java with cache*
---------------------
java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
Hello, World!
java main took 328.95ms
app main took 333.29 ms
Spent 243 ms reading blocks
Read 32473 blocks
Allocated 32642 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *337.70 ms*

*mfs2 node with cache*
---------------------
Adding 5 to 10 gives us 15
Finished
app main took 284.92 ms
Spent 204 ms reading blocks
Read 30750 blocks
Allocated 30787 512-byte blocks of cache memory
from pre-VFS mount until post-VFS unmount took *289.07 ms*

Given that I have a decent but pretty dated laptop (a late 2013 MacBook Pro with an SSD, running Ubuntu; please see the specs below), I am interested to see if others see similar or better mfs2 results.

Model Name: MacBook Pro
Model Identifier: MacBookPro11,3
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB

I can send a working patch if anyone wants to try it.

Here is a list of improvements to mfs I can think of:
- add logic to the cache to cap memory usage (some kind of LRU) and evict unused segments from memory after some time
- add file compression (how would it look on disk?)
- implement vop_cache to eliminate an unnecessary memory copy

Waldek

PS. Lastly, does anyone have a better name than "mfs2"?

On Wednesday, November 22, 2017 at 4:09:20 PM UTC-5, Nadav Har'El wrote:
>
> On Wed, Nov 22, 2017 at 6:43 PM, Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
>
>> I think that rather than implementing your suggestion, which might not be all that trivial, I would rather implement some kind of proper LRU based solution which would work like so:
>
> This can work. Interestingly, in the first patch which James Root sent, he did have such an LRU cache, of 512K.
> It had some problems with too much locking, but should probably work ok in your single-threaded case.
> He explained why he removed the cache in his comments when he sent the second version of his patch.
>
> Looking back at my review of his patch, I noted that in fs/vfs/vfs_fops.cc, vfs_file::mmap(), you have two options:
> 1. implement the vop_cache function, in which case mmap will take pages directly from your cache (not making another copy)
> 2. if you don't implement this, map_file_page_read (core/mmu.cc) is used, and this indeed just calls read() which will not bypass your cache.
> Perhaps with this "vop_cache" thing we can avoid the extra copy your code is doing. But it's probably negligible (copying 17MB should take just milliseconds) so we can ignore this for now.
>
>> - Every time mfs_read() is called it would try to find the corresponding buffer in its LRU cache (the i-node is the key)
>> - If found, it would copy from the memory buffer in the cache
>
> I guess you also mean you need to save and check also the file position of this saved buffer, not just the i-node?
>
>> - Otherwise it would read data from disk like so:
>>   - if the file size in blocks for this i-node is less than or equal to some pre-determined maximum buffer size (64K/128K), it would read only enough blocks to load the entire file into memory and add the buffer to the LRU cache
>>   - otherwise it would load the maximum buffer size of blocks from disk and add that to the LRU cache
>>     - if more blocks are needed to satisfy the same read, it would again load the remaining file blocks (or the maximum buffer size of blocks) from disk and replace the last buffer with the new one
>> - There would be a total maximum on the cache - like 1MB
>>   - Old segments would be dropped if there is not enough room
>> - After some predetermined time each i-node buffer would be freed from the cache, so eventually memory usage would go to 0 if the service is stateless
>>
>> This is just a sketch of a solution as I am not sure of the details. I might also be underestimating how difficult it is to implement. My goal is to improve mfs to make boot time and loading of read-once files as fast as possible. In general it should have good performance for contiguous file reads. I am not too worried about random file access.
>>
>> Please let me know what you think. Also, is there a generic LRU C++ template in the OSv code that I could re-use for something like this?
>
> I don't recall one. You can put your hash-table values (the buffers) in a std::list or boost::intrusive::list and every time you find a buffer, move it to the end of the list.
> James Root used "next" and "prev" pointers and C-style code in his mfs_cache.hh/cc.
>
>> On Wednesday, November 22, 2017 at 1:42:01 AM UTC-5, Nadav Har'El wrote:
>>>
>>> On Wed, Nov 22, 2017 at 8:01 AM, Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
>>>
>>>> I managed to implement my change to make mfs_read() use the device strategy routine directly to bulk-read however many blocks at once the uio parameter to mfs_read() specifies, and as expected the java run now takes ~850 ms (down from 3500-4500ms). Which means it still spends ~600 ms reading ~14MB of data.
>>>>
>>>> So in general my thinking was correct.
>>>
>>> Excellent!
>>>
>>> Another quick hack you could try without diving too deep into the paging code is to have mfs_read() read 4*4096 bytes (or whatever) when asked for just 4096 bytes, and save the unrequested bytes in a memory buffer; the next time mfs_read() is called, if it is asking for the next page, we already have it waiting in the buffer and can copy (unfortunately) from it instead of starting another read.
>>> If it is not the bytes wanted, we wasted this read-ahead and drop the buffer.
>>> Of course it's not a general and good cache like ZFS (it's also not a *cache* at all; if the same page is read multiple times with read() we read it again...), but it might be good enough for improving boot time.
>>>
>>> Nadav.