Hi, I read through your interesting measurements in this thread.

I agree with you that the problem is likely the reading of 4096 bytes at a
time. The reason this is slow may be not just that small reads have more
per-request overhead; the problem is also that the reads are completely
synchronous:

  1. OSv gets a page fault on a mapped file, and the kernel asks to read
just *one* page.
  2. At that point we need to exit to the host (slow), and the host starts
the read.
  3. The host runs the guest again, but OSv has nothing to do (the thread
is still waiting for the page it needs...), so OSv idle-loops and exits to
the host again.
  4. Eventually, the read on the host ends, and the guest is re-entered
with an interrupt saying the read is done.

All these exits and context switches, on both guest and host, need to happen
for every 4K page read... This is slow. I don't know if it can explain why
reading 17 MB takes 3 seconds - 17 MB is roughly 4,200 pages, so that's almost
a whopping millisecond per page - but maybe it does, especially if you have a
spinning disk?

On Mon, Nov 20, 2017 at 7:01 PM, Waldek Kozaczuk <jwkozac...@gmail.com>
wrote:

> If you look at the attached log you will see that most mfs_read() calls
> read 4096 bytes (8 blocks), which makes me think these are probably called
> when handling page faults, which would happen if the file was mmapped (which
> I think is the case for .so files - void file::load_segment(const Elf64_Phdr
> & phdr) in elf.cc). The problem is that all of these pages seem to be
> loaded one by one.
>
> Would some kind of simple "readahead" mechanism help here? Aren't most
> program files read sequentially anyway and in full?
>

Yes, I think exactly that could help.

One relatively easy thing we could do for your case (of mmap) is: when we
get a page fault, instead of allocating and reading just one page, allocate
and read *several* pages (of course, we need to check that we don't read
beyond the end of the vma if we are close to the end of the mapping).
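
To make the idea concrete, here is a minimal sketch of that clipping logic
(this is not OSv's actual fault-handling code; all the names, such as
fault_read_window, vma_file_end and readahead_pages, are hypothetical
placeholders):

#include <algorithm>
#include <cstdint>
#include <cstdio>

constexpr uint64_t page_size = 4096;
constexpr uint64_t readahead_pages = 32;      // e.g. read 128 KB per fault

struct window { uint64_t offset; uint64_t length; };

// Compute how much to read for a fault at fault_offset (a file offset),
// never going past the end of the mapping or the end of the file.
window fault_read_window(uint64_t fault_offset,  // file offset of the fault
                         uint64_t vma_file_end,  // file offset where the vma ends
                         uint64_t file_size)
{
    uint64_t start = fault_offset & ~(page_size - 1);   // align down to a page
    uint64_t want  = readahead_pages * page_size;
    uint64_t limit = std::min(vma_file_end, file_size);
    uint64_t len   = std::min(want, limit - start);     // clip at the end
    return {start, len};
}

int main()
{
    // Example: a fault near the end of a 17 MB file mapped in full.
    auto w = fault_read_window(17160000, 17169688, 17169688);
    printf("read %lu bytes starting at offset %lu\n",
           (unsigned long)w.length, (unsigned long)w.offset);
}

With something like 32 pages per fault, the 17 MB libjvm.so would take on the
order of 130 faults instead of ~4,200, so most of the per-fault exit overhead
should go away (the exact number of pages to read per fault would of course
be a tunable guess).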


>
> Also how is it possible that zfs is much faster at loading these files? Is
> it because it implements some kind of "readahead" mechanism, or because it
> is multithreaded and loads multiple blocks from disk at once?
>

Yes, I think ZFS reads many blocks (I don't know the details, Raphael might
remember something?) ahead into its own page cache, and later those pages
are mapped into the user's mapping.


>
> On Sunday, November 19, 2017 at 11:03:30 PM UTC-5, Waldek Kozaczuk wrote:
>>
>> Just for the reference I tried to measure the disk performance of the
>> host I am using (4 years old laptop).
>>
>> dd if=libjvm-stripped.so of=/dev/null iflag=direct ibs=512 //default dd
>> block size
>> dd: error reading 'libjvm-stripped.so': Invalid argument
>> 33534+1 records in
>> 33534+1 records out
>> 17169688 bytes (17 MB, 16 MiB) copied, 3.45808 s, 5.0 MB/s
>>
>>  dd if=r/libjvm-stripped.so of=/dev/null iflag=direct ibs=64K
>> dd: error reading 'libjvm-stripped.so': Invalid argument
>> 261+1 records in
>> 33534+1 records out
>> 17169688 bytes (17 MB, 16 MiB) copied, 0.0453535 s, 379 MB/s
>>
>> As you can see, with the default block size of 512 bytes dd reads data at
>> a speed similar to what I observed when testing the OSv Java image with mfs.
>> On the other hand, the same .so file (17MB) is read in ~45ms with a block
>> size of 64K.
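
As a side note, roughly the same comparison can be scripted in a few lines of
C++; this is only a sketch of the measurement (the file path is a
placeholder), not code from OSv or dd:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for O_DIRECT
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>

// Read the whole file with O_DIRECT using the given block size and report
// the throughput. The final partial read may fail with EINVAL under
// O_DIRECT, just like the dd runs above.
static void read_with_block_size(const char* path, size_t block)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return; }
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, block)) { close(fd); return; }
    auto start = std::chrono::steady_clock::now();
    size_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf, block)) > 0) {
        total += n;
    }
    std::chrono::duration<double> secs =
        std::chrono::steady_clock::now() - start;
    printf("block %6zu: %zu bytes in %.3f s (%.1f MB/s)\n",
           block, total, secs.count(), total / secs.count() / 1e6);
    free(buf);
    close(fd);
}

int main()
{
    const char* path = "libjvm-stripped.so";   // placeholder path
    read_with_block_size(path, 512);
    read_with_block_size(path, 64 * 1024);
}

The point it illustrates is the same as the dd numbers: with tiny requests
the per-request overhead dominates, not the disk bandwidth.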
>>
>> Waldek
>>
>> On Sunday, November 19, 2017 at 10:52:18 AM UTC-5, Waldek Kozaczuk wrote:
>>>
>>> For reference I am including the link to my mfs branch -
>>> https://github.com/wkozaczuk/osv/tree/readonly_mfs - that includes all
>>> my changes.
>>>
>>> Also I forgot to mention that I discovered that mfs does not support links
>>> to directories. I had to manually tweak the Java image.
>>>
>>> On Sunday, November 19, 2017 at 10:45:19 AM UTC-5, Waldek Kozaczuk wrote:
>>>>
>>>> I have applied the patch produced by James Root. I had to manually
>>>> tweak the Makefile and change gen-mfs-img.py by making it reuse
>>>> manifest_common.py.
>>>>
>>>> Overall I think it works well and as designed. BTW I can send a new MFS
>>>> patch which would be the original one plus my changes explained above.
>>>>
>>>> However I ran various tests to benchmark it, and the results were not as
>>>> great as I expected, especially for bigger images like Java. On the other
>>>> hand, it is much faster than ZFS for small images. Please see the numbers
>>>> below.
>>>>
>>>> In essence I tested a simple minimal Java "hello world" (./scripts/build
>>>> image=openjdk9-java-base,java-example fs=<fs>) and a native one
>>>> (./scripts/build image=native-example fs=<fs>). I also added a stopwatch
>>>> around the Java app main (modules/java-base/java.cc). I assumed that
>>>> loading and executing it (under 10K) would be negligible. Also, I did not
>>>> measure the whole run with "time ./scripts/run ..." as that would include
>>>> loading and starting QEMU, which is beside what we need to measure.
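
For clarity, the stopwatch is just the usual wall-clock measurement around a
call; a generic sketch (not the actual change in modules/java-base/java.cc)
looks like this:

#include <chrono>
#include <cstdio>

// Hypothetical placeholder for the call being timed (e.g. the Java app's main).
static void run_app() { /* ... */ }

int main()
{
    auto start = std::chrono::steady_clock::now();
    run_app();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    printf("java main took %.2fms\n", elapsed.count());
}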
>>>>
>>>> As you can see, both native ramfs and mfs images boot in 100-110 ms,
>>>> whereas zfs adds ~1,400 ms (it seems my machine is slow, as others report
>>>> similar small images with ZFS booting in 600-700ms).
>>>>
>>>> Java, on the other hand, was different. The fastest was ramfs, then zfs,
>>>> and mfs was the slowest. In both the ramfs and zfs tests the Java app was
>>>> able to run in ~200ms (which probably includes loading libjvm.so (17MB)
>>>> and other files, initializing the JVM, compiling byte code to native code
>>>> and executing, which makes 200ms pretty impressive). In the mfs case, as
>>>> you can see, the boot time is very fast since the fs is mounted rapidly,
>>>> but the Java app execution takes over 3 seconds :-(
>>>>
>>>> I added more stopwatches in the mfs code around mfsvnops.cc/mfs_read() to
>>>> see where the time gets spent. In the case of the Java example it loads 15
>>>> files/inodes, with 2 bulky ones. Please see the count of mfs_read calls
>>>> per inode:
>>>>
>>>>       5 of inode 110
>>>>      51 of inode 124
>>>>      28 of inode 125
>>>>      32 of inode 127
>>>> *    630 of inode 128 // 17MB of /usr/lib/jvm/java/lib/server/libjvm.so*
>>>> *   2457 of inode 131 // 8MB of /usr/lib/jvm/java/lib/modules (Java 9
>>>> replacement of rt.jar)*
>>>>      17 of inode 133
>>>>      31 of inode 134
>>>>      34 of inode 135
>>>>       1 of inode 32
>>>>      35 of inode 5
>>>>       1 of inode 55
>>>>      24 of inode 87
>>>>      19 of inode 88
>>>>      53 of inode 9
>>>>
>>>> Obviously, with so much printing to the console the timings get affected,
>>>> and the entire Java run takes 9,000ms, of which 5586.47 ms is spent in
>>>> mfs_read.
>>>>
>>>> As I understand it, during the Java mfs image execution dlopen loads the
>>>> libjvm.so file, which involves calling mfs_read (4096 bytes at a time),
>>>> where each call involves 8 calls to bread. So it looks like the system
>>>> ends up reading 8 times 2457 blocks of 512 bytes (the math does not seem
>>>> to work out, as it means only 9828 KB of libjvm.so gets loaded - shouldn't
>>>> dlopen read all 17MB of libjvm.so?). Please see the attached java_mfs2.log
>>>> that has all the stopwatches (inode 131 - libjvm.so).
>>>>
>>>> Is there a way to speed it up? Is loading 512 bytes (a single block) at a
>>>> time the only way to load data from disk? Why does it take under 200ms
>>>> when the ramfs image loads all the files in real mode - see the "disk read
>>>> (real mode): 148.11ms, (+148.11ms)" line in the bootchart of the ramfs
>>>> Java benchmark - yet over 3 seconds to load less data with mfs? Is it
>>>> because it goes through all the layers?
>>>>
>>>> Would it help if the entire file was loaded into memory at once on the
>>>> first read and then eventually released? Is it possible to load content
>>>> from disk in chunks bigger than 512 bytes?
>>>>
>>>> I also wonder whether it would help if information about the inodes, in
>>>> the form of some kind of map (file path -> inode), was preloaded when
>>>> mounting mfs.
>>>>
>>>> Any ideas?
>>>>
>>>> Waldek
>>>>
>>>> PS1. BTW, there is some problem with building ramfs images. Every time
>>>> one changes lzkernel_base, one needs to do a clean build, otherwise the
>>>> produced image crashes. To reproduce: build a zfs image, then change
>>>> lzkernel_base and build a ramfs image.
>>>>
>>>> PS2.
>>>> *NATIVE*
>>>> *zfs*
>>>> disk read (real mode): 26.26ms, (+26.26ms)
>>>> .init functions: 48.96ms, (+22.70ms)
>>>> SMP launched: 49.60ms, (+0.63ms)
>>>> VFS initialized: 51.90ms, (+2.30ms)
>>>> Network initialized: 52.16ms, (+0.27ms)
>>>> pvpanic done: 52.27ms, (+0.11ms)
>>>> pci enumerated: 56.99ms, (+4.71ms)
>>>> drivers probe: 56.99ms, (+0.00ms)
>>>> drivers loaded: 102.13ms, (+45.14ms)
>>>> ZFS mounted: 1449.70ms, (+1347.57ms)
>>>> Total time: 1455.74ms, (+6.04ms)
>>>> Hello from C code
>>>>
>>>> *ramfs*
>>>> disk read (real mode): 25.49ms, (+25.49ms)
>>>> .init functions: 46.62ms, (+21.13ms)
>>>> SMP launched: 47.37ms, (+0.75ms)
>>>> VFS initialized: 49.50ms, (+2.13ms)
>>>> Network initialized: 49.87ms, (+0.37ms)
>>>> pvpanic done: 49.99ms, (+0.11ms)
>>>> pci enumerated: 54.41ms, (+4.42ms)
>>>> drivers probe: 54.41ms, (+0.00ms)
>>>> drivers loaded: 97.66ms, (+43.25ms)
>>>> ZFS mounted: 100.99ms, (+3.34ms)
>>>> Total time: 106.63ms, (+5.64ms)
>>>> Hello from C code
>>>>
>>>> *mfs*
>>>> disk read (real mode): 26.36ms, (+26.36ms)
>>>> .init functions: 48.97ms, (+22.61ms)
>>>> SMP launched: 49.64ms, (+0.67ms)
>>>> VFS initialized: 51.63ms, (+1.99ms)
>>>> Network initialized: 52.00ms, (+0.37ms)
>>>> pvpanic done: 52.12ms, (+0.12ms)
>>>> pci enumerated: 56.52ms, (+4.40ms)
>>>> drivers probe: 56.52ms, (+0.00ms)
>>>> drivers loaded: 101.56ms, (+45.04ms)
>>>> ZFS mounted: 104.72ms, (+3.16ms)
>>>> Total time: 110.17ms, (+5.45ms)
>>>> Hello from C code
>>>>
>>>> *JAVA 9 *
>>>> *zfs*
>>>> disk read (real mode): 27.13ms, (+27.13ms)
>>>> .init functions: 50.21ms, (+23.07ms)
>>>> SMP launched: 50.81ms, (+0.60ms)
>>>> VFS initialized: 52.87ms, (+2.07ms)
>>>> Network initialized: 53.12ms, (+0.24ms)
>>>> pvpanic done: 53.24ms, (+0.12ms)
>>>> pci enumerated: 57.59ms, (+4.35ms)
>>>> drivers probe: 57.59ms, (+0.00ms)
>>>> drivers loaded: 100.55ms, (+42.96ms)
>>>> ZFS mounted: 1479.34ms, (+1378.79ms)
>>>> * Total time: 1484.99ms, (+5.65ms)*
>>>> java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
>>>> java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
>>>> Hello, World!
>>>> *java main took 201.02ms*
>>>>
>>>> *ramfs*
>>>> disk read (real mode): 148.11ms, (+148.11ms)
>>>> .init functions: 306.51ms, (+158.40ms)
>>>> SMP launched: 307.87ms, (+1.37ms)
>>>> VFS initialized: 322.56ms, (+14.68ms)
>>>> Network initialized: 323.10ms, (+0.54ms)
>>>> pvpanic done: 323.31ms, (+0.20ms)
>>>> pci enumerated: 333.27ms, (+9.96ms)
>>>> drivers probe: 333.27ms, (+0.01ms)
>>>> drivers loaded: 423.81ms, (+90.53ms)
>>>> ZFS mounted: 430.88ms, (+7.07ms)
>>>> * Total time: 442.40ms, (+11.52ms)*
>>>> java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
>>>> java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
>>>> random: device unblocked.
>>>> Hello, World!
>>>> *java main took 216.36ms*
>>>>
>>>> *mfs*
>>>> disk read (real mode): 24.60ms, (+24.60ms)
>>>> .init functions: 47.69ms, (+23.09ms)
>>>> SMP launched: 48.40ms, (+0.72ms)
>>>> VFS initialized: 50.49ms, (+2.09ms)
>>>> Network initialized: 50.82ms, (+0.33ms)
>>>> pvpanic done: 50.94ms, (+0.11ms)
>>>> pci enumerated: 55.60ms, (+4.66ms)
>>>> drivers probe: 55.60ms, (+0.00ms)
>>>> drivers loaded: 114.12ms, (+58.52ms)
>>>> ZFS mounted: 117.55ms, (+3.43ms)
>>>> * Total time: 123.21ms, (+5.67ms)*
>>>> java.so: Starting JVM app using: io/osv/nonisolated/RunNonIsolatedJvmApp
>>>> random: device unblocked.
>>>> java.so: Setting Java system classloader to NonIsolatingOsvSystemClassLoader
>>>> Hello, World!
>>>> *java main took 3386.41ms*
>>>>
>>>> On Wednesday, November 15, 2017 at 4:46:57 AM UTC-5, Nadav Har'El wrote:
>>>>>
>>>>>
>>>>> On Wed, Nov 15, 2017 at 5:21 AM, Waldek Kozaczuk <jwkoz...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> A better solution could be to not have a read-write ramdisk at all,
>>>>>> but rather a simple read-only filesystem on the disk image where we only
>>>>>> read file blocks as they are needed, and also cache them somehow for the
>>>>>> next reads.
>>>>>>
>>>>>>>
>>>>>>> This is what the proposed patches were about
>>>>>>> https://github.com/cloudius-systems/osv/pull/646 (which you already
>>>>>>> found). I reviewed those patches in this thread:
>>>>>>> https://groups.google.com/forum/#!searchin/osv-dev/readonly/
>>>>>>> osv-dev/DYSH4_NqyMg/YFJxM9NKBgAJ
>>>>>>>
>>>>>>> James Root came back in that thread with a new patch, but sadly it
>>>>>>> appears that nobody ever reviewed it, and James never complained about
>>>>>>> this code being forgotten.
>>>>>>> At this point I don't really remember if there was anything wrong in
>>>>>>> James's last patch - it might be in good working order and you can try it
>>>>>>> (and patch the build script to use it instead of ramfs).
>>>>>>>
>>>>>>> If you're interested in reviving it, it might be very close to being
>>>>>>> complete!
>>>>>>>
>>>>>>
>>>>>> I am definitely interested and I will give it a try. Is there any
>>>>>> existing performance test suite that I could use to test its performance?
>>>>>> What would be the things to measure (speed, etc?) and observe (memory
>>>>>> utilization?)?
>>>>>>
>>>>>
>>>>> No, I don't think we have any organized test suite.
>>>>>
>>>>> I'd try to observe the boot time of a large program (e.g., some Java
>>>>> application), the memory it needs (e.g., you can run "scripts/run.py
>>>>> -m1000M" and reduce the size until you can't reduce it any more), and also
>>>>> check that the amount of disk IO (perhaps via iostat in the host or some
>>>>> internal OSv tracepoint count) it does to run some workload.
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>> Also, even though most of the files in stateless apps do not change,
>>>>>> there is always a need to write to some log files. I guess for that I
>>>>>> could use the ramfs but probably improve its write implementation by
>>>>>> using some kind of linked list (*std::deque*) of fixed-size memory blocks
>>>>>> to store data, so we do not constantly free/malloc, per Justin's
>>>>>> recommendations. Do you have any other recommendations?
>>>>>>
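
For illustration, a minimal sketch of the chunked-storage idea quoted above -
fixed-size blocks held in a std::deque, so appending never reallocates or
copies what was already written. This is not the actual ramfs code; the class
and its names are made up:

#include <algorithm>
#include <cstring>
#include <deque>
#include <memory>

class chunked_file_data {
    static constexpr size_t block_size = 4096;
    std::deque<std::unique_ptr<char[]>> blocks_;
    size_t size_ = 0;                            // total bytes stored
public:
    void append(const char* data, size_t len) {
        while (len > 0) {
            size_t used = size_ % block_size;
            if (used == 0) {                     // last block full (or none yet)
                blocks_.push_back(std::make_unique<char[]>(block_size));
            }
            size_t n = std::min(len, block_size - used);
            std::memcpy(blocks_.back().get() + used, data, n);
            data += n; len -= n; size_ += n;
        }
    }
    size_t size() const { return size_; }
};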
>>>>>
>>>>> Yes, you can have the root filesystem ("/") on some read-only
>>>>> filesystem, but additionally mount a read-write filesystem, like ramfs, on
>>>>> /tmp or /var/log or whatever (of course, it's questionable why you'd want
>>>>> to write there if it's not persistent, but it will be easier if this is
>>>>> allowed for applications that can't be changed to log over the network,
>>>>> etc.).
>>>>>
>>>>> It should indeed be fairly easy to fix the ramfs append - see
>>>>> https://github.com/cloudius-systems/osv/issues/884
>>>>>
>>>>> https://github.com/cloudius-systems/osv/issues/739 is also about a
>>>>> different /tmp filesystem.
>>>>>
>>>>> Nadav.
>>>>>
