On 03/12/2017 12:19 AM, Kant Kodali wrote:
My response is inline.
On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <a...@scylladb.com
<mailto:a...@scylladb.com>> wrote:
There are several issues at play here.
First, a database runs a large number of concurrent
operations, each of which consumes only a small amount of
CPU. The high concurrency is needed to hide latency: disk
latency, or the latency of contacting a remote node.
*OK, so you are talking about hiding I/O latency. If all of
this I/O uses non-blocking system calls, then a thread per core
with a callback mechanism should suffice, shouldn't it?*
Scylla uses a mix of user-level threads and callbacks. Most of
the code uses callbacks (fronted by a future/promise API).
SSTable writers (memtable flush, compaction) use a user-level
thread (internally implemented using callbacks). The important
bit is multiplexing many concurrent operations onto a single
kernel thread.
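As an illustration (a minimal sketch, not Seastar's actual API), a callback-fronted future/promise can be built on plain, non-atomic state, precisely because everything for one shard runs on a single kernel thread:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Shared state between a promise and its future. All access happens on
// one kernel thread, so plain (non-atomic) members are sufficient.
template <typename T>
struct shared_state {
    bool ready = false;
    T value{};
    std::function<void(T)> continuation;
};

template <typename T>
class future {
public:
    explicit future(std::shared_ptr<shared_state<T>> s) : state_(std::move(s)) {}
    // Attach a callback: runs immediately if the value is already there,
    // otherwise it is stored and invoked when the promise is fulfilled.
    void then(std::function<void(T)> f) {
        if (state_->ready) f(state_->value);
        else state_->continuation = std::move(f);
    }
private:
    std::shared_ptr<shared_state<T>> state_;
};

template <typename T>
class promise {
public:
    promise() : state_(std::make_shared<shared_state<T>>()) {}
    future<T> get_future() { return future<T>(state_); }
    void set_value(T v) {
        state_->ready = true;
        state_->value = std::move(v);
        if (state_->continuation) state_->continuation(state_->value);
    }
private:
    std::shared_ptr<shared_state<T>> state_;
};
```

Either side may arrive first: attaching a continuation to an already-resolved future just runs it inline, and fulfilling a promise with a continuation attached runs the callback at that point.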
This means that the scheduler will need to switch contexts
very often. A kernel thread scheduler knows very little
about the application, so it has to save and restore a lot
of context on every switch. A user-level scheduler is
tightly bound to the application, so it can perform the
switch faster.
*Sure, but this applies in the other direction as well. A
user-level scheduler has no idea about the kernel-level
scheduler either. There is literally no coordination between
the kernel-level scheduler and the user-level scheduler in
Linux or any major OS. It may be possible in OSes that support
scheduler activations (LWPs) and an upcall mechanism.*
There is no need for coordination, because the kernel scheduler
has no scheduling decisions to make. With one thread per core,
bound to its core, the kernel scheduler can't make the wrong
decision because it has just one choice.
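A sketch of that binding, assuming Linux and the GNU `pthread_setaffinity_np` extension (the helper name `pin_to_core` is illustrative, not from Scylla):

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Restrict a thread to exactly one core. Once every shard thread is
// pinned like this, the kernel scheduler has exactly one legal placement
// per thread: there is no decision left for it to make.
int pin_to_core(pthread_t handle, unsigned core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(handle, sizeof(set), &set);  // 0 on success
}
```

Spawning one such pinned thread per core, each running its own event loop, is the whole thread-per-core model in miniature.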
*Even then it is hard to say whether it is all worth it (the
research shows the performance may not outweigh the
complexity). Go's problem is exactly this: if one creates 1000
goroutines/green threads and each of them makes a blocking
system call, the runtime creates 1000 kernel threads
underneath, because it has no way to know that a kernel thread
is blocked (no upcall).*
All of the significant system calls we issue are through the main
thread, either asynchronous or non-blocking.
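For instance, a non-blocking read from the event-loop thread either completes immediately or fails with EAGAIN, so the kernel never parks the thread (a generic POSIX sketch, not Scylla code):

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Switch a descriptor to non-blocking mode: from then on, read()/write()
// either complete immediately or fail with EAGAIN/EWOULDBLOCK instead of
// putting the calling thread to sleep in the kernel.
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

The event loop retries the operation when poll/epoll reports the descriptor ready, rather than dedicating a blocked kernel thread to each in-flight operation.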
*And in the non-blocking case I still don't see a significant
performance difference compared to a few kernel threads with a
callback mechanism.*
We do.
*If you are saying user-level scheduling is the future
(perhaps I would just let the researchers argue about that):
as of today that is not the case, or else languages would have
it natively instead of relying on third-party frameworks or
libraries.*
User-level scheduling is great for high performance I/O intensive
applications like databases and file systems. It's not a general
solution, and it involves a lot of effort to set up the
infrastructure. However, for our use case, it was worth it.
There are also implications on the concurrency primitives in
use (locks etc.) -- they will be much faster for the
user-level scheduler, because they cooperate with the
scheduler. For example, no atomic read-modify-write
instructions need to be executed.
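A sketch of such a cooperative primitive, assuming a single-threaded scheduler where tasks only switch at well-defined points (the name `coop_semaphore` is illustrative, not Seastar's actual API):

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// A semaphore for a single-threaded cooperative scheduler. Because all
// tasks run on the same kernel thread and only switch at well-defined
// points, a plain int and a plain deque suffice: no atomic
// read-modify-write, no futex, no memory barriers.
class coop_semaphore {
public:
    explicit coop_semaphore(int units) : units_(units) {}

    // Instead of blocking the kernel thread, wait() takes a continuation
    // to run once a unit is available.
    void wait(std::function<void()> cont) {
        if (units_ > 0) {
            --units_;
            cont();
        } else {
            waiters_.push_back(std::move(cont));
        }
    }

    void signal() {
        if (!waiters_.empty()) {
            auto next = std::move(waiters_.front());
            waiters_.pop_front();
            next();  // hand the unit directly to the next waiter
        } else {
            ++units_;
        }
    }

private:
    int units_;  // plain int: only ever touched from one thread
    std::deque<std::function<void()>> waiters_;
};
```

Contrast with a conventional semaphore, whose fast path alone costs at least one atomic read-modify-write instruction and whose slow path involves the kernel.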
Second, how many (kernel) threads should you run? *This
question one will always have. If 10K user-level threads map
to only one kernel thread, they cannot exploit parallelism. So
there is no single right answer, but a thread per core is a
reasonable choice.*
Only if you can multiplex many operations on top of each of those
threads. Otherwise, the CPUs end up underutilized.
If you run too few threads, then you will not be able to
saturate the CPU resources. This is a common problem with
Cassandra -- it's very hard to get it to consume all of the
CPU power on even a moderately large machine. On the other
hand, if you have too many threads, you will see latency
rise very quickly, because kernel scheduling granularity is
on the order of milliseconds. User-level scheduling,
because it leaves control in the hand of the application,
allows you to both saturate the CPU and maintain low latency.
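A minimal run-to-completion task queue illustrates the point: a "context switch" here is just popping the next callback off a deque, nanoseconds of work versus the milliseconds-scale kernel scheduling granularity (an illustrative sketch, not Seastar's scheduler):

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Run-to-completion task queue: many small operations multiplexed on one
// kernel thread. The application decides what runs next, so it can keep
// the CPU saturated while still bounding the latency of any single task.
class task_queue {
public:
    void post(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    void run() {
        while (!tasks_.empty()) {
            auto t = std::move(tasks_.front());
            tasks_.pop_front();
            t();  // runs to completion; long tasks re-post themselves to yield
        }
    }

private:
    std::deque<std::function<void()>> tasks_;
};
```

One such queue per pinned thread gives high CPU utilization without the thousands of kernel threads that drive latency up.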
*For my workload, and probably others I have seen, Cassandra
has always been CPU bound.*
Yes, but does it consume 100% of all of the cores on your
machine? Cassandra generally doesn't (on a larger machine), and
when you profile it, you see it spending much of its time in
atomic operations, or parking/unparking threads -- fighting with
itself. It doesn't scale within the machine. Scylla will happily
utilize all of the cores that it is assigned (all of them by
default in most configurations), and the bigger the machine you
give it, the happier it will be.
There are other factors, like NUMA-friendliness, but in the
end it all boils down to efficiency and control.
None of this is new btw, it's pretty common in the storage
world.
Avi
On 03/11/2017 11:18 PM, Kant Kodali wrote:
Here is the Java version:
http://docs.paralleluniverse.co/quasar/ but I still don't
see how user-level scheduling can be beneficial (this is a
well-debated problem). How can it add to performance? Or,
put differently, why is user-level scheduling necessary,
given the thread-per-core design and the callback mechanism?
On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity
<a...@scylladb.com <mailto:a...@scylladb.com>> wrote:
Scylla uses the Seastar framework, which provides both
user-level thread scheduling and simple run-to-completion
tasks.
Huge pages are limited to 2MB (and 1GB, but these
aren't available as transparent hugepages).
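For illustration, on Linux an application can hint that an anonymous mapping should be backed by 2MB transparent hugepages via madvise(MADV_HUGEPAGE); a sketch under that assumption (the helper name is made up):

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// 2MB is the x86-64 hugepage size usable as a transparent hugepage;
// 1GB pages exist but are not available via THP.
constexpr std::size_t kHugePageSize = 2u << 20;  // 2 MB

// Allocate an anonymous region rounded up to hugepage granularity and
// hint the kernel to back it with transparent hugepages. The hint is
// best-effort: if THP is disabled, the mapping still works on 4KB pages.
void* alloc_thp_region(std::size_t bytes) {
    std::size_t len = (bytes + kHugePageSize - 1) & ~(kHugePageSize - 1);
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, len, MADV_HUGEPAGE);  // best-effort; return value ignored
    return p;
}
```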
On 03/11/2017 10:26 PM, Kant Kodali wrote:
@Dor
1) You guys have a CPU scheduler? You mean a user-level
thread scheduler that maps user-level threads to
kernel-level threads? I thought C++ by default creates
native kernel threads, but sure, nothing stops someone
from creating a user-level scheduling library, if that's
what you are talking about?
2) How can one create a THP of size 1KB? According to
this post
<https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html>
it looks like the valid values are 2MB and 1GB.
Thanks,
kant
On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity
<a...@scylladb.com <mailto:a...@scylladb.com>> wrote:
Agreed. I'd recommend treating benchmarks as a rough
guide to see where there is potential, and following
through with your own tests.
On 03/11/2017 09:37 PM, Edward Capriolo wrote:
Benchmarks are great for FUDly blog posts. Real-world
workloads matter more. Every NoSQL vendor wins their
own benchmarks.