I was using the term “touch” loosely to mean pre-fetch. I suspect (though I think Intel has been de-emphasizing it) you can still issue a sensible prefetch instruction in native code. Even if not, you are still better off blocking in JNI code - I haven’t looked at the link to see whether the correct barriers are enforced by the sun.misc.Unsafe method.
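A minimal Java-level sketch of the “touch” idea (a hypothetical helper, not Cassandra’s actual read path): fault in one byte per 4K page of a MappedByteBuffer ahead of time, so the page faults happen in a warm-up step rather than mid-request. MappedByteBuffer.load() does much the same thing natively (madvise(MADV_WILLNEED) followed by touching each page).

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TouchMmap {
    static final int PAGE_SIZE = 4096; // assume 4K pages, as in the thread

    // Fault in one byte per page so the expensive page faults happen here,
    // up front, rather than on the hot path.
    static long touchPages(MappedByteBuffer buf) {
        long sink = 0;
        for (int pos = 0; pos < buf.limit(); pos += PAGE_SIZE) {
            sink += buf.get(pos); // one read per 4K page
        }
        return sink; // return the sum so the loop can't be optimized away
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("touch", ".bin");
        try {
            Files.write(p, new byte[4 * PAGE_SIZE]); // four zeroed pages
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                buf.load(); // built-in equivalent: madvise(WILLNEED) + touch
                System.out.println("pages=" + (buf.limit() / PAGE_SIZE)
                        + " sum=" + touchPages(buf));
            }
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

Note the caveat discussed in this thread: a pure-Java touch still takes its page faults in JITed code, so it only moves the stall earlier; doing the touch behind a JNI call is what would let a safepoint proceed around it.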
I do suspect that you’ll see up to about 5-10% syscall overhead if you hit pread.

> On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> This is starting to get into dev-list territory.
>
> Interesting idea to touch every 4K page you are going to read.
>
> You could use this to minimize the cost:
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>
> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> without prefetching, though.
>
> There is a system call to page the memory in, which might be better for
> larger reads. Still no guarantee things stay cached, though.
>
> Ariel
>
>> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at
>> the C* level rather than the JVM level where you could effectively do a JNI
>> touch of the mmap region you’re going to need next.
>>
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>>>
>>> We don’t use Azul’s Zing, but it does have the nice feature that all
>>> threads don’t have to reach safepoints at the same time. That said, we make
>>> heavy use of Cassandra (with off-heap memtables - not directly related, but
>>> it allows us a lot more GC headroom) and SOLR, where we switched to mmap
>>> because it FAR outperformed the pread variants - in no case have we noticed
>>> a long time to safepoint (then again, our IO is lightning fast).
>>>
>>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>
>>>> Linux automatically uses free memory as cache. It's not swap.
>>>>
>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>
>>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:
>>>>>
>>>>> Sorry, I don't catch something. What page (memory) cache can exist if
>>>>> there is no swap file?
>>>>> Where are those pages written/read?
>>>>>
>>>>> Best regards, Vladimir Yudovin,
>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud Cassandra on Azure and SoftLayer.
>>>>> Launch your cluster in minutes.
>>>>>
>>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>>>>>> Hi,
>>>>>>
>>>>>> Nope, I mean page cache. Linux doesn't call the cache it maintains using
>>>>>> free memory a file cache. It uses free (and some of the time not-so-free!)
>>>>>> memory to buffer writes and to cache recently written/read data.
>>>>>>
>>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>>
>>>>>> When Linux decides it needs free memory it can either evict stuff from
>>>>>> the page cache, flush dirty pages and then evict, or swap anonymous
>>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>>
>>>>>> Maybe we are talking at cross purposes? What I meant is that increasing
>>>>>> the heap size to reduce GC frequency is a legitimate thing to do, and it
>>>>>> does have an impact on the performance of the page cache even if you
>>>>>> have swap disabled.
>>>>>>
>>>>>> Ariel
>>>>>>
>>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>>>> Page cache is data pending flush to disk and data cached from disk.
>>>>>>>
>>>>>>> Do you mean file cache?
>>>>>>>
>>>>>>> Best regards, Vladimir Yudovin,
>>>>>>>
>>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
>>>>>>>> memory, and whatever else the Linux kernel supports paging out. Page
>>>>>>>> cache is data pending flush to disk and data cached from disk.
>>>>>>>>
>>>>>>>> Given how bad the GC pauses are in C*, I don't think this is the long
>>>>>>>> pole in the tent, at least not until key things are off-heap and C* can
>>>>>>>> run with CMS and get 10-millisecond GCs all day long.
>>>>>>>>
>>>>>>>> You can go through tuning and hardware selection to try to get more
>>>>>>>> consistent IO pauses and remove outliers, as you mention, and as a user
>>>>>>>> I think this is your best bet. Generally it's either bad device or
>>>>>>>> filesystem behavior if you get page faults taking more than 200
>>>>>>>> milliseconds, i.e. O(G1 GC collection).
>>>>>>>>
>>>>>>>> I think a JVM change to allow safepoints around memory-mapped file
>>>>>>>> access is really unlikely, although I agree it would be great. I think
>>>>>>>> the best hack around it is to code up your memory-mapped file access
>>>>>>>> as JNI methods and find some way to get that to work. Right now, if
>>>>>>>> you want to create a safepoint, a JNI method is the way to do it. The
>>>>>>>> problem is that JNI methods and POJOs don't get along well.
>>>>>>>>
>>>>>>>> If you think about it, the reason non-memory-mapped IO works well is
>>>>>>>> that it's all JNI methods, so they don't impact time to safepoint. I
>>>>>>>> think there is a tradeoff between tolerance for outliers and
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> I don't know the state of the non-memory-mapped path and how reliable
>>>>>>>> it is. If it were reliable and I couldn't tolerate the outliers, I
>>>>>>>> would use that. I have to ask, though: why are you not able to tolerate
>>>>>>>> the outliers? If you are reading and writing at quorum, how is this
>>>>>>>> impacting you?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Ariel
>>>>>>>>
>>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>>>>>>>> Hi Josh,
>>>>>>>>>
>>>>>>>>>> Running with increased heap size would reduce GC frequency, at the
>>>>>>>>>> cost of page cache.
>>>>>>>>>
>>>>>>>>> Actually it's recommended to run C* without virtual memory (swap) enabled.
>>>>>>>>> So if there is not enough memory, the JVM fails instead of blocking.
>>>>>>>>>
>>>>>>>>> Best regards, Vladimir Yudovin,
>>>>>>>>>
>>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <j...@code406.com> wrote ----
>>>>>>>>>> Hello cassandra-users,
>>>>>>>>>>
>>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd
>>>>>>>>>> like the list's input on confirming my hypothesis and finding mitigations.
>>>>>>>>>>
>>>>>>>>>> My hypothesis is that slow block devices are causing Cassandra's JVM to pause
>>>>>>>>>> completely while attempting to reach a safepoint.
>>>>>>>>>>
>>>>>>>>>> Background:
>>>>>>>>>>
>>>>>>>>>> Hotspot occasionally performs maintenance tasks that necessitate stopping all
>>>>>>>>>> of its threads. Threads running JITed code occasionally read from a given
>>>>>>>>>> safepoint page. If Hotspot has initiated a safepoint, reading from that page
>>>>>>>>>> essentially catapults the thread into purgatory until the safepoint completes
>>>>>>>>>> (the mechanism behind this is pretty cool). Threads performing syscalls or
>>>>>>>>>> executing native code do this check upon their return into the JVM.
>>>>>>>>>>
>>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all of its threads
>>>>>>>>>> are either patiently waiting for safepoint completion or in a system call.
>>>>>>>>>>
>>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal operation. When doing
>>>>>>>>>> mmapped reads, the JVM executes userspace code to effect a read from a file. On
>>>>>>>>>> the fast path (when the page needed is already mapped into the process), this
>>>>>>>>>> instruction is very fast.
>>>>>>>>>> When the page is not cached, the CPU triggers a page
>>>>>>>>>> fault and asks the OS to go fetch the page. The JVM doesn't even realize that
>>>>>>>>>> anything interesting is happening: to it, the thread is just executing a mov
>>>>>>>>>> instruction that happens to take a while.
>>>>>>>>>>
>>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state (assuming Linux
>>>>>>>>>> here) and goes off to find the desired page. This may take microseconds, this
>>>>>>>>>> may take milliseconds, or it may take seconds (or longer). When I/O occurs
>>>>>>>>>> while the JVM is trying to enter a safepoint, every thread has to wait for the
>>>>>>>>>> laggard I/O to complete.
>>>>>>>>>>
>>>>>>>>>> If you log safepoints with the right options [1], you can see these occurrences
>>>>>>>>>> in the JVM output:
>>>>>>>>>>
>>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected:
>>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
>>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>>>>>
>>>>>>>>>>> # SafepointSynchronize::begin: (End of list)
>>>>>>>>>>>          vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>>>>>>>>>>> 58099.941: G1IncCollectionPause  [ 447 1 1 ]  [ 3304 0 3305 1 190 ]  1
>>>>>>>>>>
>>>>>>>>>> If that safepoint happens to be a garbage collection (which this one was), you
>>>>>>>>>> can also see it in the GC logs:
>>>>>>>>>>
>>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which
>>>>>>>>>>> application threads were stopped: 3.4971808 seconds, Stopping
>>>>>>>>>>> threads took: 3.3050644 seconds
>>>>>>>>>>
>>>>>>>>>> In this way, JVM safepoints become a powerful weapon for transmuting a single
>>>>>>>>>> thread's slow I/O into the entire JVM's lockup.
>>>>>>>>>>
>>>>>>>>>> Does all of the above sound correct?
>>>>>>>>>>
>>>>>>>>>> Mitigations:
>>>>>>>>>>
>>>>>>>>>> 1) don't tolerate block devices that are slow
>>>>>>>>>>
>>>>>>>>>> This is easy in theory, and only somewhat difficult in practice. Tools like
>>>>>>>>>> perf and iosnoop [2] can do a pretty good job of letting you know when a block
>>>>>>>>>> device is slow.
>>>>>>>>>>
>>>>>>>>>> It is sad, though, because this makes running Cassandra on mixed hardware (e.g.
>>>>>>>>>> fast SSD and slow disks in a JBOD) quite unappetizing.
>>>>>>>>>>
>>>>>>>>>> 2) have fewer safepoints
>>>>>>>>>>
>>>>>>>>>> Two of the biggest sources of safepoints are garbage collection and revocation
>>>>>>>>>> of biased locks.
>>>>>>>>>> Evidence points toward biased locking being unhelpful for
>>>>>>>>>> Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way
>>>>>>>>>> to eliminate one source of safepoints.
>>>>>>>>>>
>>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running with increased
>>>>>>>>>> heap size would reduce GC frequency, at the cost of page cache. But sacrificing
>>>>>>>>>> page cache would increase page fault frequency, which is another thing we're
>>>>>>>>>> trying to avoid! I don't view this as a serious option.
>>>>>>>>>>
>>>>>>>>>> 3) use a different IO strategy
>>>>>>>>>>
>>>>>>>>>> Looking at the Cassandra source code, there appears to be an un(der)documented
>>>>>>>>>> configuration parameter called disk_access_mode. It appears that changing this
>>>>>>>>>> to 'standard' would switch to using pread() and pwrite() for I/O, instead of
>>>>>>>>>> mmap. I imagine there would be a throughput penalty here for the case when
>>>>>>>>>> pages are in the disk cache.
>>>>>>>>>>
>>>>>>>>>> Is this a serious option? It seems far too underdocumented to be thought of as
>>>>>>>>>> a contender.
>>>>>>>>>>
>>>>>>>>>> 4) modify the JVM
>>>>>>>>>>
>>>>>>>>>> This is a longer-term option. For the purposes of safepoints, perhaps the JVM
>>>>>>>>>> could treat reads from an mmapped file the same way it treats threads that
>>>>>>>>>> are running JNI code. That is, the safepoint would proceed even though the
>>>>>>>>>> reading thread has not "joined in". Upon finishing its mmapped read, the
>>>>>>>>>> reading thread would test the safepoint page (check whether a safepoint is in
>>>>>>>>>> progress, in other words).
>>>>>>>>>>
>>>>>>>>>> Conclusion:
>>>>>>>>>>
>>>>>>>>>> I don't imagine there's an easy solution here.
>>>>>>>>>> I plan to go ahead with
>>>>>>>>>> mitigation #1: "don't tolerate block devices that are slow", but I'd appreciate
>>>>>>>>>> any approach that doesn't require my hardware to be flawless all the time.
>>>>>>>>>>
>>>>>>>>>> Josh
>>>>>>>>>>
>>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
>>>>>>>>>>     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
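To make mitigation #3 concrete, here is a hedged sketch of what a pread-style path looks like at the Java level (a toy helper, not Cassandra's actual `standard` disk_access_mode implementation): the read blocks inside a native call, so a thread stuck on slow I/O counts as "in a system call" for safepoint purposes, and the safepoint can complete without it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    // Positional read (pread() under the hood): the thread blocks inside the
    // native call, so a safepoint can proceed while it waits on slow I/O.
    static ByteBuffer readAt(FileChannel ch, long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, offset + buf.position());
            if (n < 0) break; // hit EOF before filling the buffer
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("pread", ".bin");
        try {
            Files.write(p, "safepoint-friendly".getBytes("UTF-8"));
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                ByteBuffer b = readAt(ch, 10, 8); // read 8 bytes at offset 10
                System.out.println(new String(b.array(), 0, b.limit(), "UTF-8"));
            }
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

Per the thread, setting disk_access_mode to 'standard' is supposed to select this kind of I/O, at some throughput cost when the pages are already in the page cache.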