On Mon, Jul 23, 2018 at 7:11 PM, Andy Lutomirski <l...@kernel.org> wrote: > On 07/23/2018 05:55 AM, Fenghua Yu wrote: >> >> User wants to query if user wait instructions (umonitor, umwait, and >> tpause) are supported and use the instructions. The vDSO functions >> provides fast interface for user to check the support and use the >> instructions. >> >> waitpkg_supported and its alias __vdso_waitpkg_supported check if >> user wait instructions (a.k.a. wait package feature) are supported >> >> umonitor and its alias __vdso_umonitor provide user APIs for calling >> umonitor instruction. >> >> umwait and its alias __vdso_umwait provide user APIs for calling >> umwait instruction. >> >> tpause and its alias __vdso_tpause provide user APIs for calling >> tpause instruction. >> >> nsec_to_tsc and its alias __vdso_nsec_to_tsc converts nanoseconds >> to TSC counter if TSC frequency is known. It will fail if TSC frequency >> is unknown. >> >> The instructions can be implemented in intrinsic functions in future >> GCC. But the vDSO interfaces are available to user without the >> intrinsic functions support in GCC and the API waitpkg_supported and >> nsec_to_tsc cannot be implemented as GCC functions. >> >> Signed-off-by: Fenghua Yu <fenghua...@intel.com> >> --- >> arch/x86/entry/vdso/Makefile | 2 +- >> arch/x86/entry/vdso/vdso.lds.S | 10 ++ >> arch/x86/entry/vdso/vma.c | 9 ++ >> arch/x86/entry/vdso/vuserwait.c | 233 >> +++++++++++++++++++++++++++++++++ >> arch/x86/include/asm/vdso_funcs_data.h | 3 + >> 5 files changed, 256 insertions(+), 1 deletion(-) >> create mode 100644 arch/x86/entry/vdso/vuserwait.c >> >> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile >> index af4fcae5de83..fb0062b09b3c 100644 >> --- a/arch/x86/entry/vdso/Makefile >> +++ b/arch/x86/entry/vdso/Makefile >> @@ -17,7 +17,7 @@ VDSO32-$(CONFIG_X86_32) := y >> VDSO32-$(CONFIG_IA32_EMULATION) := y >> # files to link into the vdso >> -vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o >> +vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o >> vuserwait.o >> # files to link into kernel >> obj-y += vma.o >> diff --git a/arch/x86/entry/vdso/vdso.lds.S >> b/arch/x86/entry/vdso/vdso.lds.S >> index 097cdcda43a5..0942710608bf 100644 >> --- a/arch/x86/entry/vdso/vdso.lds.S >> +++ b/arch/x86/entry/vdso/vdso.lds.S >> @@ -35,6 +35,16 @@ VERSION { >> __vdso_movdir64b_supported; >> movdir64b; >> __vdso_movdir64b; >> + waitpkg_supported; >> + __vdso_waitpkg_supported; >> + umonitor; >> + __vdso_umonitor; >> + umwait; >> + __vdso_umwait; >> + tpause; >> + __vdso_tpause; >> + nsec_to_tsc; >> + __vdso_nsec_to_tsc; >> local: *; >> }; >> } >> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c >> index edbe5e63e5c2..006dfb5e5003 100644 >> --- a/arch/x86/entry/vdso/vma.c >> +++ b/arch/x86/entry/vdso/vma.c >> @@ -372,10 +372,19 @@ static int vgetcpu_online(unsigned int cpu) >> static void __init init_vdso_funcs_data(void) >> { >> + struct system_counterval_t sys_counterval; >> + >> if (static_cpu_has(X86_FEATURE_MOVDIRI)) >> vdso_funcs_data.movdiri_supported = true; >> if (static_cpu_has(X86_FEATURE_MOVDIR64B)) >> vdso_funcs_data.movdir64b_supported = true; >> + if (static_cpu_has(X86_FEATURE_WAITPKG)) >> + vdso_funcs_data.waitpkg_supported = true; >> + if (static_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) { >> + vdso_funcs_data.tsc_known_freq = true; >> + sys_counterval = convert_art_ns_to_tsc(1); >> + vdso_funcs_data.tsc_per_nsec = sys_counterval.cycles; >> + } > > > You're losing a ton of precision here. You might even be losing *all* of > the precision and malfunctioning rather badly. > > The correct way to do this is: > > tsc_counts = ns * mul >> shift; > > and the vclock code illustrates it. convert_art_ns_to_tsc() is a bad > example because it uses an expensive division operation for no good reason > except that no one bothered to optimize it. > >> +notrace int __vdso_nsec_to_tsc(unsigned long nsec, unsigned long *tsc) >> +{ >> + if (!_vdso_funcs_data->tsc_known_freq) >> + return -ENODEV; >> + >> + *tsc = _vdso_funcs_data->tsc_per_nsec * nsec; >> + >> + return 0; >> +} > > > Please don't expose this one at all. It would be nice for programs that use > waitpkg to be migratable using CRIU-like tools, and this export actively > harms any such effort. If you omit this function, then the kernel could > learn to abort an in-progress __vdso_umwait if preempted (rseq-style) and > CRIU would just work. It would be a bit of a hack, but it solves a real > problem. > >> +notrace int __vdso_umwait(int state, unsigned long nsec) > > > __vdso_umwait_relative(), please. Because some day (possibly soon) someone > will want __vdso_umwait_absolute() and its friend __vdso_read_art_ns() so > they can do: > > u64 start = __vdso_read_art_ns(); > __vdso_umonitor(...); > ... do something potentially slow or that might fault ... > __vdso_umwait_absolute(start + timeout); > > Also, this patch appears to have a subtle but show-stopping race. Consider: > > 1. Task A does UMONITOR on CPU 1 > 2. Task A is preempted. > 3. Task B does UMONITOR on CPU 1 at a different address > 4. Task A resumes > 5. Task A does UMWAIT > > Now task A hangs, at least until the next external wakeup happens. > > It's not entirely clear to me how you're supposed to fix this without some > abomination that's so bad that it torpedoes the entire feature. Except that > there is no chicken bit to turn this thing off. Sigh.
The UMWAIT mechanism also looks like it will work incorrectly under a VM. How do you (or, more generally, Intel) plan to handle that?