On Thu, 23 Feb 2023 06:18:49 GMT, Martin Doerr <mdo...@openjdk.org> wrote:
>> Implementation of "Foreign Function & Memory API" for linux on Power (Little >> Endian) according to "Power Architecture 64-Bit ELF V2 ABI Specification". >> >> This PR does not include code for VaList support because it's supposed to >> get removed by [JDK-8299736](https://bugs.openjdk.org/browse/JDK-8299736). >> I've kept the related tests disabled for this platform and throw an >> exception instead. Note that the ABI doesn't precisely specify variable >> argument lists. Instead, it refers to `<stdarg.h>` (2.2.4 Variable Argument >> Lists). >> >> Big Endian support is implemented to some extend, but not complete. E.g. >> structs with size not divisible by 8 are not passed correctly (see >> `useABIv2` in CallArranger.java). Big Endian is excluded by selecting >> `ARCH.equals("ppc64le")` (CABI.java) only. >> >> There's another limitation: This PR only accepts structures with size >> divisible by 4. (An `IllegalArgumentException` gets thrown otherwise.) I >> think arbitrary sizes are not usable on other platforms, either, because >> `SharedUtils.primitiveCarrierForSize` only accepts powers of 2. >> >> The ABI has some tricky corner cases related to HFA (Homogeneous Float >> Aggregate). The same argument may need to get passed in both, a FP reg and a >> GP reg or stack slot (see "no partial DW rule"). This cases are not covered >> by the existing tests. >> >> I had to make changes to shared code and code for other platforms: >> 1. Pass type information when creating `VMStorage` objects from `VMReg`. >> This is needed for the following reasons: >> - PPC64 ABI requires integer types to get extended to 64 bit (also see >> CCallingConventionRequiresIntsAsLongs in existing hotspot code). We need to >> know the type or at least the bit width for that. >> - Floating point load / store instructions need the correct width to select >> between the correct IEEE 754 formats. The register representation in single >> FP registers is always IEEE 754 double precision on PPC64. >> - Big Endian also needs usage of the precise size. Storing 8 Bytes and >> loading 4 Bytes yields different values than on Little Endian! >> 2. It happens that a `NativeMemorySegmentImpl` is used as a raw pointer >> (with byteSize() == 0) while running TestUpcallScope. Hence, existing size >> checks don't work (see MemorySegment.java). As a workaround, I'm just >> skipping the check in this particular case. Please check if this makes sense >> or if there's a better fix (possibly as separate RFE). > > Martin Doerr has updated the pull request incrementally with one additional > commit since the last revision: > > Remove size restriction for structs. Add TODO for Big Endian. This looks good overall I think, though I'll stick to my previous suggestion to try and move more logic into Java. Also, I recommend adding a test in `java/foreign/callarranger` similar to the tests already found there. They call the CallGenerator directly and then check the binding recipe. This could prove useful if we need to refactor shared code for whatever reason, since those tests run on (almost) every platform, so can be used to do some basic sanity checking. src/hotspot/cpu/ppc/downcallLinker_ppc.cpp line 133: > 131: Register callerSP = R2, // C/C++ uses R2 as TOC, but we can > reuse it here > 132: tmp = R11_scratch1, // same as shuffle_reg > 133: call_target_address = R12_scratch2; // same as _abi._scratch2 > (ABIv2 requires this reg!) Do I understand correctly that the ABI requires the register to be used for the call to be `R12`? How does that make a difference? I guess in some cases the callee might want to know the address through which it is called? (so it looks at `R12`) src/hotspot/cpu/ppc/downcallLinker_ppc.cpp line 154: > 152: // (abi_reg_args is abi_minframe plus space for 8 argument register > spill slots) > 153: assert(_abi._shadow_space_bytes == frame::abi_minframe_size, "expected > space according to ABI"); > 154: int allocated_frame_size = frame::abi_minframe_size + > MAX2(_input_registers.length(), 8) * BytesPerWord; This is hard-coding an assumption about the ABI that's being called. Ok for now. If it needs to be addressed in the future, it could be done by adding another field to `ABIDescriptor` like `min_stack_arg_bytes`, or something like that (which is set to zero for other ABIs). It seems to be different from `shadow_space` since it's also used by the caller to put stack arguments. src/hotspot/cpu/ppc/downcallLinker_ppc.cpp line 343: > 341: > 342: __ flush(); > 343: // Disassembler::decode((u_char*)start, (u_char*)__ pc(), tty); Leftover commented code? (note that the stub can also be disassembled with `-Xlog:foreign+downcall=trace` now) src/hotspot/cpu/ppc/foreignGlobals_ppc.cpp line 229: > 227: > 228: void ArgumentShuffle::pd_generate(MacroAssembler* masm, VMStorage tmp, > int in_stk_bias, int out_stk_bias, const StubLocations& locs) const { > 229: Register callerSP = as_Register(tmp); // preset It looks like `tmp` is being used to hold the caller's SP. I'm guessing this can not be computed the same way as we do on x86 and aarch64? (based on `RBP`, `RFP_BIAS`) If you want, you could add another register argument to `pd_generate` that is just invalid/unused on other platforms. That way you could use `tmp` for the shuffling instead of having to go through the stack. (looks like `R0` is already used in some cases as a temp register) src/hotspot/cpu/ppc/upcallLinker_ppc.cpp line 137: > 135: ArgumentShuffle arg_shuffle(in_sig_bt, total_in_args, out_sig_bt, > total_out_args, &in_conv, &out_conv, shuffle_reg); > 136: // The Java call uses the JIT ABI, but we also call C. > 137: int out_arg_area = MAX2(frame::jit_out_preserve_size + > arg_shuffle.out_arg_bytes(), (int)frame::abi_reg_args_size); We need `frame::abi_reg_args_size` since we call `on_entry`/`on_exit` which require the stack space I guess? src/hotspot/cpu/ppc/upcallLinker_ppc.cpp line 240: > 238: __ ld(call_target_address, in_bytes(Method::from_compiled_offset()), > R19_method); > 239: __ mtctr(call_target_address); > 240: __ bctrl(); Ok, I see. I guess there is some special purpose register called `CTR` which we are moving to for `bctrl` here. Does ABIv2 require that move to always come from `R12`? (from the comment in downcallLinker). (I'm trying to understand the requirements for possibly tweaking shared code). src/hotspot/cpu/ppc/upcallLinker_ppc.cpp line 347: > 345: FunctionDescriptor* fd = (FunctionDescriptor*)fd_addr; > 346: fd->set_entry(fd_addr + sizeof(FunctionDescriptor)); > 347: #endif Had to do a double take. Looks like we're not the only one who are using the name `FunctionDescriptor` :) src/hotspot/cpu/ppc/upcallLinker_ppc.cpp line 356: > 354: } > 355: #endif > 356: //blob->print_on(tty); Leftover commented code? src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java line 68: > 66: public abstract class CallArranger { > 67: // Linux PPC64 Little Endian uses ABI v2. > 68: private static final boolean useABIv2 = ByteOrder.nativeOrder() == > ByteOrder.LITTLE_ENDIAN; Now that I'm here. This could be a potentially interesting case for having 2 subclasses of CallArranger: one for `useABIv2 == true` and one for `false`. src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java line 81: > 79: new VMStorage[] { f1, f2, f3, f4, f5, f6, f7, f8 }, // FP output > 80: new VMStorage[] { r0, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, > r12 }, // volatile GP > 81: new VMStorage[] { f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, > f11, f12, f13 }, // volatile FP Note that argument registers are assumed volatile, so they don't have to be duplicated here. src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java line 286: > 284: // "no partial DW rule": Mark first stack slot to get > filled. > 285: // Note: Can only happen with forArguments = true. > 286: VMStorage overlappingReg = null; `overlappingReg` is initialized along all branches, so it's not needed to assign `null` here. src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java line 293: > 291: } else { > 292: overlappingReg = new > VMStorage(StorageType.STACK_AND_FLOAT, > 293: (short) > STACK_SLOT_SIZE, (int) stackOffset - 4); I think you could remove the mixed VMStorage types here relatively easily by returning a `VMStorage[][]`, where each element is a single element array, but then for the `needOverlapping` case add another element to the array for the extra store. Then when unboxing a `STRUCT_HFA`, `dup` the result of the `bufferLoad` and then do 2 `vmStore`s (one for each element). For boxing, you could just ignore the extra storage, and just `vmLoad` the first one (or, whichever one you like :)) src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/TypeClass.java line 66: > 64: } > 65: > 66: static boolean isHomogeneousFloatAggregate(MemoryLayout type, boolean > useABIv2) { Note that we had to make some changes to this routine on AArch64, since it didn't properly account for nested structs/unions and arrays. See: https://github.com/openjdk/panama-foreign/pull/780 Just as a heads up, in case PPC needs changes too. src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/linux/LinuxPPC64CallArranger.java line 33: > 31: * PPC64 CallArranger specialized for Linux ABI. > 32: */ > 33: public class LinuxPPC64CallArranger extends CallArranger { I don't really see the point in having a separate subclass with `CallArranger` being abstract, unless you are planning to add other implementations later? (edit: see also later comment in CallArranger https://github.com/openjdk/jdk/pull/12708#discussion_r1120753657) ------------- PR: https://git.openjdk.org/jdk/pull/12708