On 07/01/2021 13:27, Christophe Lyon via Gcc-patches wrote: > On Thu, 7 Jan 2021 at 13:56, Richard Earnshaw > <richard.earns...@foss.arm.com> wrote: >> >> On 07/01/2021 00:59, Daniel Engel wrote: >>> --snip-- >>> >>> On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote: >>> >>>> >>>> Thanks for working on this, Daniel. >>>> >>>> This is clearly stage1 material, so we've got time for a couple of >>>> iterations to sort things out. >>> >>> I appreciate your feedback. I had been hoping that with no regressions >>> this might still be eligible for stage2. Christophe never indicated >>> either way. but the fact that he was looking at it seemed positive. >>> I thought I would be a couple of weeks faster with this last >>> iteration, but holidays got in the way. >> >> GCC doesn't have a stage 2 any more (historical wart). We were in >> (late) stage3 when this was first posted, and because of the significant >> impact this might have on not just CM0 but other targets as well, I >> don't think it's something we should try to squeeze in at the last >> minute. We're now in stage 4, so that is doubly the case. >> >> Christophe is a very valuable member of our community, but he's not a >> port maintainer and thus cannot really rule on what can go into the >> tools, or when. >> >>> >>> I actually think your comments below could all be addressable within a >>> couple of days. But, I'm not accounting for the review process. >>> >>>> Firstly, the patch is very large, but contains a large number of >>>> distinct changes, so it would really benefit from being broken down into >>>> a number of distinct patches. This will make reviewing the individual >>>> changes much more straight-forward. >>> >>> I have no context for "large" or "small" with respect to gcc. This >>> patch comprises about 30% of a previously-monolithic library that's >>> been shipping since ~2016 (the rest is libm material). Other than >>> (1) the aforementioned change to div0(), (2) a nascent adaptation >>> for __truncdfsf2() (not enabled), and (3) the gratuitous addition of >>> the bitwise functions, the library remains pretty much as it was >>> originally released. >> >> Large, like many other terms is relative. For assembler file changes, >> which this is primarily, the overall size can be much smaller and still >> be considered 'large'. >> >>> >>> The driving force in the development of this library was small size, >>> which of course was never possible with the softfp routines. It's not >>> half-slow, either, for the limitations of the M0 architecture. And, >>> it's IEEE compliant. But, that means that most of the functions are >>> highly interconnected. So, some of it can be broken up as you outline >>> below, but that last patch is still worth more than half of the total. >> >> Nevertheless, having the floating-point code separated out will make >> reviewing more straight forward. I'll likely need to ask one of our FP >> experts to have a specific look at that part and that will be easier if >> it is disentangled from the other changes. >> >>> >>> I also have ~70k lines of test vectors that seem mostly redundant, but >>> not completely. I haven't decided what to do here. For example, I have >>> coverage for __aeabi_u/ldivmod, while GCC does not. If I do anything >>> with this code it will be in a separate thread. >> >> Publishing the test code, even if it isn't integrated into the GCC >> testsuite would be useful. Perhaps someone else could then help with that. >> >>> >>>> I'd suggest: >>>> >>>> 1) Some basic makefile cleanups to ease initial integration - in >>>> particular where we have things like >>>> >>>> LIB1FUNCS += <long list of functions> >>>> >>>> that this be rewritten with one function per line (and sorted >>>> alphabetically) - then we can see which functions are being changed in >>>> subsequent patches. It makes the Makefile fragments longer, but the >>>> improvement in clarity for makes this worthwhile. >>> >>> I know next to nothing about Makefiles, particularly ones as complex as >>> GCC's. I was just trying to work with the existing style to avoid >>> breaking something. However, I can certainly adopt this suggestion. >>> >>>> 2) The changes for the existing integer functions - preferably one >>>> function per patch. >>>> >>>> 3) The new integer functions that you're adding >>> >>> These wouldn't be too hard to do, but what are the expectations for >>> testing? A clean build of GCC takes about 6 hours in my VM, and >>> regression testing takes about 4 hours per architecture. You would want >>> a full regression report for each incremental patch? I have no idea how >>> to target regression tests that apply to particular runtime functions >>> without the risk of missing something. >>> >> >> Most of this can be tested in a cross-compile environment using qemu as >> a model. A cross build shouldn't take that long (especially if you >> restrict the compiler to just C and C++ - other languages are >> vanishingly unlikely to pick up errors in the parts of the compiler >> you're changing). But build checks will be mostly sufficient for most >> of the intermediate patches. >> >>>> 4) The floating-point support. >>>> >>>> Some more general observations: >>>> >>>> - where functions are already in lib1funcs.asm, please leave them there. >>> >>> I guess I have a different vision here. I have had a really hard time >>> following all of the nested #ifdefs in lib1funcs, so I thought it would >>> be helpful to begin breaking it up into logical units. >> >> Agreed, it's not easy. But the restructuring, if any, should be done >> separately from other changes, not mixed up with them. >> >>> >>> The functions removed were all functions for which I had THUMB1 >>> sequences faster/smaller than lib1funcs: __clzsi2, __clzdi2, __ctzsi2, >>> __ashrdi3, __lshrdi3, __ashldi3. In fact, the new THUMB1 of __clzsi2 is >>> the same number of instructions as the previous ARM/THUMB2 version. >>> >>> You will find all of the previous ARM versions of these functions merged >>> into the new files (with attribution) and the same preprocessor >>> selection path. So no architecture variant should be any worse off than >>> before this patch, and some beyond v6m should benefit. >>> >>> In the future, I think that my versions of __divsi3 and __divdi3 will >>> prove faster/smaller than the existing THUMB2 versions. I know that my >>> float routines are less than half the compiled size of THUMB2 versions >>> in 'ieee754-sf.S'. However, I haven't profiled the exact performance >>> differences so I have left all this work for future patches. (It's also >>> quite likely that my version can be further-refined with a few judicious >>> uses of THUMB2 alternatives.) >>> >>> My long-term vision would be use lib1funcs as an architectural wrapper >>> distinct from the implementation code. >>> >>>> - lets avoid having the cm0 subdirectory - in particular we should do >>>> this when there is existing code for other targets in the same source >>>> files. It's OK to have any new files in the main 'arm' directory of the >>>> source tree - just name the files appropriately if really needed. >>> >>> Fair point on the name. In v1 of this patch, all these files were all >>> preprocessor-selected for v6m only. However, as I've stumbled through >>> the finer points of integration, that line has blurred. Name aside, >>> the subdirectory does still represent a standalone library. I think >>> I've managed to add enough integration hooks that it works well in >>> a libgcc context, but it still has a very distinct implementation style. >>> >>> I don't have a strong opinion on this, just preference. But, keeping >>> the subdirectory with a neutral name will probably make maintenance >>> easier in the short term. I would suggest "lib0" (since it caters to >>> the lowest common denominator) or "eabi" (since that was the original >>> target). There are precedents in other architectures (e.g. avr). >> >> The issue here is that the selection of code from the various >> subdirectories is not consistent. In some cases we might be pulling in >> a thumb1 implementation into a thumb2 environment, so having the code in >> a directory that doesn't reflect this makes maintaining the code harder. >> I don't mind too much if some new files are introduced and their names >> reflect both their function and the architecture they support - eg >> t1-di-shift.S would obviously contain code for di-mode shifts in thumb1. >> >>> >>>> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a >>>> better term, and that is used widely elsewhere in the compiler. So if >>>> you really need a term just use THUMB1, or even T1. >>> >>> Maybe. The Cortex M0 includes a subset of THUMB2 instructions. Most >>> of this is probably THUMB1 clean, but it wasn't a design requirement. >> >> It's particularly the Thumb1 issue, just more the name is for a specific >> CPU which might cause confusion later. v6m would be preferable to that >> if there really is a dependency on the instructions that are not in the >> original Thumb1 ISA. >> >>> >>> The CM0_FUNC_START exists so that I can specify subsections of ".text" >>> for each function. This was a fairly fundamental design decision that >>> allowed me to make a number of branch optimizations between functions. >>> The other macros are just duplicates for naming symmetry. >> >> This is something we'll have to get to during the main review of the >> code - we used to have support for PE-COFF object files. That might now >> be obsolete, wince support is certainly deprecated - but we can't assume >> that ELF is the only object format we'll ever have to support. >> >>> >>> The existing FUNC_START macro inserts extra conflicting ".text" >>> directives that would break the build. Of course, the prefix was >>> arbitrary; I just took CM0 from the library name. But, there's nothing >>> architecturally significant about this macro at all, so THUMB1 and T1 >>> seems just about as wrong. Maybe define a FUNC_START_SECTION macro with >>> two parameters? For example: >>> >>> FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2 >>> >>> Instead of: >>> >>> .section .text.sorted.libgcc.clz2.clzdi2,"x" >>> CM0_FUNC_START clzdi2 >>> >>>> - For the 64-bit shift functions, I expect the existing code to be >>>> preferable whenever possible - I think it can even be tweaked to build >>>> for thumb2 by inserting a suitable IT instruction. So your code should >>>> only be used when >>>> >>>> #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1 >>> >>> That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting. >>> (The name doesn't seem quite right for Cortex M0, since it does support >>> some 32 bit instructions, but that's beside the point.) >> >> The terms Thumb1 and Thumb2 predate the arm-v8m architecture >> specifications - even then the term Thumb1 was interpreted as "mostly >> 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit". Yes, the >> 16/32-bit spilt has become more blurred and that will likely continue in >> future since the 16-bit encoding space is pretty full. >> >>> >>> The current lib1funcs ARM code path still exists, as described above. My >>> THUMB1 implementations were 1 - 3 instructions shorter than the current >>> versions, which is why I took the time to merge the files. >>> >>> Unfortunately, the Cortex M0 THUMB2 subset does not provide IT. I don't >>> see an advantage to eliminating the branch unless these functions were >>> written with cryptographic side channel attacks in mind. >> >> On high performance cores branches are predicted - if the branch is >> predictable then the common path will be taken and the unneeded >> instructions will never be used. But library functions like this tend >> to have very unpredictable values used for calling them, so it's much >> less likely that the hardware will predict the right path - at this >> point conditional instructions tend to win (especially if there aren't >> very many of them) because the cost (on average) of not executing the >> unneeded instructions is much lower than the cost (on average) of >> unwinding the processor state to execute the other arm of the >> conditional branch. >> >>> >>>> - most, if not all, of your LSYM symbols should not be needed after >>>> assembly, so should start with a captial 'L' (and no leading >>>> underscores), the assembler will then automatically discard any that are >>>> not needed for relocations. >>> >>> You don't want debugging symbols for libgcc internals :) ? I sort of >>> understand that, but those symbols have been useful to me in the past. >>> The "." by itself seems to keep visibility local, so the extra symbols >>> won't cause linker issuess. Would you object to a macro variant (e.g. >>> LLSYM) that prepends the "L" but is easier to disable? >> >> It is a matter of taste, but I really prefer the local symbols to >> disappear entirely once the file is compiled - it makes things like >> backtrace gdb show the proper call heirarchy. >> >>> >>>> - you'll need to write suitable commit messages for each patch, which >>>> also contain a suitable ChangeLog style entry. >>> >>> OK. >>> >>>> - finally, your popcount implementations have data in the code segment. >>>> That's going to cause problems when we have compilation options such as >>>> -mpure-code. >>> >>> I am just following the precedent of existing lib1funcs (e.g. __clz2si). >>> If this matters, you'll need to point in the right direction for the >>> fix. I'm not sure it does matter, since these functions are PIC anyway. >> >> That might be a bug in the clz implementations - Christophe: Any thoughts? >> > > Indeed that looks suspicious. I'm wondering why I saw no problem during > testing. > Is it possible that __clzsi2 is not covered by GCC's 'make check' ?
It's possible, but don't forget that modern cores have a CLZ instruction, so most tests will not end up using the library implementation. > >>> >>>> I strongly suggest that, rather than using gcc snapshots (I'm assuming >>>> this based on the diff style and directory naming in your patches), you >>>> switch to using a git tree, then you'll be able to use tools such as >>>> rebasing and the git posting tools to send the patch series for >>>> subsequent review. >>> >>> Your assumption is correct. I didn't think that I would have to get so >>> deep into the gcc development process for this contribution. I used >>> this library as a bare metal alternative for libgcc/libm in the product >>> for years, so I thought it would just drop in. But, the libgcc compile >>> mechanics have proved much more 'interesting'. I'm assuming this >>> architecture was created years before the introduction of -ffunction- >>> sections... >>> >> >> I don't think I've time to write a history lesson, even if you wanted >> it. Suffice to say, this does date back to the days of a.out format >> object files (with 4 relocation types, STABS debugging, and one code, >> one data and one bss section). >> >>>> >>>> Richard. >>>> >>> >>> Thanks again, >>> Daniel >>> >> >> R. R.