Re: [U-Boot] [PATCH] nvme: add more cache flushes
Hi All, On 10/17/2019 12:12 AM, Patrick Wildt wrote: On Thu, Oct 17, 2019 at 03:08:59PM +0800, Bin Meng wrote: Hi Patrick, On Thu, Oct 17, 2019 at 2:44 PM Patrick Wildt wrote: On Thu, Oct 17, 2019 at 10:55:11AM +0800, Bin Meng wrote: Hi Patrick, On Wed, Oct 16, 2019 at 11:35 PM Patrick Wildt wrote: On Wed, Oct 16, 2019 at 06:11:23PM +0800, Bin Meng wrote: On Mon, Oct 14, 2019 at 7:11 PM Patrick Wildt wrote: On an i.MX8MQ our nvme driver doesn't completely work since we are missing a few cache flushes. One is the prp list, which is an extra buffer that we need to flush before handing it to the hardware. Also the block read/write operations needs more cache flushes on this SoC. Signed-off-by: Patrick Wildt --- drivers/nvme/nvme.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c index 2444e0270f..69d5e3eedc 100644 --- a/drivers/nvme/nvme.c +++ b/drivers/nvme/nvme.c @@ -123,6 +123,9 @@ static int nvme_setup_prps(struct nvme_dev *dev, u64 *prp2, } *prp2 = (ulong)dev->prp_pool; + flush_dcache_range((ulong)dev->prp_pool, (ulong)dev->prp_pool + + dev->prp_entry_num * sizeof(u64)); + return 0; } @@ -717,9 +720,10 @@ static ulong nvme_blk_rw(struct udevice *udev, lbaint_t blknr, u16 lbas = 1 << (dev->max_transfer_shift - ns->lba_shift); u64 total_lbas = blkcnt; - if (!read) - flush_dcache_range((unsigned long)buffer, - (unsigned long)buffer + total_len); + flush_dcache_range((unsigned long)buffer, + (unsigned long)buffer + total_len); Why we need this for read? I'm no processor engineer, but I have read (and "seen") the following on ARMs. If I'm wrong. please correct me. Since the buffer is usually allocated cache-aligned on the stack, it is very possible that this buffer was previously still used as it's supposed to be used: as stack. Thus, the caches can still be filled, and are not yet evicted/flushed to memory. Now it is possible that between the DMA access from the device and our cache invalidation, the CPU cache writes back its contents to memory, overwriting whatever the NVMe just gave us. OK, this makes sense. So if we allocate the buffer from the heap, we should only care about flush on write, right? Can you test this? If you're talking about having a bounce buffer: You'd need to flush it once upon allocation, because that part of the heap could have also be used earlier by someone else. I was not talking about bounce buffer. I mean the buffer allocated from help and use that directly for DMA. Regards, Bin If you allocate a buffer from the heap, you still need to make sure to flush it once, since there's still the chance that you have used it shortly earlier. But it's less of an issue as on the stack. The "rules" for cache management of DMA buffers for non-cache-coherent CPUs are immutable. It doesn't matter where the memory came from (static, heap, or stack). It may be in cache, and it may be dirty. You can't know it is for sure "clean". It is assumed that the DMA buffers are allocated to begin on a cache line boundary and are a multiple of a cache line in length. If this is not the case, references by the CPU to locations outside the buffer can make the cache state of the buffer dirty, which is an error. It is also required that there be no accesses to the DMA buffer by the CPU while DMA is in progress. This is normally true by default, and if it isn't true, it is an error. The rules are then as follows: On write, before DMA is started. the cache corresponding to the DMA buffer addresses MUST be flushed to ensure the desired content is transferred from cache to RAM before write DMA begins. On read, before DMA is started, the cache corresponding to the DMA buffer addresses MUST be either invalidated or flushed to ensure that no cache system initiated writes to RAM will occur while read DMA is in progress. Normally, invalidate is preferred because it is faster. However, for programming simplicity some drivers choose to flush before both read or write DMA is started. If that is the case, it is good practice to note that choice in a comment. On write, after DMA is completed, no additional cache actions are required. On read, after DMA is completed, the cache corresponding to the DMA buffer addresses MUST be invalidated. This is necessary to ensure that stale data in the cache will not be used instead of the new read data in RAM. If one elected to invalidate the cache before the read DMA started, one may wonder why a second invalidate is necessary. Since the buffer is not allowed to be referenced by the program in the interim, the cache should not contain any data from the DMA buffer area. The reason is that modern CPUs, may load data from the DMA buffer into cache while DMA is in progress. This can be
Re: [U-Boot] [PATCH 1/3] spl: ram: Do not memcpy() identical buffers
On 5/28/2019 4:19 AM, Tom Rini wrote: On Tue, May 28, 2019 at 05:24:34AM +0200, Marek Vasut wrote: On 5/28/19 5:04 AM, Tom Rini wrote: On Tue, May 28, 2019 at 04:44:52AM +0200, Marek Vasut wrote: On 5/28/19 4:42 AM, Tom Rini wrote: On Tue, May 28, 2019 at 04:07:44AM +0200, Marek Vasut wrote: On 5/28/19 4:06 AM, Tom Rini wrote: On Tue, May 28, 2019 at 03:49:13AM +0200, Marek Vasut wrote: If the source and destination buffer address is identical, there is no need to memcpy() the content. Skip the memcpy() in such a case. Signed-off-by: Marek Vasut Cc: Michal Simek Cc: Tom Rini Shouldn't memcpy catch that itself? memcpy(3) says The memcpy() function copies n bytes from memory area src to memory area dest. The memory areas must not overlap. Use memmove(3) if the memory areas do overlap. OK, and shouldn't memcpy optimize that case? Does it usually? As the manpage says "The memory areas must not overlap." , I would expect it does not have to ? I guess I'm not being clear enough, sorry. Go look at how this is implemented in a few places please and report back to us. Someone else, or many someone else, have probably already figured out if optimizing this case in general, in memcpy, is a good idea or not. Thanks! If even [1] says the behavior is undefined, then what does it matter whether some implementation added an optimization for such undefined behavior? We cannot depend on it, can we ? [1] https://pubs.opengroup.org/onlinepubs/009695399/functions/memcpy.html Yes, yes it would be worth seeing if other groups that have looked into the performance impact here have also decided to optimize this undefined behavior or not, thanks. I don't think this is an optimization question, really. Calling memcpy with overlapping areas is an error. The result is explicitly undefined. It may well be that all the existing implementations effectively do nothing, either by explicit check or by actually copying the data over itself. However, to rely on that behavior is asking for trouble down the line. Undefined behavior is exactly that. Don't do it. ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
Re: [U-Boot] [PATCH] libfdt: Initialize the stack variable
On 9/4/2017 8:41 PM, Chee, Tien Fong wrote: On Rab, 2017-08-30 at 06:31 -0700, J. William Campbell wrote: On 8/29/2017 10:15 PM, tien.fong.c...@intel.com wrote: From: Tien Fong Chee Report Coverity log: The code uses a variable that has not been initialized, leading to unpredictable or unintended results. Reported-by: Coverity (CID: 60519) Signed-off-by: Tien Fong Chee --- lib/libfdt/fdt_wip.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/lib/libfdt/fdt_wip.c b/lib/libfdt/fdt_wip.c index 45fb964..01adad0 100644 --- a/lib/libfdt/fdt_wip.c +++ b/lib/libfdt/fdt_wip.c @@ -115,7 +115,7 @@ int fdt_find_regions(const void *fdt, char * const inc[], int inc_count, struct fdt_region region[], int max_regions, char *path, int path_len, int add_string_tab) { - int stack[FDT_MAX_DEPTH]; + int stack[FDT_MAX_DEPTH] = { 0 }; It seems to me that one of three things must be true. 1) Coverity can't correctly analyze the code and stack[] is not used in an un- initialized manner, 2) stack is used in an un-initialized manner but the result is not used in that case and is a "don't care" or 3) there is a bug in the code. It seems that just initializing the variable to 0 is a "Bad Idea(tm)". If it is case 1 or 2, there should be a Coverity code annotation comment added to that effect, and if it is case 3, it should be fixed. Initializing this variable makes the binary larger to no purpose unless there is a bug already. Best Regards, J. William Campbell Yeah, i agree with you, state machine design should ensure stack[] is not used in a uninitialized manner. Hence, i need input from whom familiar with this function, whether this warning fall in anyone of these conditions. If we just direct init the stack[], and this solution will make extra 128 bytes in binary, but having variable with default value is also good pratice from software quality perspective. Yes, if the default value has a rationale. On the surface, there is no way to know that 0 is a "good" initial value. There may be a reason that it is, but if we don't know for sure, it is just a "random" number. I hope whoever wrote this will speak up and say why the variable is never used before it is initialized. Thank you for being so diligent. Best Regards, Bill Campbell char *end; int nextoffset = 0; uint32_t tag; ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
Re: [U-Boot] [PATCH] libfdt: Initialize the stack variable
On 8/29/2017 10:15 PM, tien.fong.c...@intel.com wrote: From: Tien Fong Chee Report Coverity log: The code uses a variable that has not been initialized, leading to unpredictable or unintended results. Reported-by: Coverity (CID: 60519) Signed-off-by: Tien Fong Chee --- lib/libfdt/fdt_wip.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/lib/libfdt/fdt_wip.c b/lib/libfdt/fdt_wip.c index 45fb964..01adad0 100644 --- a/lib/libfdt/fdt_wip.c +++ b/lib/libfdt/fdt_wip.c @@ -115,7 +115,7 @@ int fdt_find_regions(const void *fdt, char * const inc[], int inc_count, struct fdt_region region[], int max_regions, char *path, int path_len, int add_string_tab) { - int stack[FDT_MAX_DEPTH]; + int stack[FDT_MAX_DEPTH] = { 0 }; It seems to me that one of three things must be true. 1) Coverity can't correctly analyze the code and stack[] is not used in an un-initialized manner, 2) stack is used in an un-initialized manner but the result is not used in that case and is a "don't care" or 3) there is a bug in the code. It seems that just initializing the variable to 0 is a "Bad Idea(tm)". If it is case 1 or 2, there should be a Coverity code annotation comment added to that effect, and if it is case 3, it should be fixed. Initializing this variable makes the binary larger to no purpose unless there is a bug already. Best Regards, J. William Campbell char *end; int nextoffset = 0; uint32_t tag; ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and normal U-Boot
On 7/3/2017 8:12 PM, Peng Fan wrote: -Original Message- From: Tom Rini [mailto:tr...@konsulko.com] Sent: Tuesday, July 04, 2017 10:47 AM To: Peng Fan Cc: Simon Glass ; Philipp Tomsich ; albert.u.b...@aribaud.net; u- b...@lists.denx.de Subject: Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and normal U-Boot On Tue, Jul 04, 2017 at 01:09:36AM +, Peng Fan wrote: Hi Tom, -Original Message- From: Tom Rini [mailto:tr...@konsulko.com] Sent: Tuesday, July 04, 2017 12:17 AM To: Peng Fan ; Simon Glass ; Philipp Tomsich Cc: albert.u.b...@aribaud.net; u-boot@lists.denx.de Subject: Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and normal U-Boot On Mon, Jul 03, 2017 at 09:14:08PM +0800, Peng Fan wrote: If not pass -fno-pic to toolchains, some toolchains may generate .got and .got.plt sections, but when generate binaries, we did not take .got and .got.plt into consideration, then SPL or normal U-Boot boot failure because image corrupted. Need to pass -fno-pic to disable generating .got and .got.plt sections. Signed-off-by: Peng Fan Cc: Albert Aribaud Cc: Tom Rini --- arch/arm/config.mk | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm/config.mk b/arch/arm/config.mk index 1a9..66ae403 100644 --- a/arch/arm/config.mk +++ b/arch/arm/config.mk @@ -130,9 +130,10 @@ ALL-y += checkarmreloc # instruction. Relocation is not supported for that case, so disable # such usage by requiring word relocations. PLATFORM_CPPFLAGS += $(call cc-option, -mword-relocations) -PLATFORM_CPPFLAGS += $(call cc-option, -fno-pic) endif +PLATFORM_CPPFLAGS += $(call cc-option, -fno-pic) + # limit ourselves to the sections we want in the .bin. ifdef CONFIG_ARM64 OBJCOPYFLAGS += -j .text -j .secure_text -j .secure_data -j .rodata -j .data \ Something is "up" here and I need you to dig harder and perhaps see if we're missing something in the linker scripts? The very next line of context here is: -j .u_boot_list -j .rela.dyn -j .got -j .got.plt Meaning that we intentionally copy .got / .got.plt into the resulting binary. And I see that we took in 397d7d5a1be1 from you back in 2016 saying that we needed this in SPL. But 5a942a152776 put the got/got.plt sections (for 32bit ARM) in intentionally as some relocations do need it. And in 4b0d506ed3b4 Philipp seems to have seen the same problem you have, but fixed it with adding got/got.plt to the sections list we copy in (the above hunk of context). If pass -fno-pic to compiler, there will be no .got and .got.plt sections. The .got and .got.plt is usually used for dynamic link, such as linux "*.so" file. We need to pass -fno-pic to compiler to remove .got and .got.plt sections. "Usually" isn't the same as "always" or "only". And this reminded me that we From https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html#Code-Gen-Options, when -fpic is used for ARM, there will be GOT. The dynamic loader resolves the GOT entries when the program starts (the dynamic loader is not part of GCC; it is part of the operating system). had to deal with e391b1e64b0b because yes, we're not making a shared library but we do have position independent code. So, in the case of SPL, since we For position independent code, but not making a shared library, we could use -fpie. can get away with -fno-pic (and get some space savings) that's just not true of U-Boot itself. We're enforcing -fpic on other architectures, so what exactly is going on with what you're seeing? Where If not passing -fno-pic to gcc, the android toolchain will generate .got and .got.plt section. I think these sections are for dynamic link, not for static link. would we need to be taking .got/.got.plt into consideration and why would it be a bad thing to do so? Thanks! We could keep .got and .got.plt, but personally I prefer use -fno-pic. I think fpic is for shared library and fpie is for position independt code for static link. Thanks, Peng. To clarify things, as I understand it, -pie produces the same code and segments as -pic does, plus a loader to load the code correctly into memory. u-boot probably doesn't want that loader. The loader is required to make the executable work under the OS, but stand-alone probably doesn't work right in general. For u-boot, the advantage of -pic code is that there are no TEXTREL relocations in the binary. The code is generated to take care of bss (and code segment when required) references that require relocation. Things in the .got may need relocating by where bss ends up, but not the code. So relocation is short and simple. Non pic code requires relocation processing that may be more complex than pic code, and requires a larger relocation table. If you are loading the code and bss at the location they were linked for, neither type requires relocation. If the code and or bss is located at a different address, -pic requires a little relocation and no
Re: [U-Boot] [PATCH v2] NS16550: buffer reads
On 10/24/2011 3:22 PM, Graeme Russ wrote: > Hi Wolfgang, > > On Tue, Oct 25, 2011 at 8:59 AM, Wolfgang Denk wrote: >> Dear Graeme Russ, >> >> In message<4ea5cd39.2070...@gmail.com> you wrote: Assume we have a simple device with a small Rx FIFO - say, 8 bytes only. Guess what the chances are that we will overrun this FIFO (and then lose characters) before the other side stops sending? >>> Should be zero (if you disable the hardware Tx buffer) unless you have an >>> incredibly slow Rx routine (in which case, your baud rate is too high). And >>> even with a Tx buffer in play the solution is fairly simple - After sending >>> an XOFF, flush the Rx buffer (and remote Tx buffer) into an internal buffer >>> until tstc() returns false - use this buffer for getc() until it is empty. >> Things are becoming more and more complicated, it seems. >> >> If you ask me: forget about all this, and stick with single line >> input. Do the processing on the sender's side (always wait for a >> prompt before sending the next line). This is MUCH easier. > Oh, I wholehearedly agree that if we impose such a limitation, life will > be much easier for us all ;) > > But realistically, the solution is actually very trivial and comprises > just two parts: > > 1a) When the command line interpreter (which is just the really simple > function detecting new-line or command delimiter) dispatches a > command string to the command interpreter (which looks up the > command table and if it finds a command runs it) send an XOFF > 1b) When the command interpreter returns control back to the command > line interpreter, send an XON > 1c) If an XOFF has been recently sent (i.e. no XON since) send an XON > in getc() > > 2) After sending XOFF, flush the transmitter's buffer into a small > (16 bytes chould sufffice) buffer. Grab characters from this buffer > before grabing from the UART when getc() is called Hi All, The problem with all this is that the transmitting CPU may send "several" characters after the XOFF is transmitted, where several will be at least two if the transmitting CPU has no output fifo, up to possibly the output fifo length + 2. Furthermore, this "overage" can theoretically increase with each x-on x-off sequence, as each command may be, in theory, shorter than the output fifo length. So the input buffering is needed in all cases, i.e. the receive fifo must be "run dry" before executing a command. If the receive fifo is shorter than the transmit fifo +2, you are doomed. If the receive fifo is larger than this, you are ok, the transmitting CPU will stop before overrunning the input fifo and data being lost. However, a 16 byte buffer will not be enough if there are lots of "short" commands in the input stream. Best Regards, Bill Campbell > > The second part should not even be necessary if you have a half-sane > serial port with a half-decent hardware buffer. > > It really would not be that hard (maybe a few dozen lines of code at most), > would work for all UARTS and should (hopefully) resolve all 'dropped > character' issues. > > Now 1c) makes the bold assumption that any command which calls getc() > 'knows what it is doing' and will consume all input characters at a fast > enough rate AND will not invoke a delay between receiving the last > character and returning back to the command line interpreter. If the > command knows it received characters and knows it will invoke a delay (some > kind of post-processing) then in order to not lose characters, the command > MUST issue it's own XOFF - nothing else aside from an interrupt driven > serial driver will solve the problem in this case. > > Regards, > > Graeme > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Timer API (again!)
On 9/16/2011 4:53 AM, Graeme Russ wrote: > Hi All, > > Well, here we are again, a Timer API discussion > > All things considered, I don't think the Linux approach is right for U-Boot > - It is designed to cater for way more use-cases than U-Boot will ever need > to deal with (time queues and call-backs in particular) Hi Graeme, Glad you are tackling this again. > > To summarize, in a nutshell, what _I_ think U-Boot need from a Timer API > > 1. A reliable udelay() available as early as possible that is OK to delay > longer than requested, but not shorter - Accuracy is not implied or > assumed I think you don't really mean this. SOME attempt at accuracy is to be expected. Not perfect is allowed, but to be totally bogus is not allowed. That is worse than nothing. Longer by 20% or so is possibly ok, 2% isn't. > 2. A method of obtaining a number of 'time intervals' which have elapsed > since some random epoch (more info below) > 3. A nice way of making A->B time checks simple within the code > > OK, some details: > > 1. I'm starting to thing this should be a function pointer or a flag in gd. > Why oh why would you do such a thing I hear you ask... udelay() is needed > _EARLY_ - It may well be needed for SDRAM or other hardware initialisation > prior to any reliable timer sub-system being available. But early udelay() > is often inaccurate and when the timer sub-system gets fully initialised, > it can be used for an accurate udelay(). x86 used to have an global data > flag that got set when the timer-subsystem got initialised. If the flag was > not set, udelay() would use one implementation, but if it was set, udelay() > would use a more accurate implementation. In the case of the eNET board, it > had an FPGA implementation of a microsecond timer which was even more > accurate that the CPU, so it had it's own implementation that should have > duplicated the 'if (gd->flags& TIMER_INIT)' but never did (this was OK > because udelay() was never needed before then) > > I think a function pointer in gd would be a much neater way of handling the > fact that more accurate (and efficient) implementations of udelay() may > present themselves throughout the boot process I think this won't be popular, but I am not against it on the face of it. > > An other option would be to have the gd flag and make udelay() a lib > function which performs the test - If the arch timer has better than 1us > resolution it can set the flag and udelay() will use the timer API > > 2a (random epoch) - The timer sub-system should not rely on a particular > epoch (1970, 1901, 0, 1, since boot, since timer init, etc...) - By using > the full range of an unsigned variable, the epoch does not matter > > 2b (time interval) - We need to pick a base resolution for the timer API - > Linux uses nano-seconds and I believe this is a good choice. Big Fat Note: > The underlying hardware DOES NOT need to have nano-second resolution. The > only issue with nano-seconds is that it mandates a 64-bit timer - A few > people are against this idea - lets discuss. A 32-bit microsecond timer > provides ~4300 seconds (~1.2 hours) which _might_ be long enough, but > stand-alone applications doing burn-in tests may need longer. Going > milli-seconds means we cannot piggy-back udelay() on the timer API. Well, as you probably know, I am "one of those" against a 64 bit timer API to figure out if my disk is ready, especially if my hardware timebase does not support nanosecond resolution anyway. I think having a time base in milliseconds like we have now is perfectly adequate for "sane" bootloader usage. The requirement for the timer API would then be to provide the "current time" in millisecond resolution, and the time in as close to microsecond resolution as possible. This would be a "helper" function for udelay. Udelay could then always be implemented using microseconds to ticks and getting the subtracting delta ticks from the target value. If you can provide the first function, you can provide the second. They can usually be the same code with different constants involved. The gd-> call may actually be inside udelay, with the timer init replacing the pointer used by udelay to get the current time in "microseconds". The helper function for udelay should not be called by other parts of u-boot, to prevent abuse of this capability. Note that a 32 bit millisecond delay is like 49 days, so that should be plenty. I think it must be kept in mind that u-boot is used on lots of "small" systems. Using up lots of resources for 64 bit arithmetic that we don't need elsewhere is , IMHO, not a good idea. If it really did something that we need, all well and good, I am for it. But in this case, I don't think it really helps us any. But it does make things harder for some CPUs, and that is not a good thing. > > My preference is a 64-bit nano-second timer with ud
Re: [U-Boot] [PATCH v2] ARM926ejs: Add routines to invalidate D-Cache
On 8/5/2011 4:51 AM, Aneesh V wrote: > Hi Albert, > > On Friday 05 August 2011 04:33 PM, Albert ARIBAUD wrote: >> Hi Aneesh, >> >> On 05/08/2011 12:47, Aneesh V wrote: >>> Hi Eric, >>> >>> On Friday 05 August 2011 04:03 PM, Hong Xu wrote: Hi Aneesh, >>> [snip ..] > IMHO, Hong's approach is correct. If the buffer that is invalidated is > not aligned to cache-line, one cache-line at the respective boundary > may have to be flushed to make sure the invalidation doesn't affect > somebody else's memory. > > The solution is for drivers to ensure that any buffer that needs to be > invalidated is aligned to cache-line boundary at both ends. The above > approach puts this onus on the driver. I have documented the alignment > requirement in my recent patch series for fixing arm cache problems. I have not noticed the patch series. ;-) If we put the alignment burden to the driver, I'm afraid many drivers which make use of something like a DMA controller have to modify the code heavily. This sounds not good. :) >>> We have a fundamental problem when it comes to invalidating an >>> un-aligned buffer. Either you flush the boundary lines and corrupt your >>> buffer at boundaries OR you invalidate without flushing and corrupt >>> memory around your buffer. Both are not good! The only real solution is >>> to have aligned buffers, if you want to have D-cache enabled and do DMA >>> at the same time. >> Plus, there should not be *heavy* modifications; DMA engines tend to use >> essentially two types of memory-resident objects: data buffers and >> buffer descriptors. There's only a small handful of places in the driver >> code to look at to find where these objects are allocated and how. >> >> So I stand by my opinion: since the cache invalidation routine should >> only be called with cache-aligned objects, there is no requirement to >> flush the first (resp. last) cache line in case of unaligned start >> (resp.stop), and I don't want cache operations performed when they are >> not required. > IMHO, flushing is better, because the person who commits the > mistake of invalidating the un-aligned buffer is the one who is > affected and is likely to fix the issue soon. If we didn't flush, the > resulting corruption will cause totally random errors that will be hard > to debug. Doing an extra flush operation for a maximum of 2 lines > doesn't cost us anything. This is the approach followed by the kernel > too. Hi All, I am interested in this last statement in particular, that Linux allows non-cache aligned buffers for DMA. In a previous discussion series, we demonstrated why it was IMPOSSIBLE for a non-cache aligned DMA buffer to work in conjunction with a "write back" type cache. To be clear, by write-back, we mean a cache system that has a single dirty bit per cache line, and that if there are stores to any addresses in that line, the ENTIRE LINE will be written back into memory, not just the changed data. I also seem to recall that the ARM V7 caches were defined as write back, but I am not an ARM person so I don't know for sure what kind of cache we are talking about here. If it is write-back, there is only one possible solution that always works. Write-through will work with un-aligned buffers if the correct flushes and invalidates are present. In that case, buffer alignment is not so important. However, if the same driver is to be used in both cases, it must use cache-aligned buffers only. Best Regards, Bill Campbell > br, > Aneesh > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] i.MX51: FEC: Cache coherency problem?
On 7/23/2011 6:04 AM, Albert ARIBAUD wrote: > Le 21/07/2011 08:48, David Jander a écrit : > >>> However, it is still correct that copying from an non-cached area is >>> slower than from cached areas, because of burst reads vs. individual >>> reads. However, I doubt that the u-boot user can tell the >>> difference, as >>> the network latency will far exceed the difference in copy time. > > That's assuming cache is only for networking. There can be DMA engines > in a lot of other peripherals which do not have the same latency as > network (and then, even for networking, TFTP can be done from a very > nearby server, possibly even on the same Ethernet segment). Hi All, Yes, there are other uses of DMA. On a network, unless you have a Gigabit network, your memory access speed is at least an order of magnitude faster than the network, probably more. Plus, there is a latency due to sending the ack and request for the next record that undoubtedly swamps out any reduction in memory speed due to the single copy that takes place. In the case of other devices, like for example disks, the percentage effect is probably greater, but since these devices are so fast anyway that the human-perceived speed reduction is essentially nil. If we were talking about a CPU running Linux and doing all kinds of I/O all day long, the reduction in throughput performance might be 10% and that might matter. In a boot loader that does I/O mostly to read in a program to replace itself, I would argue that nobody will notice the difference between cached and un-cached buffers. Counter-examples welcome however. > >>> The >>> question is, which is easier to do, and that is probably a matter of >>> opinion. However, it is safe to say that so far a cached solution has >>> eluded us. That may be changing, but it would still be nice to know how >>> to allocate a section of un-cached RAM in the ARM processor, in so far >>> as the question has a single answer! That would allow easy portability >>> of drivers that do not know about caches, of which there seems to be >>> many. > > That is one approach, which I think prevents cache from being used > beyond caching pure CPU-used DRAM. You are certainly correct there. However, I think the pure CPU-used ram case is the one that matters most. Uncompressing and checksumming of input data are typical u-boot functions that take significant time. The performance increase due to cache hits in these cases is huge, and easily perceptible by the user. > >> I agree. Unfortunately, my time is up for now, and I can't go on with >> trying >> to fix this driver. Maybe I'll pick up after my vacation. >> As for now I settled for the ugly solution of keeping dcache disabled >> while >> ethernet is being used :-( > > Make sure you flush before disabling. :) > >> IMHO, doing cache maintenance all over the driver is not an easy or nice >> solution. Implementing a non-cached memory pool in the MMU and a >> corresponding >> dma_malloc() sounds like much more universally applicable to any driver. > > I think cache maintenance is feasible if one makes sure the cached > areas used by the driver are properly aligned, which simplifies things > a lot: you don't have to care for flush-invalidate or just-in-time > invalidate, you just have to flush before sending and invalidate > before reading. I do agree it can be done. However, most (I think?) of the CPUs to which u-boot have been ported have cache-coherent DMA. As a result, cache issues for these CPUs are not addressed in the driver at all. Often this means that cache support is done after the fact by somebody other than the original author who may not totally understand the original driver. If DMA buffers were always allocated from cache-coherent memory, either because the memory is un-cached or because the CPU is DMA cache coherent, no changes would be necessary to get the driver working correctly. If performance ever became an issue in the un-cached case, then more work would be required, but in most cases, I expect nobody will notice. Best Regards, Bill Campbell > >> Best regards, > > Amicalement, ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] i.MX51: FEC: Cache coherency problem?
On 7/20/2011 7:35 AM, Albert ARIBAUD wrote: > Le 20/07/2011 16:01, J. William Campbell a écrit : >> On 7/20/2011 6:02 AM, Albert ARIBAUD wrote: >>> Le 19/07/2011 22:11, J. William Campbell a écrit : >>> >>>> If this is true, then it means that the cache is of type write-back >>>> (as >>>> opposed to write-thru). From a (very brief) look at the arm7 >>>> manuals, it >>>> appears that both types of cache may be present in the cpu. Do you >>>> know >>>> how this operates? >>> Usually, copyback (rather than writeback) and writethough are modes of >>> operation, not cache types. >> Hi Albert, >> One some CPUs both cache modes are available. On many other CPUs (I >> would guess most), you have one fixed mode available, but not both. I >> have always seen the two modes described as write-back and >> write-through, but I am sure we are talking about the same things. > > We are. Copy-back is another name for write-back, not used by ARM but > by some others. > >> The >> examples that have both modes that I am familiar with have the mode as a >> "global" setting. It is not controlled by bits in the TLB or anything >> like that. How does it work on ARM? Is it fixed, globally, globally >> controlled, or controlled by memory management? > > Well, it's a bit complicated, because it depends on the architecture > version *and* implementation -- ARM themselves do not mandate things, > and it is up to the SoC designer to specify what cache they want and > what mode it supports, both at L1 and L2, in their specific instance > of ARM cores. And yes, you can have memory areas that are write-back > and others that are write-through in the same system. > >> If it is controlled by memory management, it looks to me like lots of >> problems could be avoided by operating with input type buffers set as >> write-through. One probably isn't going to be writing to input buffers >> much under program control anyway, so the performance loss should be >> minimal. This gets rid of the alignment restrictions on these buffers >> but not the invalidate/flush requirements. > > There's not much you can do about alignment issues except align to > cache line boundaries. > >> However, if memory management >> is required to set the cache mode, it might be best to operate with the >> buffers and descriptors un-cached. That gets rid of the flush/invalidate >> requirement at the expense of slowing down copying from read buffers. > > That makes 'best' a subjective choice, doesn't it? :) Hi All, Yes,it probably depends on the usage. > >> Probably a reasonable price to pay for the associated simplicity. > > Others would say that spending some time setting up alignments and > flushes and invalidates is a reasonable price to pay for increased > performance... That's an open debate where no solution is The Right > One(tm). > > For instance, consider the TFTP image reading. People would like the > image to end up in cached memory because we'll do some checksumming on > it before we give it control, and having it cached makes this step > quite faster; but we'll lose that if we put it in non-cached memory > because it comes through the Ethernet controller's DMA; and it would > be worse to receive packets in non-cached memory only to move their > contents into cached memory later on. > > I think properly aligning descriptors and buffers is enough to avoid > the mixed flush/invalidate line issue, and wisely putting instruction > barriers should be enough to get the added performance of cache > without too much of the hassle of memory management. I am pretty sure that all the drivers read the input data into intermediate buffers in all cases. There is no practical way to be sure the next packet received is the "right one" for the tftp. Plus there are headers involved, and furthermore there is no way to ensure that a tftp destination is located on a sector boundary. In short, you are going to copy from an input buffer to a destination. However, it is still correct that copying from an non-cached area is slower than from cached areas, because of burst reads vs. individual reads. However, I doubt that the u-boot user can tell the difference, as the network latency will far exceed the difference in copy time. The question is, which is easier to do, and that is probably a matter of opinion. However, it is safe to say that so far a cached solution has eluded us. That may be changing, but it would still be nice to know how to allocate a section of un-cached RAM in the ARM processor, in so far as the question has a single answer! That would allow easy portability of drivers that do not know about caches, of which there seems to be many. Best Regards, Bill Campbell > >> Best Regards, >> Bill Campbell > > Amicalement, ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] i.MX51: FEC: Cache coherency problem?
On 7/19/2011 11:14 AM, Anton Staaf wrote: > On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell > wrote: >> On 7/19/2011 2:05 AM, Albert ARIBAUD wrote: >>> Le 19/07/2011 10:43, Aneesh V a écrit : >>> >>>>>> You would have to flush (before sending packets / starting external >>>>>> memory-to-device DMA) and invalidate (before reading received packets / >>>>>> after external device-to-memory DMA is done); using MMU and mapping >>>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to >>>>>> the xmit/receive buffers and descriptors. >>>>> So, you say actually what I did while exploring the problem would have >>>>> been a >>>>> correct way of solving this problem? >>>>> >>>>> Like this: >>>>> >>>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >>>> This is what is needed assuming the below is initiating a memory to >>>> peripheral DMA. Is your buffer only 4 bytes long? >>> Generally: >>> >>> - for sending data through a device that has its own, external, DMA >>> engine, you'll obviously need to flush the data buffer(s) but also any >>> DMA descriptors used by the engine, before you start the engine; >>> >>> - for rceiving, if you have to set up receive descriptors, you must >>> flush that before telling the device to enter receive mode (so that the >>> device reads the descriptors as you wrote them), and you should >>> invalidate the receive buffers at the latest when the device signals >>> that data has been received, >> Hi All, >> >>>or preferably long before (at the same time >>> you flushed the read descriptor, so that cache-related actions are >>> grouped in the same place in the code). >> I think this last statement is incorrect. You should invalidate the >> cache for the receive buffers just before you intend to reference them. >> If you do it right after receive mode is entered, subsequent access to >> items NEAR the receive buffer may reload the cache with receive buffer >> data before the dma is done, re-creating the problem you are trying to >> avoid. Also, I don't know if ARM cache is write-back or write -thru, but >> if it is write-back, the only way to avoid problems is to allocate the >> receive buffers on cache line boundaries, so no "nearby" writes can >> cause something in the DMA buffer to be corrupted. If all receive >> buffers are allocated on cache-line boundaries (both start and end of >> each buffer), you can invalidate the cache "early" under the assumption >> that there will be no read accesses to the read buffers themselves until >> after DMA is complete. IMHO it is better, even in this case., to >> invalidate cache after dma is done but before referencing the read data. > This is a critical observation, and one I was going to make if I had > made it to the end of the thread and no one had already pointed it > out. In fact, there is no way with the current implementation (that I > can see) of the v7_dcache_inval_range function to correctly implement > a cache enabled DMA driven driver without aligning buffers to cache > line sizes. Below is the commit message from a recent patch I made > locally to fix the Tegra MMC driver. I wanted to start a discussion > on the list about forcing all buffers to be aligned to cache line > sizes. The problem is that many buffers are stack allocated or part > of structs that are of unknown alignment. > > mmc: Tegra2: Enable dcache support by bouncing unaligned requests. > > Dcache support was disabled due to dcache/dma interactions with > unaligned buffers. When an unaligned buffer is used for DMA the > first and last few bytes of the buffer will be clobbered by the > dcache invalidate call that is required to make the contents of > the buffer visible to the CPU post DMA. The reason that these > bytes are clobbered is that the dcache invalidate code (which is > the same as the linux kernels) checks for unaligned invalidates > and first flushes the cache lines that are being partially > invalidated. This is required because the cache lines may be > shared with other variables, and may be dirty. Hi Anton, >This flush however > writes over the first and last few bytes of the unaligned buffer > with whatever happened to be in the cache. If this is true, then it means that the cache is of type write-back (as opposed to write-thru). From a (very brief) look at the arm
Re: [U-Boot] i.MX51: FEC: Cache coherency problem?
On 7/19/2011 2:05 AM, Albert ARIBAUD wrote: > Le 19/07/2011 10:43, Aneesh V a écrit : > You would have to flush (before sending packets / starting external memory-to-device DMA) and invalidate (before reading received packets / after external device-to-memory DMA is done); using MMU and mapping cached/non-cached areas is IMO overkill, and will hurt CPU accesses to the xmit/receive buffers and descriptors. >>> So, you say actually what I did while exploring the problem would have >>> been a >>> correct way of solving this problem? >>> >>> Like this: >>> >>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4); >> This is what is needed assuming the below is initiating a memory to >> peripheral DMA. Is your buffer only 4 bytes long? > Generally: > > - for sending data through a device that has its own, external, DMA > engine, you'll obviously need to flush the data buffer(s) but also any > DMA descriptors used by the engine, before you start the engine; > > - for rceiving, if you have to set up receive descriptors, you must > flush that before telling the device to enter receive mode (so that the > device reads the descriptors as you wrote them), and you should > invalidate the receive buffers at the latest when the device signals > that data has been received, Hi All, > or preferably long before (at the same time > you flushed the read descriptor, so that cache-related actions are > grouped in the same place in the code). I think this last statement is incorrect. You should invalidate the cache for the receive buffers just before you intend to reference them. If you do it right after receive mode is entered, subsequent access to items NEAR the receive buffer may reload the cache with receive buffer data before the dma is done, re-creating the problem you are trying to avoid. Also, I don't know if ARM cache is write-back or write -thru, but if it is write-back, the only way to avoid problems is to allocate the receive buffers on cache line boundaries, so no "nearby" writes can cause something in the DMA buffer to be corrupted. If all receive buffers are allocated on cache-line boundaries (both start and end of each buffer), you can invalidate the cache "early" under the assumption that there will be no read accesses to the read buffers themselves until after DMA is complete. IMHO it is better, even in this case., to invalidate cache after dma is done but before referencing the read data. Best Regards, Bill Campbell > > Amicalement, ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite
On 7/15/2011 11:34 AM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4e208227.6010...@comcast.net> you wrote: >> If the I2C protocol must be available before interrupts are >> available, then udelay must be used. In the above examples, there are >> some loops in i2c and spi that appear to be waiting a full second. I >> assume they are using udelay because the get_timer feature is not yet >> available to them. I also assume that the example in common/usb.c uses > No, this is usually not the case. This long delay is the error case, > which most probably will never happen. For the normal case, you want > a tight spinning loop that introduces as little additional delay as > possible. Hi All, I understand why one would want to use udelay in those cases where there are other conditions that should cause the loop to exit. However, in the 16 examples you cited, 7 of those usages are for delays of multiple milliseconds with no tests or escape conditions included. I therefore pointed out that I assume that it was being done that way (as opposed to using get_timer) because this code would be executed "early", before get_timer was available. >> True, although I expect you will find the statement "on all the other >> architectures" to be false. Many other architectures, yes, all, no. >> These other architectures just don't have spi or I2C yet, or if they do, >> they don't use it "early". > Are you aware of another architecture that cannot provide > sub-microsecond timer resolution? Which is it? No, but I am aware of some that do not provide udelay at all until after interrupts are enabled and the get_timer interface is available. Several ARM implementations have this property. There are one or two others (SPARC) that are similar as I recall. There were also some that called reset_timer, which would mess up nested timing loops, but this has already been fixed. There also quite a few udelay implementations that do not actually provide 1 µs resolution. For instance, the first thing that happens in arm720t/interrupts.c __udelay is that the input microseconds is divided by 1000 and the remainder discarded. A udelay(10) won't work so well here. The lh7a40x/timer.c uses a 508 kHz clock, so delays are accurate to at best 2 µs. There are other examples. For instance, I am pretty sure some u-boot based CPUs use a 32kHz clock as the timebase for udelay. The resolution in that case is 30 µs, which would be a problem for your example code. I am sure further digging will reveal other resolution issues. These udelay issues would need fixing if all u-boot options are to work "out of the box". In some cases of low resolution, the fix is not easy, or maybe not even possible. However, this problem has not prevented the use of u-boot on these CPUs. It does mean that some things do not work "out of the box", and these non-compliant udelays should be fixed wherever practical. In any case, this shouldn't stop progress on get_timer and the delay_at_least (or whatever it ends up being called) function being adopted. That is independent of the udelay issues. udelay must be implemented in a "compliant" manner, which may or may not use the same hardware as get_timer() does. It certainly can't use the same code (although some ARM version in effect do, resulting in a non-compliant, low resolution udelay). It might be interesting to utilize something like "jiffies" for the loop delays that are looking for error exits. In those cases, the error case should be infrequent. It may not matter that the total delay is quite a bit longer than the user requested, as long as it isn't vastly shorter. If a 10 µsec delay became a 30 µsec delay inside the loop examples where we expect to exit due to another cause, it probably wouldn't matter much. It would matter a lot though if the delay were setting a baud rate on a "bit-banged" output, so jiffies may not be worth it. I put it out there for consideration, as it is easier to produce than a udelay that has a guaranteed low absolute error. Best Regards, Bill Campbell > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite
On 7/15/2011 12:17 AM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4e1f8127.8030...@comcast.net> you wrote: >> I am pretty sure that is long enough for someone to notice. . I would be >> interested in seeing an example of such code as you refer to. Could you >> point me to one, because it seems to me that the only reason to code >> such a delay is that for some reason the user didn't want to keep >> looking at the termination condition so often. I think that that > arch/powerpc/cpu/mpc512x/i2c.c: > > 80 while (timeout--&& (status& I2C_BB)) { > ... > 88 udelay (1000); > 89 status = mpc_reg_in (®s->msr); > 90 } > > 103 while (timeout--&& !(*status& I2C_IF)) { > 104 udelay (1000); > 105 *status = mpc_reg_in (®s->msr); > 106 } > arch/powerpc/cpu/mpc512x/pci.c: > > 87 /* We need to wait at least a 1sec based on PCI specs */ > 88 for (i = 0; i< 1000; i++) > 89 udelay(1000); > > arch/powerpc/cpu/mpc5xx/spi.c: > > 258 for (i = 0; i< 1000; i++) { > ... > 265 udelay(1000); > 266 } > > 390 for (tm=0; tm<1000; ++tm) { > ... > 394 udelay (1000); > 395 } > > arch/powerpc/cpu/mpc5xxx/i2c.c: > > 102 while (timeout--&& (status& I2C_BB)) { > ... > 111 udelay(15); > 112 status = mpc_reg_in(®s->msr); > 113 } > > 125 while (timeout--&& !(*status& I2C_IF)) { > 126 udelay(15); > 127 *status = mpc_reg_in(®s->msr); > 128 } > > And just that you don't think this is in a single CPU only: > > arch/powerpc/cpu/mpc8260/pci.c: > > 343 for (i = 0; i< 1000; ++i) > ... > 345 udelay (1000); > > Or in board code: > > board/altera/common/cfide.c: > > 29 /* wait 500 ms for power to stabilize */ > 30 for (i = 0; i< 500; i++) > 31 udelay(1000); > > board/amcc/bamboo/bamboo.c: > > 1019 for (i=0; i<500; i++) > 1020 udelay(1000); > > or common code: > > common/cmd_fdc.c: > > 213 while((read_fdc_reg(FDC_SRA)&0x80)==0) { > 214 timeout--; > 215 udelay(10); > 216 if(timeout==0) /* timeout occured */ > 217 return FALSE; > 218 } > > 228 while((read_fdc_reg(FDC_MSR)&0xC0)!=0xC0) { > 229 /* direction out and ready */ > 230 udelay(10); > 231 timeout--; > 232 if(timeout==0) /* timeout occured */ > 233 return -1; > 234 } > etc.etc. > > 375 for(val=0;val<255;val++) > 376 udelay(500); /* wait some time to start motor */ > > 418 for(i=0;i<100;i++) > 419 udelay(500); /* wait 500usec for fifo overrun */ > > 600 for(i=0; i<255; i++) /* then we wait some time */ > 601 udelay(500); > > > common/usb.c: > >93 inline void wait_ms(unsigned long ms) >94 { >95 while (ms--> 0) >96 udelay(1000); >97 } > > etc. etc. Note this last example which "spreads" the effect in a not > so nice way. > > Note that there are even places where udelay() is part of the protocol > timing implementation - like in the soft-I2C and soft-SPI drivers. > Hi All, Thanks for the pointers Wolfgang. From your examples, it seems we are talking exclusively about udelay. As I said in my previous post, on NIOS2, udelay needs to be mechanized on NIOS2 by a timed loop, not by using the same mechanism as the timer, as there is no way one can approximate microsecond resolution by the hardware timer. It is my opinion that there are many CPUs on which the udelay implementation must not be/should not be based on the same timer code as the get_time() operation, because in many cases the get_time() operation requires interrupts to be functional. As you have also pointed out, udelay needs to be available "early", before interrupts are available. Therefore, the udelay and get_timer must use the hardware at least somewhat differently. The must also not interfere with each other, which on some existing boards they did. If the I2C protocol must be available before interrupts are available, then udelay must be used. In the above examples, t
Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite
On 7/14/2011 12:41 PM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4e1cf2e0.1030...@comcast.net> you wrote: >> Yes, this is true. However, the time_elapsed_since routine can do >> this dynamically (i.e. add twice the timer resolution) . I think you had > That would IMHO be a very bad idea. We have a number of places where > we have to deal with pretty long timeouts (usually because of protocol > specifications that require this - often in the order of several > seconds), where the normal path is very fast. The typical approach is > to break the timeout into a large number of very short loops. > Sometimes we use udelay() for this, other places use get_timer(). > > So assume we have a timeout of 5 seconds, and implement this as 50,000 > loops of 100 microseconds. If you silently turn each of these into 20 > milliseconds on NIOS, the timeout would become 1,000 seconds instead > of 5 - users would return boards as broken and report "it just > freezes" because nobody expects that it will wake up again after some > 16 minutes. Hi All, If such a condition existed, that is indeed what would happen. However, at present, such code is not being used on NIOS. We know this because the current "work-around" of resetting the timer at the start of any timeout operation extends the timeout to a minimum of 10 milliseconds. So we would be waiting 8 minutes currently, not 16, and I am pretty sure that is long enough for someone to notice. . I would be interested in seeing an example of such code as you refer to. Could you point me to one, because it seems to me that the only reason to code such a delay is that for some reason the user didn't want to keep looking at the termination condition so often. I think that that equivalent operation can be produced by a pretty simple re-coding of the loop. In any case, NIOS does not have the problem at present, so the suggested new work-around would be no worse than the present situation. It is also true that the hardware timer cannot be used in a reasonable version of udelay, as most of the desired delays may be very short relative to the timebase. A calibrated timing loop would be the best approach to the udelay problem. >> another function name (at_least) involved, but you can define >> time_elapsed_since as always compensating for the resolution. That will >> fix any resolution questions in a processor-specific way. It is either >> that or the ifdefs. One way or another, the resolution must be >> addressed. Up to now, the implicit resolution has been 1 ms, but we now >> know that is not general enough. > It's not as simple as this. You have to change a lot of code to make > this work for such slow clock systems. In general, that is true. There may be a few cases where a delay of less than the resolution is essential to make something work. There are probably lots of other cases where we can easily remove the restriction on the resolution. We cannot fix the first kind, no matter what we do, to work on a lower resolution timer. The second kind we can and probably should fix, because they are coded in an overly-restrictive manner. In any case, we don't absolutely have to fix them until somebody decides to use the code on a CPU with a low resolution timer. Practically speaking, the suggested solution will therefore work on all extant cases. Best Regards, Bill Campbell > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite
On 7/12/2011 5:33 PM, Graeme Russ wrote: > Hi Reinhard, > > On 13/07/11 02:08, Reinhard Meyer wrote: >> Dear J. William Campbell, All > [snip] > >> Lets just keep the current functions udelay(us) and u32 get_timer(), the >> latter maybe without parameter. Remove all *masked() and *reset() functions > This is happening and has Wolfgang's 100% support. Patches are done, just > need to be rebased and resent > >> and lets change the timeout values used in code that also NIOS uses to >> be 20ms more than the hardware requires. It does not really matter if a >> *timeout* of 20ms is blown up to 40ms. > Can't do that - That would mean winding out common code (CFI, Network, > Serial, FPGA etc, etc) timeouts to global worst-case scenario unless you > want to add a #define and start littering the code with macro's - Urgh! Hi All, Yes, this is true. However, the time_elapsed_since routine can do this dynamically (i.e. add twice the timer resolution) . I think you had another function name (at_least) involved, but you can define time_elapsed_since as always compensating for the resolution. That will fix any resolution questions in a processor-specific way. It is either that or the ifdefs. One way or another, the resolution must be addressed. Up to now, the implicit resolution has been 1 ms, but we now know that is not general enough. Best Regards, Bill Campbell > > Regards, > > Graeme > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite
On 7/12/2011 8:23 AM, Scott McNutt wrote: > Dear Wolfgang > > Wolfgang Denk wrote: > >> What exactly is the reason that we cannot have better timer >> resolutions in NIOS? > You _can_ have better timer resolutions in Nios. However, there > are legacy systems that implement timer(s) with a fixed period > of 10 msec. The use of such implementations is very common. > How do I know? Because my customers have such implementations. > Why not upgrade? Because most sane engineers are loathe to > change their rtl just so they can fix a simple software bug > or add a simple custom feature. > > My obvious preference is to continue to use the latest u-boot in > such systems ... without having to create a custom branch for > customers with older Nios implementations, where I use the old > timer interfaces ... simply because we "improved" the API. > >> Scott, maybe you can comment here? > I have only one comment: Out of pure frustration, I have been > avoiding any discussions regarding this timer re-write effort. > >> At the moment I'm trying to understand if we really have a problem on >> NIOS2 that cannot be fixed in a way that is compatible with our >> current plans. > This is where my frustration begins. I've said this several times > before, and I don't know how else to say this ... but I'll give it > one more try: > > This is not a Nios problem. > > If I can't use a 10 msec timer interrupt to detect a simple timeout > condition when I'm programming a flash memory, then the timer API > design is garbage. Hi All, I agree with the idea that mandating a particular timer resolution is an un-desired limitation. You may need it for some cases, but not for all. Proof of this concept is the number of legacy UNIX systems running on 10 ms timer ticks, or for that matter the number of systems running on line cycle interrupts (16.666 ms in the US). It is also however true that one cannot ignore the timer resolution. That is what produced the problem in the first place. Asking for a delay that is less than the resolution isn't going to work as expected. This problem can be easily fixed by always adding twice the timer resolution to any requested delay, to ensure you waited at least two ticks , or by having a routine that does this for you (like the suggested API does). Right now, there are many places in the u-boot code that wait a certain small number of milliseconds under the assumption that the timer resolution is either 1) not an issue at all, or 2) is 1 ms. Both cases are wrong, the first case more so than the second. I wonder how many delays in the program that say "wait 5 ms" are aware that you could exit the loop after 4ms plus epsilon every so often, even with 1 ms resolution? Perhaps the tolerance is built into the number 5, perhaps not. However, it is surely not if the timer resolution is 16.666 ms. > And quite frankly, the enormous amount of > discussion in this matter reminds me of an undergraduate science > project. Or any government project. But when you will effect lots of people, you often must kill a few trees to make sure nobody (such as yourself and other NIOS users) somehow gets a bad deal. I am now returning to "real work" and will be offline for a while. Good Luck All. Best Regards, Bill Campbell >> We also have to deal with decrementing counters, but this is just aan >> unimportant detail. And it appears that we actually can have this, >> even on NIOS. > Wow! Even on Nios! Arrggghh! > >> The only architecture I remember that seemed prolematic was NIOS > Really? There have been occasional issues that have required a patch > here and there. But I consider most of them far from "problematic." > > This is exhausting. As I recall, this entire fuss was born out > of an issue with nested calls to get_timer() ... and that begat > removing reset_timer ... and that begat ... and so on. And we're > still spilling lots of intellectual seed here. And now we have > what? A jack-of-all-trades API that can do everything with > exacting precision ... other than handling a 10 msec interrupt? > > Best Regards, > --Scott > > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite
On 7/12/2011 6:10 AM, Wolfgang Denk wrote: > Dear Graeme Russ, > > Do we? What exactly is the needed resolution of the underlying > hardware timer? So far, it appears sufficient to have it ticking with > 1000 Hz or more. Are there really systems that cannot provide that? > The only architecture I remember that seemed prolematic was NIOS - but > then the NIOS documentation suggests that this might actually be Hi All, The NIOS2 "systems" are a FPGA based architecture ("soft" core). They CAN have all different kinds of timers. However, there are many already out in the world that were built (generated) with a timer that has 10 ms resolution. We can't retro-actively change these. We can declare them not supported in future releases, but we can't fix the resolution problem retroactively. I have two comments regarding this discussion so far. First, I think using the "time" function name at all is a VERY BAD idea. People will confuse it with the "normal" c library function that returns the time of day since the epoch. One may say that they should RTFM, but it is lots easier to just use another name and avoid all such confusion. Second, I think forcing all systems to use 64 bit time stamps is not a good idea. The main purpose of the timer API is to provide hardware timeouts. 64 bits is vast overkill for such a purpose. Mandating 64 bit capability for all u-boots is I think counter-productive. A 32 bit timestamp in millisecond resolution plus udelay covers all practical cases. The 1 ms resolution is probably inadequate for performance measurement, but for every u-boot system that is sometimes used for performance measurement, there are 100 (probably even more, like 1000, but I don't really have the figures) systems that are just used to boot Linux or another stand alone program. So we may need a different API for performance measurement that uses higher resolution. On most "larger" CPUs, there are hardware timers that can already do this with minimal work. On some less-capable CPUs, it may be lots harder, but if you aren't doing performance measurement, you don't need it. So I suggest that mandating a 64 bit API for some of the processors (68K comes to mind) is not really a good idea. It seems counter to the elegant simplicity of the rest of u-boot to stick a 64 bit requirement on timestamps used to determine if the boot delay is expired. Best Regards, J. William Campbell > Best regards, > > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 10:53 PM, Graeme Russ wrote: > Hi Bill, > > On 28/05/11 00:23, J. William Campbell wrote: >> On 5/27/2011 12:35 AM, Graeme Russ wrote: >>> Hi Wolfgang, >>> >>> On 27/05/11 17:13, Wolfgang Denk wrote: >>>> Dear Graeme Russ, >>>> >>>> In message you wrote: >>>>> I think we will need to define get_timer() weak - Nios will have to >>>>> override the default implementation to cater for it's (Nios') limitations >>>> Please don't - isn't the purpose of this whole discussion to use >>>> common code for this ? >>>> >>> Yes, but Nios is particularly bad - It has a 10ms tick counter :( >> Hi All, >> And a hardware timer that you can't read to subdivide the 10 ms. Note >> that this is not necessarily a problem with all NIOS implementations. The >> timer characteristics can be controlled when you generate the bitstream for >> the FPGA. You can make the counter both faster and readable if you want. It >> just uses a bit more silicon. Sad to say, it probably will require per >> board get_ticks routine. For the "old" nios2 timers however, overriding >> get_timer with a /board routine is probably the only way to go. >> > Actually, using the logic you presented in the "Remove calls to > [get,reset]_timer outside arch/" thread, we could have a common get_timer() > implementation which deals with all timer resolution issues: > > #if !defined(CONFIG_MIN_TIMER_RESOLUTION) > #define CONFIG_MIN_TIMER_RESOLUTION 1 > #endif > > u32 get_timer(u32 base) > { > if (base != 0) { > if (timer - base< (CONFIG_MIN_TIMER_RESOLUTION * 2)) > return 0; > else > return timer - base - CONFIG_MIN_TIMER_RESOLUTION; > } else { > return timer; > } > } > > This would also guarantee that timeout loops run for _at least_ the timeout > period > Hi Graeme, You are almost correct. The problem is that if you are calling get_timer early on in start-up, the timer can actually be 0. The loop initializing get_timer(0) will then return 0. If you can initialize the timer to a number !=0, then the above code works. If the timer is totally hardware derived, such an initialization may not be easy. For nios2, it is easy since the counter is software interrupt driven. It can just be initialized to 1. The 10 ms updates will then run like normal. It also won't work if the user is doing his own timer arithmetic by subtracting get_timer(0) values, but there is nothing we can do about that case anyway so that case is moot. Good job of spotting this approach. Best Regards, Bill Campbell > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 8:13 AM, Simon Glass wrote: > On Fri, May 27, 2011 at 8:00 AM, J. William Campbell > wrote: > [snip] >> Hi All, >> A more precise statement of the problem is that all timer delays >> may be shortened by the timer resolution. So this means that if you have >> a timeout of 1 ms in your get_time(0) { } while ( ...< 1), then your >> actual delay may be anywhere between 0 and 1 ms. The problem arises when >> some piece of common code uses a delay of say 8 millisec, expecting the >> actual delay to be between 7 and 8. If the resolution is 10 ms, the >> delay will be between 0 and 10 ms, 0 being particularly bad. This can be >> fixed in get_timer, making the 8 ms delay become a minimum of 10 ms at >> the expense of it becoming up to 20 ms sometimes. Since these delays are >> used mostly for error conditions, making them longer will probably be >> ok, and doesn't require changing any of the common code. It probably >> will not make things slower either, because the error timeouts should >> not be reached. The reset of the hardware timer would cause all "short" >> delays to become 10 ms. This reset approach is bad in that it prevents >> proper nesting of timing loops. However, in this case it isn't so bad, >> in that the nested loops are just extended, not shortened. Note that if >> the reset is only resetting the HARDWARE interrupt generator, not the >> actual timestamp itself, we are just extending all existing timeouts by >> 0 to 10 ms.. So this just lengthens all pending timeouts. The other fix >> is in my opinion nicer, because it affects the nest loops less. If the >> inner loop is executed 100 times, with the reset, the outer loop timeout >> is extended by up to 1000 ms. >> >> Best Regards, >> Bill Campbell > Hi Bill, > > Yes I agree that this is ugly - I didn't realize that this is what > reset_timer() does, but I think these 10ms platforms should have to > live with the fact that timeouts will be 0-10ms longer than hoped. > Perhaps reset_timer() should become a non-standard board thing that is > deprecated. Really if you have a 10ms timer and are asking for a 10ms > timeout you are being a bit hopeful. Hi All, Yes, but the person writing the driver was writing "common" code. He probably didn't even know there was a timer whose resolution was not 1 ms. > But perhaps this argues for a function to check timeouts - at the > moment get_timer() returns the time since an event and it is used at > the start of the loop and the end. Perhaps we should have: > > #define TIMEOUTMS 2000 > > stop_time = get_future_time(TIMEOUT_MS); // Returns current time + > TIMEOUT_MS + (resolution of timer) > while (get_timer(stop_time)< 0) // (I would much prefer while > (!timed_out(stop_time)) > wait for something > } > > Regards, > Simon In the existing system, you can get the same result by running the while loop with a condition of (get_timer(base) < TIMEOUTMS + TIMER_RESOLUTION). We could just make TIMER_RESOLUTION a mandatory define for all u-boots. Then common code would be wrong if the TIMER_RESOLUTION were omitted. For all I know, there may be such a define already. Anybody know of one? Best Regards, Bill Campbell ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 8:44 AM, Scott McNutt wrote: > J. William Campbell wrote: >> On 5/27/2011 6:07 AM, Scott McNutt wrote: >>> Graeme Russ wrote: >>>> Hi Wolfgang >>>> >>>> On Friday, May 27, 2011, Wolfgang Denk wrote: >>>>> Dear Graeme Russ, >>>>> >>>>> In message >>>>> you wrote: >>>>>> Besides, Nios can return an increment of 10 (presumably ms) between >>>>>> two immediately consecutive calls. This causes early timeouts in CFI >>>>>> driver >>>>> Now this in turn is a bug in the timer implementation that needs >>>>> to be >>>>> fixed. >>> >>> And this is what reset_timer() corrected. >>> >>>> Agreed, but that is not something I can achieve - I don't want to hold >>>> up this whole show that we have all put so much effort into for the >>>> sake of one weak function >>> >>> And I don't want to see something that currently works become broken >>> because we "improved" a feature ... simply because the resolution of >>> the timestamp is 10 msec rather than 1 msec. >>> >>> And just to be clear. This is not a Nios issue. Currently, if the >>> timestamp is incremented via a fixed period interrupt, and the period >>> of the interrupt is longer that 1 msec, calls to get_timer() may >>> produce early timeouts ... regardless of platform. > >> This can be fixed in get_timer, making the 8 ms delay become a >> minimum of 10 ms at the expense of it becoming up to 20 ms sometimes. > > Ok. Now I get it. Thanks. > > >> This reset approach is bad in that it prevents proper nesting of >> timing loops. > > In my particular case, because reset_timer() resets the timestamp > to zero rather than simply restarting the timer. I believe leaving > the timestamp alone would solve the nesting problem. > > >> The other fix is in my opinion nicer, because it affects the nest >> loops less. If the inner loop is executed 100 times, with the reset, >> the outer loop timeout is extended by up to 1000 ms. > > Bill, thank you for explaining -- probably for the nth time -- but it > did finally sink in. Hi Scott, Glad to help, I finally think I understand it myself in looking into it further! I think we have a good way ahead that should keep everything working. We will get you an alpha copy of whatever we do as soon as possible so you can verify we didn't break nios2! Best Regards, Bill Campbell > > Regards, > --Scott > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 11:54 PM, Graeme Russ wrote: > On Fri, May 27, 2011 at 4:33 PM, J. William Campbell > wrote: >> On 5/26/2011 9:33 PM, Graeme Russ wrote: >>> Hi Bill, >>> >> > [massive snip] > > OK, you have my ears pricked - Can you give me code samples for: > > - get_ticks() > - sync_timbase() (no need to implement the whole lot if that is too > much effort right now) > - timer_isr() > > that handle the following hardware tick counter scenarios: > > a) 64-bit up counter > b) 64-bit down counter Hi Graeme, > c) 32-bit up counter, wraps at 65000 Do you mean 32 bits or 16 bits? doesn't make much difference, but just checking. > d) 16-bit microsecond up counter 0-999 which wraps and triggers a 16-bit > millisecond up counter. Reading milliseconds latched microseconds and > clears milliseconds (look in arch/x86/cpu/sc520/timer.c for example) > e) 24-bit counter occupying bits 2-25 of a 32-bit word (just to be > difficult) > f) Any other option anyone cares to throw ;) > > All of these options must be covered using: > - Minimal global data (we would like it to work before relocation, but > not mandatory - GD footprint would be nice) > - All use the same sync_timebase function > - Demonstrate using an ISR NOT synced to the hardware tick counter and > an ISR that is > - Parameters to get_ticks() and sync_timer() are permitted, but not for > timer_isr() (naturally) > OK! Once again, I accept the challenge. One Caveat. My real work has been sliding due to the time I have been spending on this. I am flying from San Francisco to Sydney tonight , (to work, not to play), so I will be off-grid for 14 hours+. You will not get this code for a few days, like probably 3 days. I have looked at the requirements, and I see no real problems that I don't know how to solve. >> I don't' see any reason to push this down to a lower level. It is just one >> more thing to get messed up across implementations. > Agreed > >>>> detection in the non-interrupt case to sync_timebase as well. >>>> Sync_timebase >>>> can also invert the down-counting counters, removing that from get_ticks. >>>> The wrap detection code can be #ifdef out if one is using interrupts and >>> Urghh - Why are you adding unnecessary ugliness - #ifdef in the middle of >>> code is usually a sign you are doing something wrong >> As I said, this is an optional optimization. I do not agree that an #ifdef >> in the middle of code indicates you have a bad design. Lots and Lots of >> ifdefs certainly indicates a bad design. An ifdef to eliminate code if some >> option is not selected is hardly such a strange thing, especially only a >> single #ifdef. However, feel free to not have it if you so desire. > OK, I'll let this slide for the moment - please include in above example Will Do. >>>> offended by it's presence. Thanks for pointing this out and compelling me >>>> to >>>> reduce the number of cases! Making get_ticks more lightweight is a good >>>> idea >>>> in my opinion. >>>>> Lets say you have a platform with a 32-bit tick counter running at a >>>>> reasonably long rollover time so you decide not to use interrupts. Then >>>>> you create a new platform with the same tick counter, but it runs much >>>>> faster and you realise you need to implement the interrupt routine to >>>>> make get_timer() work for long enough periods - Fine, you add an ISR >>>>> for the new platform that calls sync_timebase - No other changes are >>>>> required. >>>>> >>>>> The last thing we want is for the 64-bit tick counter to be conceptually >>>>> different across platforms >>>>> >>>>> I just realised - the ISR _does not need to call the sync_timebase at >>>>> all_ >>>>> The ISR only needs to call get_ticks(), so it will be fast anyway >>>> The problem is that the way we previously detected wrapping does not work >>>> if >>>> the interrupt rate is == to the counter wrap time, which it essentially >>>> always is. If get_ticks is trying to update the wrap count when an >>>> interrupt >>> Is it, always, on every platform? >> Yes, pretty much. You get a terminal count/counter underflow interrupt and >> that is it. > Not on sc520 - The micro/millisecond counter cannot be used to driver an > interrupt - you need to setup a seperate timer. I think the x86 internal > performance counters are the same What is true, as
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 6:07 AM, Scott McNutt wrote: > Graeme Russ wrote: >> Hi Wolfgang >> >> On Friday, May 27, 2011, Wolfgang Denk wrote: >>> Dear Graeme Russ, >>> >>> In message you >>> wrote: Besides, Nios can return an increment of 10 (presumably ms) between two immediately consecutive calls. This causes early timeouts in CFI driver >>> Now this in turn is a bug in the timer implementation that needs to be >>> fixed. > > And this is what reset_timer() corrected. > >> Agreed, but that is not something I can achieve - I don't want to hold >> up this whole show that we have all put so much effort into for the >> sake of one weak function > > And I don't want to see something that currently works become broken > because we "improved" a feature ... simply because the resolution of > the timestamp is 10 msec rather than 1 msec. > > And just to be clear. This is not a Nios issue. Currently, if the > timestamp is incremented via a fixed period interrupt, and the period > of the interrupt is longer that 1 msec, calls to get_timer() may > produce early timeouts ... regardless of platform. Hi All, A more precise statement of the problem is that all timer delays may be shortened by the timer resolution. So this means that if you have a timeout of 1 ms in your get_time(0) { } while ( ... < 1), then your actual delay may be anywhere between 0 and 1 ms. The problem arises when some piece of common code uses a delay of say 8 millisec, expecting the actual delay to be between 7 and 8. If the resolution is 10 ms, the delay will be between 0 and 10 ms, 0 being particularly bad. This can be fixed in get_timer, making the 8 ms delay become a minimum of 10 ms at the expense of it becoming up to 20 ms sometimes. Since these delays are used mostly for error conditions, making them longer will probably be ok, and doesn't require changing any of the common code. It probably will not make things slower either, because the error timeouts should not be reached. The reset of the hardware timer would cause all "short" delays to become 10 ms. This reset approach is bad in that it prevents proper nesting of timing loops. However, in this case it isn't so bad, in that the nested loops are just extended, not shortened. Note that if the reset is only resetting the HARDWARE interrupt generator, not the actual timestamp itself, we are just extending all existing timeouts by 0 to 10 ms.. So this just lengthens all pending timeouts. The other fix is in my opinion nicer, because it affects the nest loops less. If the inner loop is executed 100 times, with the reset, the outer loop timeout is extended by up to 1000 ms. Best Regards, Bill Campbell > > --Scott > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 12:35 AM, Graeme Russ wrote: > Hi Wolfgang, > > On 27/05/11 17:13, Wolfgang Denk wrote: >> Dear Graeme Russ, >> >> In message you wrote: >>> I think we will need to define get_timer() weak - Nios will have to >>> override the default implementation to cater for it's (Nios') limitations >> Please don't - isn't the purpose of this whole discussion to use >> common code for this ? >> > Yes, but Nios is particularly bad - It has a 10ms tick counter :( Hi All, And a hardware timer that you can't read to subdivide the 10 ms. Note that this is not necessarily a problem with all NIOS implementations. The timer characteristics can be controlled when you generate the bitstream for the FPGA. You can make the counter both faster and readable if you want. It just uses a bit more silicon. Sad to say, it probably will require per board get_ticks routine. For the "old" nios2 timers however, overriding get_timer with a /board routine is probably the only way to go. l > I don't see reason for hamstring other platforms when a simply weak > function can get around it Agree. Best Regards, Bill Campbel > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 12:33 AM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4ddf2072.5090...@comcast.net> you wrote: > ... >> The problem is that the way we previously detected wrapping does not >> work if the interrupt rate is == to the counter wrap time, which it >> essentially always is. If get_ticks is trying to update the wrap count > You ignore the fact that this is only ever a problem when the rollover > cannot signal through an interrupt or similar. Also, some processors > allow daisy-chaning of timers, etc. > > Again, I would really like to know about how many exotic systems we > are talking that fulfil your worst-case expectations. I bet the > overwhelming majority behaves absolutely harmless. Hi Wolfgang, I think that in fact the opposite is true. The problems occur if both the main program and the interrupt routine are trying to update the timer msb using the same code, as we were originally talking about. There is no problem if only the interrupt routine detects the rollover. That is the correct way to go if your interrupts work. There was nothing particularly exotic required. It was the "normal" case. Take a look at what would happen on the PPC is the main program was reading the decrementer, detecting wraps and increasing the timestamp while the interrupt routine was also incrementing the timestamp. Every so often you get a double increment. Why were we doing this? Because I was trying to re-use exactly the same code in the interrupt case and the non-interrupt case. Not a good idea, in fact a bad idea as it turns out. Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/27/2011 12:28 AM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4ddefdbc.7050...@comcast.net> you wrote: >> I really STRONGLY disagree with this statement. If you actually needed >> 64 bit variables, fine use them. But as I have already shown, you do not >> need them in general. We are computing a 32 bit result. There is some >> entropy argument that says you shouldn't need 64 bits to do so. Another >> way to look at it is that converting the top 32 bit word and the bottom >> 32 bit word to ms separately can be easier than doing them both together >> at once. However, as we will see below, I do agree we need two 32 bit >> words to make this process go smoothly. I just don't agree that they >> should/will constitute a 64 bit binary word. See below. > And I disagree with this. > Hi Wolfgang, OK, I hear you. >> Yes, that is the problem. I have come to the view that two 32 bit words >> are the best approach. Note that the lsb may actually not fill the full >> 32 bits. The top 32 bits are the rollover count and the bottom 32 bits >> are the current counter. If the counter is a full 32 bits, so much the >> better. Again, one could put them together inside the interrupt routine >> , but it is easier to check for a changed value if you don't do this. > It's even easier if you use a single 64 bit variable, because then you > can simply use ==. > In general, no you can't, or at least you probably don't want to. . If you are reading a 64 bit performance counter, it is quite likely that you cannot read it twice without the "clock" having "ticked". If the CPU executes 1 instruction (or fewer(if an SPR/memory reference is involved?) per performance counter tick, which is the goal of the performance counter, == is an infinite loop A similar condition exists if you are combining a software counter with a fairly fast hardware counter. It might require flipping the hardware counter (if it is a down counter) and a 64 bit multiply add, which must be done in software/a subroutine if the cpu has no 64 by 64 multiply. By the time that is done, the timer LSB may have ticked. Consider the m68K case. >> Otherwise, you have to check both words. It also makes the isr faster. > Drop any thoughts about "make FOO faster" for now, please. Especially > at this stage it is much more important to have a simple and clean > design. If split in two variables, even a simple read access will > turn into code like > > do { > upper = timebase_upper; > lower = timebase_lower; > } while (upper != timebase_upper); > > This is not exactly as simple as you claimed. True, but if you look at a lot of 64 bit performance counters, that is EXACTLY what the handbook book recommends on how to read them. There is no atomic way to read them both at once, and reading one half doesn't freeze the other half. This code is also required if timebase_upper is altered in the interrupt routine. YMMV, but in a lot, dare I say most, cases this is required anyway. And while the code is more complex than a simple assignment statement, it is not very complex. Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 9:33 PM, Graeme Russ wrote: > Hi Bill, > > get_ticks() does not care about the clock rate - It simply looks at the > current value of the hardware tick counter and the value of the hardware > tick counter the last time get_ticks() was called, calculates the difference > and adds that to the 64-bit software tick-counter > I don't see how it being a down counter makes that any more difficult > >> This is neither simple nor maintainable. Further, it is un-necessary, as the >> sync_timer routine is just going to convert the number from whatever radix >> we converted it to into millisec. If we leave the two numbers as split, all >> that complexity is removed from get_ticks and sent upward to the common >> routine that converts the answer into ms anyway. This makes the system more >> maintainable by placing the minimum requirement on get_ticks. The "tick" >> should be opaque to anybody but sync_timebase anyway. > But how is the common routine going to know if it is a split timer, up > timer, down timer, little endian, big endian, etc, etc. > > get_ticks() abstracts the hardware implementation of the hardware timer > from sync_timer() Hi All, I understand your point. I prefer a higher level of abstraction. You are correct that there are some aspects of the tick counter that are very hardware quirky, and these attributes are hard to abstract. If the timer is embedded into a bit field with other variables, it is reasonable to expect get_ticks to extract the bit field and right justify the number. If there are endian issues, the get_ticks routine must return the number in the "natural" endianness of the platform. However, after that point, the values are extremely "regular". The fact that a counter is a down counter can be expressed in a data structure as a boolean. The high limit of the hardware counter is a number. The number of ticks per millsecond is obtainable from usec2ticks(1000), or 1 if we want to avoid some roundoff. From these values, sync_timer can take the two part ticks value and convert it to millisec. Trust me on this. I have the routines to do it. This puts as much of the abstraction of the two numbers into ONE COMMON ROUTINE, sync_timer. Now it is clearly possible to move some of the abstraction down a level into sync_timer. For instance you could move inverting the counter down to that level, and then multiply the msb by the maximum value of the lsb counter and add in the msb. It is clearly possible to move ALL of sync_timer down into get_ticks, if one wanted to. It is clearly possible to replace general values in gd-> with platform specific constant values. However, if you do that, you end up with a lot of duplicate, or almost duplicate, code running around. That has proven to be error prone, and it has left new ports of u-boot to sort of fend for themselves in figuring out how things should work. I prefer to abstract that all up into sync_timer. That way, all the math is in one place, and is table driven so it is easy to change. The top 32 bits are the rollover count and the bottom 32 bits are the current counter. If the counter is a full 32 bits, so much the better. >>> Ah - Lets keep it that way >>> Again, one could put them together inside the interrupt routine , but it is easier to check for a changed value if you don't do this. Otherwise, you have to check both words. It also makes the isr faster. It is just an >>> As I said before - Simple First, Fast Later >> I am in favor of simple. That is why I want get_ticks to do as little as >> possible. It should just read the hardware register and the overflow counter >> if it is separate. Make sure the overflow didn't change while we were >> reading. This is redundant if we are not using interrupts but we can leave >> the code in. It just won't do anything. We can also move the rollover >> detection to sync_timebase. It will be redundant if we are using interrupts, >> because time will never "back up". But we can do it this way. This >> centralizes the overflow detection, which is a good thing. > That does not sound simple to me. This, however, does: > > u64 get_ticks(void) > { > static u64 last_hw_tick_count; > static u64 last_sw_tick_count; > > /* Now for the platform specific stuff - Read hardware tick counter */ > u64 current_hw_tick_count = /* Read hw registers etc */ > > /* Deal with hardware weirdness - errata, stupid hw engineers etc */ > > u64 elapsed_ticks = current_hw_tick_count - last_hw_tick_count; > last_sw_tick_count += elapsed_ticks; > > return last_sw_tick_count; > } > > The '/* Read hw registers etc */' bit will always be the same, no matter > what way you do it Agree. > The '/* Deal with hardware weirdness - errata, stupid hw engineers etc */' > bit is where we are truly abstracting the hardware away - This is the > bit you propose to leave mangled and deal with in sync_time? Not totally. The get_
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 6:51 PM, Graeme Russ wrote: > Hi Bill, > > On Fri, May 27, 2011 at 11:26 AM, J. William Campbell > wrote: >> On 5/26/2011 4:28 PM, Graeme Russ wrote: >>> Why mess around with bit shifting (which you would then have to cludge >>> into >>> your platform code) when carting around a 64-bit value is relatively >>> cheap, >>> transparent and poratble (all all supported up-to-date tool chains) >>> >> I really STRONGLY disagree with this statement. If you actually needed 64 >> bit variables, fine use them. But as I have already shown, you do not need >> them in general. We are computing a 32 bit result. There is some entropy >> argument that says you shouldn't need 64 bits to do so. Another way to look >> at it is that converting the top 32 bit word and the bottom 32 bit word to >> ms separately can be easier than doing them both together at once. However, >> as we will see below, I do agree we need two 32 bit words to make this >> process go smoothly. I just don't agree that they should/will constitute a >> 64 bit binary word. See below. >>>>> - void timer_isr() >>>>> - Optional (particularly if tick counter rollover period is >>>>>exceptionally log which is usually the case for native 64-bit tick >>>>>counters) >>>>> - Simply calls sync_timebase() >>>>> - For platforms without any tick counter, this can implement one >>>>>(but accuracy will be harmed due to usage of disable_interrupts() >>>>> and >>>>>enable_interrupts() in U-Boot >>>>> >>>>> So to get the new API up and running, only two functions are mandatory: >>>>> >>>>> get_ticks() which reads the hardware tick counter and deals with any >>>>> 'funny >>>>> stuff' including rollovers, short timers (12-bit for example), composite >>>>> counters (16-bit 0-999 microsecond + 16-bit millisecond) and maintains a >>>>> 'clean' 64-bit tick counter which rolls over from all 1's to all 0's. >>>>> The >>>> I think it is the task of get_ticks to return the hardware tick counter >>>> as >>>> an increasing counter, period. The counter may wrap at some final count >>>> that is not all ones. That is ok. Sync_timebase deals with the rollovers >>>> if >>> The hardware tick counter may, the 64-bit software tick counter maintained >>> by get_ticks() may not >>>> necessary. get_ticks is very lightweight. get_ticks should deal with >>>> decrementing counters by returning the complement of the counter. The >>>> sc520 >>>> case is a bit more complex if you intend to use the 0-999 and 16 bit >>>> millisec registers, in that you do need to add them to the previous value >>>> to >>> As I mentioned in another post, this is a problem for the platform >>> maintainer and is abstracted away throught the platform specific >>> implementation of get_ticks() >>> >>>> make an increasing counter. Sync_timebase "likes" short counters in that >>>> they are easy to convert to millisec and tick remainders. >>> The compiler should handle using 64-bit rather than 32-bit transparently >> True enough. But you don't need 64 bit variables at this point two 32 bit >> ones work just fine, in fact better in most cases. > Remember, we are not dealing with a high performance OS here. The primary > goal is portability - Performance optimisations (which do not break > portability) can be performed later > >>>>> 64-bit tick counter does not need to be reset to zero ever (even on >>>>> startup >>>>> - sync_timebase tacks care of all the details) >>>> True, but sync_timebase does have to be initialized (as does the timer >>>> itself in most cases, so this is not a restriction). >>> This can be done in timer_init() via a call to sync_timebase() after the >>> timer has been configured. This should bring everything into line >>> >>>>> ticks_per_millisecond() simply return the number of ticks in a >>>>> millisecond >>>>> - This may as simple as: >>>>> >>>>> inline u64 ticks_per_millisecond(void) >>>>> { >>>>> return CONFIG_SYS_TICK_PER_MS; >>>>> } >>>>> >>>>> But it may be trickier if you have a programmable tick frequ
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 4:28 PM, Graeme Russ wrote: > Hi Bill, > > On Fri, May 27, 2011 at 2:56 AM, J. William Campbell > wrote: >> On 5/26/2011 6:27 AM, Graeme Russ wrote: >>> Hello Everyone, >>> >>> OK - Starting a new thread to discuss implementation details. This is a >>> heads-up for arch/platform maintainers - Once this is a bit more stable, I >>> will put it on the wiki >>> >>> Assumed Capabilities of the Platform >>> - Has a 'tick counter' that does not rely on software to increment >> Hi All, >>The nios2 with the most basic timer does not meet this requirement. It >> will not count at all without the 10 ms interrupt. I don't think this >> requirement matters anyway. We need a 'tick counter' that 'ticks'. If it >> takes software to make it tick, we don't much care. There may be problems >> with early use of udelay in that case, but that is a different issue. > I think we will need to define get_timer() weak - Nios will have to > override the default implementation to cater for it's (Nios') limitations Hi All, Yes, that will probably be required here. >>> - tick interval may by a fixed constant which cannot be controlled >>> via software, or it could be programmable (PIT) >>> >>> API Functions (/lib/timer.c) >>> - u32 get_timer(u32 start) >>> - Returns the number of elapsed milliseconds since 'start' >>> >>> API Functions (/arch/...) >>> - void udelay(u32 delay) >>> - Used for 'short' delays (generally up to several seconds) >>> - Can use the tick counter if it is fast enough >>> - MUST NOT RESET THE TICK COUNTER >> There is a requirement that udelay be available before relocation and before >> the BSS is available. One can use the tick counter to provide udelay as long >> as sync_timebase is not called OR sync timebase does not use BSS. It appears >> many implementations ignore this requirement at present. We should try to >> fix this, but is should not be a requirement. > If you really wanted to, sync_timebase() could use global data (it doesn't > have many static variables) in which case all timer functions would be > available before relocation Yes, my implementation of the sync_timebase routine was written that way, using gd-> for the required variables. >>> 'Helper' Functions (/lib/timer.c) >> I think this function should be weak, so that it is possible for people to >> override it with a "custom" function. The fully general sync_timebase has >> lots of code in it that can be simplified in special cases. We want and need >> a fully general function to be available, but other users who are real tight >> on space may want a cut down version. We should make that easily possible. > Agree > >>> - void sync_timebase(void) >>> - Updates the millisecond timer >>> - Utilises HAL functions to access the platform's tick counter >>> - Must be called more often than the rollover period of the >>>platform's tick counter >>> - Does not need to be called with a regular frequency (immune >>>to interrupt skew) >>> - Is always called by get_timer() >>> - For platforms with short tick counter rollovers it should >>>be called by an ISR >>> - Bill Campbell wrote a good example which proved this can be common >>>and arbitrary (and optionally free of divides and capable of >>>maintaining accuracy even if the tick frequency is not an even >>>division of 1ms) >>> >>> HAL Functions (/arch/... or /board/...) >>> - u64 get_ticks(void) >> For what it's worth, I would like to propose using a (gasp!) typedef here. >> It seems to me there are a whole lot of cases where the max number of ticks >> is a u32 or less. In those cases, the wrap at 32 bits helps things a lot. If >> the tick counter is really 64 bits, the function of sync_timebase is simply >> to convert the tick value to millisec, and that is it. Otherwise, if it is >> 32 bits or less then some other actions will be required. I think this is >> one of those times where a typedef would help, We could define a type called >> timer_tick_t to describe this quantity. That would allow a pure 32 bit >> implementation where appropriate. >> >> Another suggestion is that perhaps we want a u32 get_ticks_lsb(void) as well >> as a regular get_ticks. The lsb version would be used for udelay and could >&
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 1:27 PM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4ddeafe0.8060...@comcast.net> you wrote: >> I certainly agree using 64 bits for all calculations is vast overkill. >> In fact, I think using 64 bit calculations on systems that have only a >> 32 bit or less timer register is probably overkill. :-) However, to >> date,AFAIK, no processor has exceeded the u32 in ticks per second. As I > Not yet. But it makes no sense to start a new design with settings > already in critical range, especially since there is zero problem > with breaking it down by a factor of 1000 or 1e6. > >> pointed out in a previous e-mail, if they ever do this, we can just drop >> one or 2 bits off the 64 bit counter and in millisecond resolution, >> nobody will ever know. Also as previously pointed out, usec2ticks is > No. I will not accept a design that is so close on the edge of > breaking. > > What is your exact problem with the existing interfaces ticks2usec() > and usec2ticks() ? > >> not present yet in lots of implementations. Also, if the fundamental >> clock frequency is 32 kHz (or anything less than 1 MHz), usec2ticks is >> 0! This probably rules out using it to get ticks per millisec or ticks >> per sec. > The statement "usec2ticks is 0" makes absolutely no sense as long as > you don't say which argument you pass in. You get a return value of > 0 even for a tick rate in the GHz range if you pass 0 as argument. > > Hoewver, usec2ticks(1000) or maybe usec2ticks(10) will probably > return pretty useful values. > > [Note that by passing properly scaled arguments you can also avoid a > number of rounding errors.] Hi Wolfgang, Yes, you are correct. I was thinking usec2ticks(1), which is certainly not the way to do it. I am happy with usec2ticks and ticks2usec. That works for me. Sorry for the noise. How about the first part of my response? Are you still thinking about it or is it just too bad for words :-) ? Best Regards, Bill Campbell > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 12:16 PM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4ddea165.9010...@comcast.net> you wrote: >>>> I think it is the task of get_ticks to return the hardware tick counter >>>> as an increasing counter, period. The counter may wrap at some final >>>> count that is not all ones. That is ok. Sync_timebase deals with the >>> NO! We want to be able to compute time differences using simple >>> unsigned arithmentics, even after a rollover of the counter. For this >>> it is mandatory that the counter always gets only incremented until it >>> wraps around at te end of it's number range, and never gets reset >>I agree that that is what must happen, but it should happen inside >> sync_timebase. Sync_timebase does what is needed to convert the >> less-than-fully capable counters into a fully capable one. You could > I think you also want this behaviour for get_ticks(). Hi Wolfgang, I understand why that might be nice. But to do that with common code would require get_ticks to call a generic routine (say sync_ticks) that would expand the counter to 64 bits. Note that this is not always totally trivial, as the timer may roll over at 10 ms or some other not-so-nice number. Then sync_timer would convert the 64 bit number to milliseconds. That approach will work. However, I think that is overkill, as we really want the result in milliseconds. If you look at the prototype sync_timer routine, you can see an example of how this is possible without resorting to 64 bit math. I think that avoiding the 64 bit math on processors that don't have a 64 bit tick counter (and are therefore probably less capable) is worthwhile. I also think that the purpose of the get_time routine abstracting the time into milliseconds is to avoid dealing with ticks anywhere except in the timer routines. Presumably, nobody but sync_timer would ever have reason to call get_ticks. If that is not your thinking, fine, we just disagree on that point, and having a sync_ticks to expand the tick counter certainly can be done. >> To date, it has been shown conclusively that this process cannot be >> relied upon, or we wouldn't be having this discussion. If we put that >> functionality inside sync_timebase, it is in one place and debuggable >> once. All sync_timebase requires to do this is ticks per second and >> maximum tick value. I do request that counters that decrement be negated >> in the get_ticks routine, but beyond that, it should be a simple read of >> the tick register(s). > I think using ticks per second is not a good idea. It may exceed > ULONG_MAX, and having to use 64 bit for all calculations is probably > overkill. The existing ticks2usec/usec2ticks work fine so far. I certainly agree using 64 bits for all calculations is vast overkill. In fact, I think using 64 bit calculations on systems that have only a 32 bit or less timer register is probably overkill. :-) However, to date,AFAIK, no processor has exceeded the u32 in ticks per second. As I pointed out in a previous e-mail, if they ever do this, we can just drop one or 2 bits off the 64 bit counter and in millisecond resolution, nobody will ever know. Also as previously pointed out, usec2ticks is not present yet in lots of implementations. Also, if the fundamental clock frequency is 32 kHz (or anything less than 1 MHz), usec2ticks is 0! This probably rules out using it to get ticks per millisec or ticks per sec. Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 10:53 AM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4dde8639.3090...@comcast.net> you wrote: > >> I think it is the task of get_ticks to return the hardware tick counter >> as an increasing counter, period. The counter may wrap at some final >> count that is not all ones. That is ok. Sync_timebase deals with the > NO! We want to be able to compute time differences using simple > unsigned arithmentics, even after a rollover of the counter. For this > it is mandatory that the counter always gets only incremented until it > wraps around at te end of it's number range, and never gets reset Hi All, I agree that that is what must happen, but it should happen inside sync_timebase. Sync_timebase does what is needed to convert the less-than-fully capable counters into a fully capable one. You could move that functionality down into get_ticks, but if you do, you end up much like the present scheme where the multiple different get_ticks routines expected to cope with expanding the hardware counter properly. To date, it has been shown conclusively that this process cannot be relied upon, or we wouldn't be having this discussion. If we put that functionality inside sync_timebase, it is in one place and debuggable once. All sync_timebase requires to do this is ticks per second and maximum tick value. I do request that counters that decrement be negated in the get_ticks routine, but beyond that, it should be a simple read of the tick register(s). Best Regards, Bill Campbell > before. > >> You will have to call the routine that initializes sync_timebase. This >> routine should have a name, like void init_sync_timebase(void)? > init_timebase(). > > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details
On 5/26/2011 6:27 AM, Graeme Russ wrote: > Hello Everyone, > > OK - Starting a new thread to discuss implementation details. This is a > heads-up for arch/platform maintainers - Once this is a bit more stable, I > will put it on the wiki > > Assumed Capabilities of the Platform > - Has a 'tick counter' that does not rely on software to increment Hi All, The nios2 with the most basic timer does not meet this requirement. It will not count at all without the 10 ms interrupt. I don't think this requirement matters anyway. We need a 'tick counter' that 'ticks'. If it takes software to make it tick, we don't much care. There may be problems with early use of udelay in that case, but that is a different issue. > - tick interval may by a fixed constant which cannot be controlled > via software, or it could be programmable (PIT) > > API Functions (/lib/timer.c) > - u32 get_timer(u32 start) > - Returns the number of elapsed milliseconds since 'start' > > API Functions (/arch/...) > - void udelay(u32 delay) > - Used for 'short' delays (generally up to several seconds) > - Can use the tick counter if it is fast enough > - MUST NOT RESET THE TICK COUNTER There is a requirement that udelay be available before relocation and before the BSS is available. One can use the tick counter to provide udelay as long as sync_timebase is not called OR sync timebase does not use BSS. It appears many implementations ignore this requirement at present. We should try to fix this, but is should not be a requirement. > 'Helper' Functions (/lib/timer.c) I think this function should be weak, so that it is possible for people to override it with a "custom" function. The fully general sync_timebase has lots of code in it that can be simplified in special cases. We want and need a fully general function to be available, but other users who are real tight on space may want a cut down version. We should make that easily possible. > - void sync_timebase(void) > - Updates the millisecond timer > - Utilises HAL functions to access the platform's tick counter > - Must be called more often than the rollover period of the >platform's tick counter > - Does not need to be called with a regular frequency (immune >to interrupt skew) > - Is always called by get_timer() > - For platforms with short tick counter rollovers it should >be called by an ISR > - Bill Campbell wrote a good example which proved this can be common >and arbitrary (and optionally free of divides and capable of >maintaining accuracy even if the tick frequency is not an even >division of 1ms) > > HAL Functions (/arch/... or /board/...) > - u64 get_ticks(void) For what it's worth, I would like to propose using a (gasp!) typedef here. It seems to me there are a whole lot of cases where the max number of ticks is a u32 or less. In those cases, the wrap at 32 bits helps things a lot. If the tick counter is really 64 bits, the function of sync_timebase is simply to convert the tick value to millisec, and that is it. Otherwise, if it is 32 bits or less then some other actions will be required. I think this is one of those times where a typedef would help, We could define a type called timer_tick_t to describe this quantity. That would allow a pure 32 bit implementation where appropriate. Another suggestion is that perhaps we want a u32 get_ticks_lsb(void) as well as a regular get_ticks. The lsb version would be used for udelay and could possibly come from another timer if that was necessary/desirable. See the requirement for early udelay early availability. > - Returns a tick count as an unsigned 64-bit integer > - Abstracts the implementation of the platform tick counter >(platform may by 32-bit 3MHz counter, might be a 16-bit >0-999 microsecond plus 16-bit 0-65535 millisecond etc) > - u64 ticks_per_millisecond() > - Returns the number of ticks (as returned by get_ticks()) per >millisecond I think ticks_per_sec would be a better choice. First, such a function already exists in almost all u-boots. Second, if one wants the best accuracy for things like udelay, you need better accuracy than millisec. Since this function is used only infrequently, when things are initialized, a divide to get ticks_per_millsec (if that is what you really want) is no big deal. Lastly, I think this function can remain u32. Yes, there is a 4 GHz limit on the clock rate. If this ever becomes an issue, we can change the type to timer_tick_t. When the CPU clock rate gets quite high, it is an advantage to divide it down for performance measurement anyway. The AMD/Intel chips already do this. If the hardware doesn't do it, shift the timer value right two bits. I doubt you will miss the 0.4 nanoseconds resolution loss from your 10 GHz timestamp. > - void timer_isr() > - Optional (particularly if tick counter
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/25/2011 9:40 PM, Graeme Russ wrote: > On Thu, May 26, 2011 at 2:19 PM, Reinhard Meyer > wrote: >> Dear Graeme Russ, >>> On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2) >>> do. at91 is odd - It looks like it uses interrupts, but get_timer() and >>> udelay() both end up calling get_timer_raw() (with udelay only having >>> millisecond resolution it seems). Some others can be configured to >>> increment the timer using an interrupt. ARM is, quite frankly, a complete >>> mess - It has a mass of *_timer_masked() functions which the core timer >>> functions are 'wafer thin' wrapper around, udelay() silently resets >>> the timebase trashing get_timer() loops etc. >> Please look at current master for at91. >> >> http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/arm926ejs/at91/timer.c;h=a0876879d3907af553d832bea187a062a22b9bd4;hb=5d1ee00b1fe1180503f6dfc10e87a6c6e74778f3 >> >> AT91 uses a 32 bit hardware register that by means of a prescaler is made >> to increment at a rate in the low megahertz range. > Yes, I see that now > >> This results in a wrap approximately every 1000 seconds. >> Actually this would be sufficient for all known uses of udelay() and >> get_timer() >> timeout loops. However, this hardware register is extended to 64 bits by >> software >> every time it is read (by detecting rollovers). > Which makes it 100% compatible with my proposed solution - The software > prescaler will trigger the 64-bit extension and rollover detection > >> Since a wrap of that 64 bit "tick" would occur after the earth has ended, >> it is simple to obtain milliseconds from it by doing a 64 bit division. > Which would be done in the common prescaler in /lib/ > > Currently, most ARM specific utilisations of get_timer() enforce a reset > of the tick counter by calling reset_timer() - Subsequent calls to > get_timer() then assume a start time of zero. Provided the internal timer > rolls over currectly, the initial call of get_timer(0) will reset the ms > timer and remove and 'glitch' present due to not calling the 'extender' > function between 32-bit rollovers which makes the reset_timer() call > unneccessary - I believe at91 behaves correctly in this regard. > > In any case, the underlying assumption made by the ARM timer interface > (call reset_timer() first always) is inherently broken as not all users > of the timer API do this - They assume a sane behaviour of: > > start = get_timer(0); > elapsed_time = get_timer(start); > > Add to this udelay() resetting the timer make the following very broken: > > start = get_timer(0); > while(condition) { > udelay(delay); > } > elapsed_time = get_timer(start); > > NOTE: In this case, if udelay() also calls the prescaler then no interrupt > triggered every 1000s would be required in the above example to get > correct elapsed_time even if the loop ran for several hours (provided > udelay() is called at least every 1000s > > However, to allow timing of independent events with no intervening > udelay() or get_timer() calls, an 1000s interrupt to kick the prescaler is > all that is needed to make this particular implementation behave correctly. Hi All, True, if the processor supports timer interrupts. The problem is that the existing u-boots in many cases do not. I think that is really the crux of the problem. So what are we going to do? I am open to ideas here. Best Regards, Bill Campbell > Of course disabling interruts and not calling get_timer() or udelay() will > break the timer - But there is nothing that can be done about that) > > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/25/2011 9:19 PM, Reinhard Meyer wrote: > Dear Graeme Russ, >> On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2) >> do. at91 is odd - It looks like it uses interrupts, but get_timer() and >> udelay() both end up calling get_timer_raw() (with udelay only having >> millisecond resolution it seems). Some others can be configured to >> increment the timer using an interrupt. ARM is, quite frankly, a >> complete >> mess - It has a mass of *_timer_masked() functions which the core timer >> functions are 'wafer thin' wrapper around, udelay() silently resets >> the timebase trashing get_timer() loops etc. > > Please look at current master for at91. > > http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/arm926ejs/at91/timer.c;h=a0876879d3907af553d832bea187a062a22b9bd4;hb=5d1ee00b1fe1180503f6dfc10e87a6c6e74778f3 > > > > AT91 uses a 32 bit hardware register that by means of a prescaler is made > to increment at a rate in the low megahertz range. > > This results in a wrap approximately every 1000 seconds. > > Actually this would be sufficient for all known uses of udelay() and > get_timer() Hi All Yes, you are correct. It would be sufficient for all known uses of udelay and get_timer(). However, Wolfgang has indicated a strong desire that the performance of the new API work properly over the full 32 bit range of the millisecond delay time. That has been the basic issue for some time here. > timeout loops. However, this hardware register is extended to 64 bits > by software > every time it is read (by detecting rollovers). Yes, but this extension ONLY happens if get_ticks is called via udelay or get_timer. It doesn't happen if you are sitting at the command prompt or off executing a downloaded stand alone program. You might ask "who cares", and I would answer that Wolfgang cares, at least to some level. If the timer overflow triggered an interrupt, we could call get_ticks to update the extended time inside the interrupt routine. But, as far as I know, it doesn't. There are some other ARM processors that have a 32 bit clock derived from a 32 kHz crystal, The will work much as you example does up to 134217 seconds, in fact much longer than your AT91 example does. However, that doesn't touch the 4294967 seconds that the PPC can manage. Without interrupts, the 32 bit (or smaller) counters will NEVER get to the PPC standard if their tick rate exceeds 1 msec. It may be that we need a lower standard, perhaps saying 1000 seconds is enough. But that is not my decision to make. > > Since a wrap of that 64 bit "tick" would occur after the earth has ended, > it is simple to obtain milliseconds from it by doing a 64 bit division. True, but moot because of the above. Best Regards, Bill Campbell > > Best Regards, > Reinhard > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/25/2011 4:13 PM, Graeme Russ wrote: > Hi Wolfgang > > On Thu, May 26, 2011 at 7:16 AM, Wolfgang Denk wrote: >> Dear Graeme Russ, >> >> In message<4ddd7066.4000...@gmail.com> you wrote: No, not at all. And I already answered this. For example on PPC, just reading the timebase would be perfectly sufficient, and simpler and more reliable than the current interrupt based approach. >>> I assume by 'timebase' you mean the 64-bit tick counter. If so, that is >> By timebase I mean the timebase register, implemented as two 32 bit >> registers tbu and tbl, holding the upper and the lower 32 bits of the >> free-running 64 bit counter, respective. > And remember, not all platforms have this implementation. The AMD sc520 > for example has a microsecond register which counts 0-999 that ticks a > 16-bit millisecond register and resets to zero. And the millisecond > register latches the value of the microsecond register and resets > (the millisecond register) back to zero. > > The thing is, this can all be abstracted away via get_tick() which > (provided it is called every 65 seconds or so) can maintain a software > version of the timebase register. So, every 65 seconds, the prescaler > needs to be kicked. Now, if all we want to use get_timer() for is to > monitor a timeout (which I think might be every single use in U-Boot > to date) then the while (get_timer(start)< timeout) loop will work. If > get_timer() is needed to measure time between two arbitrary events (which > I 100% agree it should be able to do) then the prescaler will need to be > kicked (typically by an interrupt) > >>> _exactly_ what I am suggesting we do (and what does already happen on ARM). >> I don't think so. Hi All, Just to be clear, while ARMv7 has a 64 bit performance counter, it is not presently used by get_time. This is a change we want to make correct? > On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2) > do. at91 is odd - It looks like it uses interrupts, but get_timer() and > udelay() both end up calling get_timer_raw() (with udelay only having > millisecond resolution it seems). I am not sure why you say at91 appears to use interrupts. There is a comment in arch/arm/cpu/arm930t/at91/timer.c that says "timer without interrupts" (line 73). There is the same comment in arch/arm/cpu/arm930t/at91rm9200/timer.c Nothing in either routine refers to interrupts, so I would say the timer doesn't use them. I could be wrong of course. > Some others can be configured to > increment the timer using an interrupt. ARM is, quite frankly, a complete > mess - It has a mass of *_timer_masked() functions which the core timer > functions are 'wafer thin' wrapper around, udelay() silently resets > the timebase trashing get_timer() loops etc. I sure agree with this last part. The only arm timer I found that clearly thought it could use interrupts was in arch/arm/cpu/ixp, and that was conditional, not mandatory. > So let's wind back and distill the approach I am suggesting: > > 1) A common prescaler function in /lib/ - It's purpose is to maintain > a 1ms resolution timer (if the platform cannot otherwise do so)[1] > The prescaler utilises a platform provided get_ticks()[2] > 2) A get_ticks() function provided by the platform - This function must > return an unsigned counter which wraps from all 1's to all 0's - It > DOES NOT have to be initialised to zero at system start. get_ticks() > hides the low-level tick counter implementation - The sc520 example > above is a classic example, so is your PPC tbu/tbl example. > 3) [Optional]An ISR which calls the prescaler[3] > > Now there is an optimisation if your tick counter has a 1ms resolution > and is not small (i.e. 64-bits) - The prescaler is defined weak, so in > the platform code, re-implement the prescaler to simply copy the tick > counter to the timer variable. > > And what are the specific implementation types (in decending order of > preference)? I think: > 1) A 64-bit micro-second tick counter[5] >- No interrupts needed >- Can be used by udelay() and get_timer() trivially > 2) A 64-bit sub-micro-second tick counter >- Interrupts most likely undeeded unless the tick frequency is > insanely high >- Can be used by udelay() and get_timer() trivially > 3) A 64-bit milli-second tick counter >- No interrupts needed >- No prescaler needed >- Can be used by get_timer() trivially >- udelay() needs another tick source (if available) or be reduced > to millisecond resolution > 4) A 32-bit milli-second tick counter >- No prescaler needed[6] >- Max 'glitch free' duration is ~50 days >- ISR needed to kick prescaler if events longer than 50 days need > to be timed >- Can be used by get_timer() trivially >- udelay() needs another tick source (if available) or be reduced > to millisecond
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/25/2011 12:46 PM, Wolfgang Denk wrote: > Dear Graeme Russ, > > In message you wrote: >> I hope to get an implementation agreed upon that does not require >> interrupts at all, provided a tick counter with sufficiently long roll >> over time is available (longer than the maximum expected period >> between 'related' get_timer() calls, for example calls to get_timer() >> in a timeout testing while loop). This 'bare minimum' implementation >> can be optionally augmented with an ISR which kicks the timer >> calculation in the background (typically by just calling get_timer()) >> >> It really is quite simple in the end. > The length of this thread shows that it is not as simple as you want > to make us believe. > > > > If all you have in mind are timeouts, then we don't need get_timer() > at all. We can implement all types of timeout where we wait for some > eent to happen using udelay() alone. > > Need a 10 second timeout? Here we go: > > int cnt = 0; > int limit = 10 * 1000; > while (!condition) { > usleep(1000); /* wait 1 millisec */ > if (++cnt> limit) > break; > } > if (cnt> limit) > error("timeout"); > > get_timer() comes into play only if we want to calculate a time > difference, for example if we want to run some code where we don't > know how long it runs, and later come back and check if a certain > amount of time has passed. > > When we don't know how long this code runs, we also cannot know (and > espeically not in advance) wether or not this time will be longer or > not than the rollover time of the tick counter. > > > Your plan to require that get_timer() gets called "often enough" to > prevent or detect tick counter overflows is putting things on their > head. It should be the opposite: The implementation of get_timer() > should be such that it becomes independent from such low level > details. > HI all, We also cannot know if this code disables interrupts. If it does, the existing PPC design is broken. If interrupts are disabled during the entire code being executed, the elapsed run time will be 0. You can say that disabling interrupts in the code is not allowed. Fine, then that becomes a constraint on the code. Calling get_timer explicitly often enough is also a constraint on the code. One constraint is acceptable to you, the other is not. Ok, that's fine too. If the interrupt routine calls get_timer, then get_timer is called "often enough" and everything works the same as it presently does on PPC, It is not different than requiring the interrupt routine to be executed "often enough". The two methods are functionally identical. The only problems arise on systems that don't support interrupts and don't have any timers with enough bits available to meet the 4294967 seconds maximum interval requirement. Those systems will be "broken" no matter what we do, as we have all agreed. Right now, almost all ARM cases are broken, because they have short timers and don't use interrupts. In some cases, there are actual bugs involved. We can make these cases less broken than they now are with a common get_timer approach as outlined previously. However, we cannot fix them to the standard Wolfgang is stating here, to be not "broken". So, back to what I was asking before, is the improvement worth the effort if the result is still broken? Best Regards, Bill Campbell > I have stated this before: I consider any design that requires > get_timer() to be called "often enough" broken. > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/24/2011 10:17 PM, Wolfgang Denk wrote: Dear "J. William Campbell", In message<4ddc31eb.6040...@comcast.net> you wrote: ... A tick is defined as the smallest increment of system time as measured by a computer system (seehttp://en.wikipedia.org/wiki/System_time): System time is measured by a system clock, which is typically implemented as a simple count of the number of ticks that have transpired since some arbitrary starting date, called the epoch. Unfortunately, this definition is obsolete, and has been for quite some Do you have any proof for such a claim? Hi Wolfgang, Well, yes, in fact the same reference you used. Note that the statement "A tick is defined as the smallest increment of system time as measured by a computer system" is NOT found in http://en.wikipedia.org/wiki/System_time. That page is defining system time, and what we are discussing is the definition of a "tick". In fact, on the same wiki page you cite, there IS the statement "Windows NT <http://en.wikipedia.org/wiki/Windows_NT> counts the number of 100-nanosecond ticks since 1 January 1601 00:00:00 UT as reckoned in the proleptic Gregorian calendar <http://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar>, but returns the current time to the nearest millisecond". Here 100 nanosecond ticks clearly does not refer to any "hardware" 100 ns clock that exists on the pc. The 100 ns is a computed (dare I say virtual) "tick" value. The point here is that the definition of tick is yours, not wikipedia.org's. (Although we all are aware if it is wikipedia, it MUST be so). Further, http://en.wikipedia.org/wiki/Jiffy_%28time%29#Use_in_computing contains the statement "In computing <http://en.wikipedia.org/wiki/Computing>, a jiffy is the duration of one tick of the system timer <http://en.wikipedia.org/wiki/System_time> interrupt <http://en.wikipedia.org/wiki/Interrupt>.". If a tick is the smallest increment of system time as measured by the computer system, the "of the system timer interrupt" part of the statement would be unnecessary. The fact it IS present indicates there are other kinds of ticks present in the universe. AFAIK, all timers "tick", and the definition of the tick rate is 1/timer resolution. The concept of "timer ticks" and "clock ticks" has been around forever, and exists independent of System Time. For example, there may be a watchdog timer on a system that does not measure time at all, yet the watchdog still ticks. When you read a value from a timer that was not the "system timer", what do you call the value you read? I would call it ticks, and I bet just about everybody else would too. The only reason I feel this point matters at all is that when one refers to a routine called "get_ticks", it is not obvious to me which timer ticks are being referenced. You are saying that, by definition, it refers to the "system clock". My feeling is that it is not obvious why that is so on a system that has many clocks. The name of the function or an argument to the function should, IMNSHO, specify which timer ticks are being returned. Best Regards, Bill Campbell years. When computers had a single timer, the above definition worked, but it no longer does, as many (most?) computers have several hardware timers. A "tick" today is the time increment of any particular timer of a computer system. So, when one writes a function called get_ticks on a PPC, does one mean read the decrementer, or does one read the RTC or does one read the TB register(s) A similar situation exists on the Blackfin BF531/2/3, that has a preformance counter, a real-time clock, and three programmable timers. Which tick do you want? For each u-boot Please re-read the definition. At least as far as U-Boot and Linux are concerned, there is only a single clock source used to implement the _system_time_. And I doubt that other OS do different. implementation, we can pick one timer as the "master" timer, but it may not be the one with the most rapid tick rate. It may be the one with the most USEFUL tick rate for get_timer. If you take the above definition at face value, only the fastest counter value has ticks, and all other counters time increments are not ticks. If they are not ticks, what are they? Clocks, timers? Best regards, Wolfgang Denk ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/24/2011 5:17 PM, Graeme Russ wrote: > On Wed, May 25, 2011 at 5:19 AM, Wolfgang Denk wrote: >> Dear Graeme Russ, >> >> In message<4ddbe22d.6050...@gmail.com> you wrote: > Why must get_timer() be used to perform "meaningful time measurement?" Excellent question! It was never intended to be used as such. >>> Because get_timer() as it currently stands can as it is assumed to return >>> milliseconds >> Yes, but without any guarantee for accuracy or resolution. >> This is good enough for timeouts, but nothing for time measurements. > Out of curiosity, are there any platforms that do not use their most > accurate source(*) as the timebase for get_timer()? If a platform is using > it's most accurate, commonly available, source for get_timer() the the > whole accuracy argument is moot - You can't get any better anyway so > why sweat the details. Hi All, Well, it is not quite that simple. The "accuracy" of the 1 ms interrupt rate is controlled in all cases I know about by the resolution of the programmable divider used to produce it. It appears that the x86 uses a 1.19318 MHz crystal oscillator to produce the nominal 1 ms timer tick. (There is a typo in line 30 of arch/x86/lib/pcat_timer.c that says 1.9318. I couldn't make any of the numbers work until I figured this out). The tick is produced by dividing the 1.19318 rate999.313 by 1194, which produces an interrupt rate of 999.3 Hz, or about 0.068% error. However, the performance counter on an x86 is as exact as the crystal frequency of the CPU is. FWIW, you can read the performance counter with rdtsc on a 386/486 and the CYCLES and CYCLES2 registers on later Intel/AMD chips. So yes, there is at least one example of a cpu that does not use it's most accurate (or highest resolution) time source. > (*)I'm actually referring to what is commonly available for that platform, > and not where a board has a high precision/accuracy source in addition to > the common source. > > As a followup question, how many platforms use two completely independent > sources for udelay() and get_timer() - x86 does, but I plan to change this > so the interrupt kicks the new prescaler which can be done at>> 1ms period > and udelay() and get_timer() will use the same tick source and therefore > have equivalent accuracy. Are you sure of this? From what I see in arch/x86/lib/pcat_timer.c, the timer 0 is programmed to produce the 1 kHz rate timer tick and is also read repeatedly in __udelay to produce the delay value. They even preserve the 1194 inaccuracy, for some strange reason. I see that the sc520 does appear to use different timers for the interrupt source, and it would appear that it may be "exact", but I don't know what the input to the prescaler is so I can't be sure. Is the input to the prescaler really 8.3 MHz exactly? Also, is the same crystal used for the input to the prescaler counter and the "software timer millisecond count". If not, then we may have different accuracies in this case as well. Also of note, it appears that the pcat_timer.c, udelay is not available intil interrupts are enabled. That is technically non-compliant, although it obviously seems not to matter. Best Regards, Bill Campbell >>> OK, let's wind back - My original suggestion made no claim towards changing >>> what the API is used for, or how it looks to those who use it (for all >>> practical intents and purposes). I suggested: >>> - Removing set_timer() and reset_timer() >>> - Implement get_timer() as a platform independent function >> Trying to remember what I have read in this thread I believe we have >> an agreement on these. >> >>> Exposing ticks and tick_frequency to everyone via a 'tick' HAL >> I skip this. I don't even read it. > Hmmm, I think it is worthwhile at least comparing the two - What is the > lesser of two evils > > 1. Exposing 'ticks' through a HAL for the prescaler > 2. Duplicating a function with identical code 50+ times across the source > tree > > I personally think #2 is way worse - The massive redundant duplication and > blind copying of code is what has get us into this (and many other) messes > >>> === >>> Not exposing ticks and tick_frequency to everyone >>> >>> In /lib/timer.c >>> >>> void prescaler(u32 ticks, u32 tick_frequency) >>> { >>>u32 current_ms; >>> >>>/* Bill's algorithm */ >>> >>>/* result stored in gd->timer_in_ms; */ >>> } >>> >>> In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c >>> >>> static u32 get_ticks(void) >> Currently we have unsigned long long get_ticks(void) which is better >> as it matches existing hardware. > Matches PPC - Does it match every other platform? I know it does not match > the sc520 which has a 16-bit millisecond and a 16-bit microsecond counter > (which only counts to 999 before resetting to 0) > > Don't assume every platform can implement a 64-bit tick counter. But yes, > we should cater for those platforms that can >
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/24/2011 12:19 PM, Wolfgang Denk wrote: > Dear Graeme Russ, > > In message<4ddbe22d.6050...@gmail.com> you wrote: Why must get_timer() be used to perform "meaningful time measurement?" >>> Excellent question! It was never intended to be used as such. >> Because get_timer() as it currently stands can as it is assumed to return >> milliseconds > Yes, but without any guarantee for accuracy or resolution. > This is good enough for timeouts, but nothing for time measurements. > >> OK, let's wind back - My original suggestion made no claim towards changing >> what the API is used for, or how it looks to those who use it (for all >> practical intents and purposes). I suggested: >> - Removing set_timer() and reset_timer() >> - Implement get_timer() as a platform independent function > Trying to remember what I have read in this thread I believe we have > an agreement on these. > >> Exposing ticks and tick_frequency to everyone via a 'tick' HAL > I skip this. I don't even read it. > >> === >> Not exposing ticks and tick_frequency to everyone >> >> In /lib/timer.c >> >> void prescaler(u32 ticks, u32 tick_frequency) >> { >> u32 current_ms; >> >> /* Bill's algorithm */ >> >> /* result stored in gd->timer_in_ms; */ >> } >> >> In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c >> >> static u32 get_ticks(void) > Currently we have unsigned long long get_ticks(void) which is better > as it matches existing hardware. > > Note that we also have void wait_ticks(u32) as needed for udelay(). Hi All, I didn't comment before on the definition of ticks, but I fear that was unwise. The stated definition was: A tick is defined as the smallest increment of system time as measured by a computer system (seehttp://en.wikipedia.org/wiki/System_time): System time is measured by a system clock, which is typically implemented as a simple count of the number of ticks that have transpired since some arbitrary starting date, called the epoch. Unfortunately, this definition is obsolete, and has been for quite some years. When computers had a single timer, the above definition worked, but it no longer does, as many (most?) computers have several hardware timers. A "tick" today is the time increment of any particular timer of a computer system. So, when one writes a function called get_ticks on a PPC, does one mean read the decrementer, or does one read the RTC or does one read the TB register(s) A similar situation exists on the Blackfin BF531/2/3, that has a preformance counter, a real-time clock, and three programmable timers. Which tick do you want? For each u-boot implementation, we can pick one timer as the "master" timer, but it may not be the one with the most rapid tick rate. It may be the one with the most USEFUL tick rate for get_timer. If you take the above definition at face value, only the fastest counter value has ticks, and all other counters time increments are not ticks. If they are not ticks, what are they? This is one of the reasons I favor the performance monitoring system be separate from the get_timer timing methodology, as it will often use a different counter and time base anyway. That is also why I prefer to have a conversion routine that converts timer values to seconds and nano-seconds without reference to "tick" rates, so the user never has to deal with these ambiguities. Yes, "under the hood", somebody does, but that need not be specified in the external interface. )Nobody has yet commented on my proposed performance measuring functions, and I know this group well enough not to assume that silence implies consent! ) The prescaler function defined by Graeme needs to read some timer value. It turns out that a u32 value is the most appropriate value for the get_timer operation. The value desired is usually not of hardware timer tick resolution. It should be the specific bits in the timer, such that the LSB resolution is between 1 ms and 0.5 ms. We use greater resolution that this only when the counter is short enough that we need lower significance bits to make the counter 32 bits. For that reason, the function should probably be call something like u32 read_timer_value(void). I really don't much care what it is called as long as we understand what it does. Best Regards, Bill Campbell >> static u32 get_tick_frequency(void) >> { >> u32 tick_frequency; >> >> /* Determine tick frequency */ >> >> return tick_frequency; >> } > Note that we also have u32 usec2ticks(u32 usec) and u32 ticks2usec(u32 ticks). > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/24/2011 9:51 AM, Graeme Russ wrote: > On 25/05/11 00:19, Wolfgang Denk wrote: > = Hi all, I have a few of questions. First, it seems that the get_timer interface is expected to work properly only after relocation and only when bss is available. I say this because the PPC version uses an (initialized) variable, timestamp, to hold the time. If that is the case, there is no need to hold the timer static data in gd, as I have been doing up to now. Am I correct ? Second, udelay is expected to be available before bss is available, very early in the startup process. There a comments to that effect in several places in the code. Therefore, udelay cannot use global or static variables. Is this true? And third, udelay is only expected/required to work correctly for "short", like say 10 seconds. I say this based on comments contained in powerpr/lib/time.c. Please ack or nak this as well. I have a couple of "small" changes to the previously submitted code, based partly on the above > Exposing ticks and tick_frequency to everyone via a 'tick' HAL > > In /lib/timer.c ulong timestamp = 0; /* This routine is called to initialize the timer after BSS is available */ void __weak prescaler_init(void) { u32 tick_frequency = get_tick_frequency(); /* initialize prescaler variables */ } > void __weak prescaler(void) > { > u32 ticks = get_ticks(); /* Bill's algorithm */ /* result stored in timestamp; */ > } > > u32 get_timer(u32 base) > { #if defined(CONFIG_SYS_NEED_PRESCALER) > prescaler(); #endif > return timestamp - base; > } > > In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c > > u32 get_ticks(void) > { > u32 ticks; > > /* Get ticks from hardware counter */ > > return ticks; > } > > u32 get_tick_frequency(void) > { > u32 tick_frequency; > > /* Determine tick frequency - likely very trivial */ > > return tick_frequency; > } or instead the user may override prescaler_init and prescaler if, for some reason, a highly optimized version is desired. Note also that if the configuration variable CONFIG_SYS_NEED_PRESCALER is not defined, the additional prescaler routines will not be called anywhere so the routines should not be loaded. Yes, it it a #define to manage, but it should allow the existing u-boots to be the same size as before, with no unused code. This size matters to some people a lot! > === > Not exposing ticks and tick_frequency to everyone > > In /lib/timer.c > > void prescaler(u32 ticks, u32 tick_frequency) > { > u32 current_ms; > > /* Bill's algorithm */ > > /* result stored in gd->timer_in_ms; */ > } > > In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c > > static u32 get_ticks(void) > { > u32 ticks; > > /* Get ticks from hardware counter */ > > return ticks; > } > > static u32 get_tick_frequency(void) > { > u32 tick_frequency; > > /* Determine tick frequency */ > > return tick_frequency; > } > > u32 get_timer(u32 base) > { > u32 ticks = get_ticks(); > u32 tick_frequency = get_tick_frequency(); > > prescaler(ticks, tick_frequency); > > return gd->timer_in_ms - base; > } > > === > I personally prefer the first - There is only one implementation of > get_timer() in the entire code and the platform implementer never has to > concern themselves with what the tick counter is used for. If the API gets > extended to include get_timer_in_seconds() there is ZERO impact on > platforms. Using the second method, any new feature would have to be > implemented on all platforms - and we all know how well that works ;) > > And what about those few platforms that are actually capable of generating > a 1ms timebase (either via interrupts or natively in a hardware counter) > without the prescaler? Well, with prescaler() declared weak, all you need > to do in /arch/cpu/soc/timer.c or /arch/cpu/timer.c or > /board//timer.c is: > > For platforms with a 1ms hardware counter: > void prescaler(void /* or u32 ticks, u32 tick_frequency*/) > { > gd->timer_in_ms = get_milliseconds(); > } > > For platforms with a 1ms interrupt source: > void timer_isr(void *unused) > { > gd->timer_in_ms++; > } > > void prescaler(void /* or u32 ticks, u32 tick_frequency*/) > { > } > > > And finally, if the platform supports interrupts but either the hardware > counter has better accuracy than the interrupt generator or the interrupt > generator cannot generate 1ms interrupts, configure the interrupt generator > to fire at any rate better than the tick counter rollover listed in > previous post and: > > void timer_isr(void *unused) > { > /* >* We are here to stop the tick counter rolling over. All we >* need to do is kick the prescaler - get_timer() does that :) >*/ > get_timer(0); > } > > In summary, platform s
Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/
On 5/23/2011 10:13 PM, Graeme Russ wrote: > On Tue, May 24, 2011 at 7:02 AM, Graeme Russ wrote: > > [snip] >> Well, we have no control over the argument in cfi driver (unless you plan >> to put #ifdef NIOS all over the place) >> >> Maybe we could round up the parameter inside get_timer() itself? > Wow, what was I on! - Oh, thats right, no coffee ;) > > The parameter to get_timer() is not a timeout, it is a reference epoch Hi Graeme, No, you were not on drugs. If base != 0 then { if (timer - base < 20) return 0; return timer - base -10} return timer. That will make the subsequent comparison do what you wanted. Differences of less than 20 are unreliable because of quantization, so we return 0. If the difference was > 20, return a conservative number, the difference -10. If the input base was 0, return the timer, the user is just trying to read it. Best Regards Bill Campbell > So I think adding reset_timer() to the NIOS get_timer() when the parameter > is zero is the only option > > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/24/2011 7:12 AM, Wolfgang Denk wrote: > Dear Albert ARIBAUD, > > In message<4ddb4c1c.7030...@aribaud.net> you wrote: >> Not sure I still follow what the two options are -- a heads up is welcome. >> However, I do like the simplicity in having a single time unit (ticks) >> for the timer API -- asuming it covers all needs -- and providing other >> time units only as helper functions. > I don't think using ticks is a good idea. You would need to change > all definitiuons of timeouts and delays and such. > > Why not using a scaled unit like microsecods or the currently used > milliseconds? > > I wonder why we suddenly have to change everything that has been > working fine for more than a decade (ignoring the large number of > incorrect, incomplete or broken implementations). But so far we > really never needed anything else but udelay() for the typical short > device related timeouts, and get_time() for longer, often protocol > defined, timeouts. > > Is there anything wrong with these two solutions, based on standard > units (us and ms) ? Hi All, After really looking into this, I think I agree with Wolfgang that using ms for a get_timer timebase is the best way to go. This thinking is heavily influenced (in my case anyway) by the fact that in the interrupt driven cases (and these are the ONLY fully compliant cases ATM I think), the "cost" of using ms is 0, because that is the "native" unit in which the timer ticks. This makes everything real simple. We can, right today, produce an API that supports u32 get_time_ms(void) for all CPUs in use. This would allow u32 get_timer(u32 base) to continue to exist as-is. These implementations would still be technically "broken" in the non-interrupt case, but they would work at least as well as they presently do. In fact, they would operate better because they would all use a single routine, not a bunch of different routines (some of which I am pretty sure have errors). Wolfgang would need to accept the fact that we are not yet "fixing" all the non-interrupt cases. This needs to be done, but is a different problem (I hope). In the non-interrupt case there is some cost in converting to ms from the native units, but we are in a timout loop, so a few extra instructions do not matter much. It is only a few 32 bit multiplies, which these days are real fast. If we can figure out how to use interrupts eventually, even this will go away. Then, we can ADD an optional performance measuring API like Scott suggests. This API would be something similar to the following: struct ticks { u32tickmsb; u32 ticklsb; } ; struct time_value { u32 seconds; u 32 nano_seconds; }; and the functions would be get_ticks(struct ticks * p) , get_tick_delta(struct ticks * minuend, struct ticks * subtrahend), and cvt_tick_delta(struct time_value * result, struct ticks *input). I didn't use u64 on purpose, because I don't want any un-necessary functions pulled from the library. get_ticks reads a "hi precision" counter. How high the precision is depends on the hardware. get_tick_delta subtracts two tick values, leaving the difference in the first operand. Yes, this is may be simple to do in open code but it is better to hide the details. (What if msb is seconds and lsb is nanoseconds, then it is not so simple). cvt_tick_delta converts from ticks to seconds and nano_seconds. We also need a u32 get_tick_resolution, which would return the tick resolution in ns. The user never needs to do any arithmetic on ticks, so that makes his life much easier. However, the user may want to know the resolution of these measurements. All these functions are quite fast except possibly cvt_tick_delta, which is only needed for printing anyway. If the hardware has a performance monitoring counter, implementing these functions is quite simple. The PPC and the x86 certainly do (well, any x86 pentium and above anyway). This is a whole lot of chips off our plate. In cases where no such counter exists, we will use whatever counters there are available already. Some of the ARM counters are running at 32 kHz, so their resolution won't be great, but it is what it is. If we find out that some of these other CPUs have a performance counter, we will use it. This API will be completely optional in u-boot and can be removed by changing a #define. Thoughts welcome. Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/
On 5/23/2011 2:02 PM, Graeme Russ wrote: > On 24/05/11 06:49, J. William Campbell wrote: >> On 5/23/2011 1:10 PM, Graeme Russ wrote: >>> On 24/05/11 04:29, Scott McNutt wrote: >>>> Hi Bill, >>>> >>>> J. William Campbell wrote: >>>>> On 5/23/2011 6:12 AM, Scott McNutt wrote: >>>>>> Dear Graeme, >>>>>> >>>>>> Graeme Russ wrote: >>>>>>> On 23/05/11 22:19, Scott McNutt wrote: >>>>>>>> Hi Graeme, >>>>>>>> >>>>>>>> Graeme Russ wrote: >>>>>>>>> There is no need to use get_timer() and reset_timer() and there are >>>>>>>>> build >>>>>>>> I must have missed something WRT reset_timer() -- my apologies >>>>>>>> if I'm covering old ground. >>>>>>>> >>>>>>>> When the timestamp is incremented using an interrupt that occurs with >>>>>>>> a period greater than 1 ms, we can get early timeouts. reset_timer() >>>>>>>> solved the problem. What's the recommended approach for dealing with >>>>>>>> this without reset_timer() ? >>>>> Hi Scott, >>>>> Are you saying that the interrupt frequency is greater than >>>>> 1000 times per second, or as I read it, the frequency is less than 1000 >>>>> per second (period greater than 1 ms). If anything, that should make the >>>>> timer run slow, not fast. >>>>>I wonder if it is a resolution issue. What are the typical delays in ms >>>>> you are using? >>>> Some older nios2 implementations have _fixed_ 10 msec timers. >>>> Basically, the timestamp is incremented asynchronous to get_timer(0). >>>> So a 10 msec timeout can occur, for example, almost immediately if >>>> the timer isn't reset just prior to calling get_timer(0). There are >>>> more details in the comments for the following commits: >>>> >>>> nios2: Reload timer count in reset_timer(): >>>> d8bc0a2889700ba063598de6d4e7d135360b537e >>>> >>>> cfi_flash: reset timer in flash status check: >>>> 22d6c8faac4e9fa43232b0cf4da427ec14d72ad3 >>>> >>>> I'm totally in favor of cleaning this stuff up. It caused some >>>> headaches (and wasted time) about 13 months ago. My primary concern >>>> is to avoid breaking things that currently work for us nios2 >>>> weenies ... at least for any length of time. >>>> >>>> Things are a bit tight for me until next week or so. I'll probably >>>> come up for air around June 1st ... and I'll be glad to help out. >>>> >>> Is there any reason why we cannot silently perform a reset_timer() any time >>> set_timer() is called with a parameter of 0? >> Hi All, >> I assume you mean get_timer(0)? In principle, you cannot do this > Yes - it's early, no coffee yet ;) >> because it could be inside another get_timer(0) loop that has already some >> time elapsed before you hit the inner get_timer(0). I think what needs to > Correct, but that is what is already happening for ALL arches in cfi due to > the reset_timer() before get_timer(0) - I am suggesting sandboxing the > problem to NIOS until we sort out the timer API properly > >> happen on the old NIOS with 10 ms resolution on the interrupt times is that >> all timer intervals must have 10 ms added and then rounded up to the >> nearest multiple of 10. Thus, if you wanted to wait for 1 millisecond, you >> must use an argument of 20 ms to be sure you wait at all! If you use an >> argument of 10, it won't help because you could get an interrupt right away >> and exit. If these routines are nios2 specific, you could add a local >> reset_timer, but I assume they are generic. . Note that if these routines >> are not nios2 specific, is there any harm in waiting "too long"? > Well, we have no control over the argument in cfi driver (unless you plan > to put #ifdef NIOS all over the place) > > Maybe we could round up the parameter inside get_timer() itself? Hi All, That would probably be the best way to go for now. It might slow things down a bit though, if these delays are all desired to be "short", like 1 ms. We would expand the 1 ms delay to 15 ms (average) while the current (illegal) solution would expand a 1 ms delay to 10 ms always. It is worth trying I think. It is also true that any other delays in the program will suffer from the 10 ms resolution problem, so your idea is I think a good one. Best Regards, Bill Campbell > Regards, > > Graeme > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/
On 5/23/2011 1:10 PM, Graeme Russ wrote: > On 24/05/11 04:29, Scott McNutt wrote: >> Hi Bill, >> >> J. William Campbell wrote: >>> On 5/23/2011 6:12 AM, Scott McNutt wrote: >>>> Dear Graeme, >>>> >>>> Graeme Russ wrote: >>>>> On 23/05/11 22:19, Scott McNutt wrote: >>>>>> Hi Graeme, >>>>>> >>>>>> Graeme Russ wrote: >>>>>>> There is no need to use get_timer() and reset_timer() and there are >>>>>>> build >>>>>> I must have missed something WRT reset_timer() -- my apologies >>>>>> if I'm covering old ground. >>>>>> >>>>>> When the timestamp is incremented using an interrupt that occurs with >>>>>> a period greater than 1 ms, we can get early timeouts. reset_timer() >>>>>> solved the problem. What's the recommended approach for dealing with >>>>>> this without reset_timer() ? >>> Hi Scott, >>>Are you saying that the interrupt frequency is greater than >>> 1000 times per second, or as I read it, the frequency is less than 1000 >>> per second (period greater than 1 ms). If anything, that should make the >>> timer run slow, not fast. >>> I wonder if it is a resolution issue. What are the typical delays in ms >>> you are using? >> Some older nios2 implementations have _fixed_ 10 msec timers. >> Basically, the timestamp is incremented asynchronous to get_timer(0). >> So a 10 msec timeout can occur, for example, almost immediately if >> the timer isn't reset just prior to calling get_timer(0). There are >> more details in the comments for the following commits: >> >> nios2: Reload timer count in reset_timer(): >>d8bc0a2889700ba063598de6d4e7d135360b537e >> >> cfi_flash: reset timer in flash status check: >>22d6c8faac4e9fa43232b0cf4da427ec14d72ad3 >> >> I'm totally in favor of cleaning this stuff up. It caused some >> headaches (and wasted time) about 13 months ago. My primary concern >> is to avoid breaking things that currently work for us nios2 >> weenies ... at least for any length of time. >> >> Things are a bit tight for me until next week or so. I'll probably >> come up for air around June 1st ... and I'll be glad to help out. >> > Is there any reason why we cannot silently perform a reset_timer() any time > set_timer() is called with a parameter of 0? Hi All, I assume you mean get_timer(0)? In principle, you cannot do this because it could be inside another get_timer(0) loop that has already some time elapsed before you hit the inner get_timer(0). I think what needs to happen on the old NIOS with 10 ms resolution on the interrupt times is that all timer intervals must have 10 ms added and then rounded up to the nearest multiple of 10. Thus, if you wanted to wait for 1 millisecond, you must use an argument of 20 ms to be sure you wait at all! If you use an argument of 10, it won't help because you could get an interrupt right away and exit. If these routines are nios2 specific, you could add a local reset_timer, but I assume they are generic. . Note that if these routines are not nios2 specific, is there any harm in waiting "too long"? Best Regards, Bill Campbell > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/23/2011 12:33 PM, Wolfgang Denk wrote: Dear "J. William Campbell", In message<4ddaa705.1040...@comcast.net> you wrote: My apologies for being a little (perhaps more than a little) dense. As they say, "after further review", I think the key aspect of the PPC timer system is that it uses the decrementer register to generate an interrupt at a 1 KHz rate. What I have been attempting here is to produce a timer system that does not use interrupts at all. This is a fundamental design question. Naturally, systems that can generate No, it is not. It is an implementation detail which is irrelevant to almost all users of U-Boot. Or do you actucally care if your UART driver uses polling or interrupts? Hi All, I might care a lot if I expect typeahead to work, or if I am sending a command script via a terminal emulator and I don't want to loose characters while u-boot is off executing a command. One might (correctly) say that that is too much to expect of a boot loader, and define polling as "good enough", which I advocate, or not. YMMV. an interrupt at a 1 KHz rate (or at any (reasonable) higher rate for that matter) using the decrementer register can produce a 1 ms resolution software counter that updates "by magic". If my understanding of this PPC code is incorrect, somebody please stop me before I make a further fool of myself! Is it then a design requirement that the timer system use interrupts? Is that what is meant by using the PPC system as No, it is not a design requirement. It is just one possible implementation. Any other method that achieves the same or similar results is as good. As noted before, on PowerPC we could have probably avoided this and just base all timer services on the timebase register. [The reason for this dual implementation is historical. When I wrote this code, I did not know if we would ever need any fancy timer- controlled callbacks or similar. And I needed to implement interrupt handling for a few other purposes (for example for use in standalone applications; this was an explicit requirement at that time). And the timer was something that made a good and simple example.] a model? If so, is it possible/reasonable on all the u-boots that are out there to generate and process timer interrupts at some (hopefully but not necessarily) programmable rate? I consider this an implementation detail. On all architectures it should be possible to use interrupts, so if the hardware supports a timer that can generate interrupts it should be possible to use this. But it is not a requirement that all implementations must work like Ok, this is very nice to understand. I will attempt to summarize what I think this email and the previous one means. First, the required properties of the get_timer routine. 1. It must have 1 millisecond resolution and accuracy (more or less). For instance, the old NIOS timer that incremented the timestamp by 10 every 10 milliseconds in response to an interrupt is not compliant. 2. The get_timer routine must have full period accuracy without any assumptions regarding what is going on in u-boot. This period is 4294967 seconds or so. I then suggest that the minimum system requirements to support the u-boot timer are as follows: * Either there exists a free-running timer whose period is > 4294967 and whose resolution is 1 millisecond or better. This probably includes all 64 bit timestamp counters. * Or there exists a method of generating interrupts at a known rate of 1 millisecond or faster. This is a superset of the current PPC method. * Or there exists a method of generating interrupts at a known fixed rate slower than once a millisecond AND there exists a readable free running counter whose period is longer or the same as the interrupt rate AND whose resolution is at least 1 ms. This would include N bit counters that generate an interrupt when they overflow, or some long timestamp counter with another, possibly unrelated interrupt generating methodology that is faster than the counter overflow interval. There are many systems that are able to do all three cases, or two of three, so they have a choice on how to implement get_timer(). I claim that these are sufficient conditions. I also claim they are necessary. If a hardware system does not meet these criteria, I claim it can't meet the get_timer requirements. Do such systems exist today that use u-boot? I think probably so, but maybe they are corner cases. Note that you cannot extend a 32 bit counter to a 64 bit counter reliably without using interrupts to do so when get_timer is not being called. Systems using the first approach above have essentially all their logic in the get_timer routine, but MUST use at least some 64 bit arithmetic (actually 33 bit arithmetic if you don't count shifts) to do
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/23/2011 6:19 AM, Wolfgang Denk wrote: > Dear Graeme Russ, > This is what PPC is doing. And I understand that Reinhard did the same > in software for AT91. Hi All, My apologies for being a little (perhaps more than a little) dense. As they say, "after further review", I think the key aspect of the PPC timer system is that it uses the decrementer register to generate an interrupt at a 1 KHz rate. What I have been attempting here is to produce a timer system that does not use interrupts at all. This is a fundamental design question. Naturally, systems that can generate an interrupt at a 1 KHz rate (or at any (reasonable) higher rate for that matter) using the decrementer register can produce a 1 ms resolution software counter that updates "by magic". If my understanding of this PPC code is incorrect, somebody please stop me before I make a further fool of myself! Is it then a design requirement that the timer system use interrupts? Is that what is meant by using the PPC system as a model? If so, is it possible/reasonable on all the u-boots that are out there to generate and process timer interrupts at some (hopefully but not necessarily) programmable rate? Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/23/2011 6:19 AM, Wolfgang Denk wrote: Dear Graeme Russ, In message<4dda5334.4060...@gmail.com> you wrote: - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks() and get_tick_rate() to correctly maintain the ms counter used by get_timer() - This function can be weak (so next point) Ditto. What would that do? If it gets milliseconds as the name suggest, that's already the function needed for get_timer()? OK, there appears to be a consensus that not all hardware actually supports a free-running timer with 1ms resolution. To overcome this, the idea is to Indeed. I guess most of them do not. create a common library function which maintains the free running counter. The library function acts as a pre-scaler using a 'raw tick counter' and a 'raw tick rate' supplied by the low level architecture. We define this weak What are "raw" ticks? And what are "cooked" ticks, then? Hi all, FWIW, "cooked" ticks would be 1 ms ticks, although we never really use them as such. so that if the architecture can provide a free running 1ms counter, there is no (code size) penalty Why do we need a free running 1ms counter at all? Any free running counter of at least millisecoind resolution should be good enough. Correct. Any free running counter whose resolution is better than one millisecond and which is "long enough" that it will not overflow between calls to get_timer is sufficient. This approach eliminates all the existing per-arch code which (attempts) to manage the time base behind get time. So we simplify each arch down to it's bare essentials - Provide a counter which increments at a natural fixed rate and what the rate is - Let common library code deal with the rest. Did you have a look at the PowerPC implementation? I'd like to see this used as reference. I have looked at it as a reference. However, there is one disadvantage in using the PPC code as a reference. It has a 64 bit timestamp. Many systems do not have a 64 bit timestamp, but rather a 32 bit timestamp. It is possible to extend the 32 bit timestamp to a 64 bit timestamp if get_timer is called often enough that there will only be a single rollover of the bottom 32 bits between uses. However, if that condition is met, there is no need to extend the timer to 64 bits. Instead, just convert the elapsed time since the last call (which you know) to ms and be done with it. As Wolfgang said above, any counter that has better than 1 ms resolution is adequate to the task. The additional requirement that we have stated is that the counter be long enough that it does not overflow between calls to get_timer. If the counter is 64 bits long, it pretty much for sure meets this requirement (although bits below 0.5 ms resolution really don't help any). If the timer is 32 bits long, it will meet any requirements using get_timer to time out hardware intervals. My original implementation used a 32 bit divide and does exactly that. This is the shortest and simplest approach, and we can get that working in all cases quite easily I think. We can avoid the "no divide" optimization until everybody is satisfied with what we have. - Calling of get_raw_ticks() regularly in the main loop (how ofter will depend on the raw tick rate, but I image it will never be necessary to call more often than once every few minutes) NAK. This concept is fundamentally broken. I will not accept it. Some existing timers are fundamentally broken - The glitch at the 0x to 0x rollover or rollover early - The method discussed in this thread eliminates all such glitches. Provided pre-scaler in /lib/ (triggered by get_timer() usually) is called often enough (71 minutes for a 32-bit 1MHz counter) then there is no need. Even then, it is only important We already have this nightmare of code for triggering the watchdog on systems that use it. Assuming there are places in the main loop that get executed often enough is a broken concept, and I will not accept any such code. That is fine with me. The reason this was being done was to attempt to emulate, as much as possible, the power PC, where the 64 bit timestamp counter allows calls to get_timer separated by many minutes and several console commands to work properly. These get timer commands will NOT work properly on systems that have a 32 bit counter that overflows every 200 seconds or so. The call in the idle loop was an attempt to make the 32 bit systems work more like the 64 bit systems. One may then either 1. Define calls to get_timer to measure an elapsed interval separated by any returns to the command processor as broken. 2. Require the use of interrupts to extend the 32 bit timestamp. (This may not be possible on all systems as the timer used for performance monitoring does not interrupt, etc.) 3. Allow the call in the idle loop under the assumption that we are talking about timing in the minutes range, not a few
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/22/2011 11:29 PM, Albert ARIBAUD wrote: Hi all, Sorry, could not follow the discussion although I find it very interesting, so I will handle the task of coming in late and asking the silly questions. I am glad you are looking at our discussion. I am sure we are going to need all the help/oversight/questions that we can get, as this is a change that will affect all architectures. Le 23/05/2011 07:25, Graeme Russ a écrit : On Mon, May 23, 2011 at 3:02 PM, J. William Campbell wrote: On 5/22/2011 6:42 PM, Graeme Russ wrote: OK, so in summary, we can (in theory) have: - A timer API in /lib/ with a single u32 get_timer(u32 base) function - A HAL with two functions - u32 get_raw_ticks() - u32 get_raw_tick_rate() which returns the tick rate in kHz (which max's out at just after 4GHz) - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks() and get_tick_rate() to correctly maintain the ms counter used by get_timer() - This function can be weak (so next point) - If the hardware supports a native 32-bit ms counter, get_raw_ms() can be overridden to simply return the hardware counter. In this case, get_raw_ticks() would return 1 Are you sure you did not mean 'get_raw_ticks_rate' here? Besides, I'd like the name to specify the units used: 'get_raw_ticks_rate_in_khz' (or conversively 'get_raw_ticks_per_ms', depending on which is simpler to implement and use). I think you are correct, it was the rate function desired here. I think the best way to go is use a "get_raw_tick_rate_in_mhz" function, because it is probably the easiest one to implement, and in many cases something like it already exists. - Calling of get_raw_ticks() regularly in the main loop (how ofter will depend on the raw tick rate, but I image it will never be necessary to call more often than once every few minutes) That's to keep track of get_raw_ticks() rollovers, right? And then the get_timer function (which, again, I would prefer to have '... in ms' expressed in its name) would call get_raw_ticks() in confidence that at most one rollover may have occurred since the last time the helper function was called, so a simple difference of the current vs last tick value will always be correct. Exactly so. Note that this same function probably needs to be called in udelay for the same reason. More precisely, the get_timer function will call get_raw_ms, which will call get_raw_ticks. I think it may be better to move get_timer "down" a level in the hierarchy, so we don't need a get_raw_ms. get_timer would then be part of the HAL. One would use a get_timer(0) in order to do what get_raw_ms alone would have done. If the user had a good reason, he would then override get_timer with his own version. What do you think Graeme? It reduces the nesting depth by one level. As for the name change to get_timer_in_ms, I would support it. Naturally, such a change would be up to Mr. Denk. Since by definition that is what the function does, it seems to be a good change from my point of view. - If the hardware implements a native 32-bit 1ms counter, no call in the main loop is required - An optional HAL function u32 get_raw_us() which can be used for performance profiling (must wrap correctly) Hi All, Graeme, I think you have stated exactly what is the "best" approach to this problem. I will provide a version of "get_raw_ms" that is initialized using get_raw_tick_rate that will work for all "reasonable" values of raw_tick_rate. This will be the "generic" solution. Both the initialization function and the get_raw_ms function can be overridden if there is reason to do so, like "exact" clock rates. I will do the same with get_raw_us. This will be posted sometime on Monday for people to review, and to make sure I didn't get too far off base. Thank you to both Graeme and Reinhard for looking at/working on this.. Hopefully, this solution will put this timing issue to rest for all future ports as well as the presently existing ones. In Greame's description, I did not see a get_raw_ms, only a get_raw_us. Was this last one a typo or is that a third HAL function? get_raw_ms was referenced as a library function a few lines above. Right now, I think the functionality we require from the HAL is 1. get_raw_tick_rate_in_mhz 2. get_raw_ms 3. get_raw_ticks 4. (optional)get_raw_us There is also APIs for these functions called get_timer. I think we need to add a call to another function called "initialize_timer_system" or similar that will initialize the data structures in gd by calling get_raw_tick_rate_in_mhz. Additionally, I think we need to provide a udelay function, simply because it can interact with calling get_raw_ms often enough. We are somewhat caught between two fires here, in t
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/22/2011 8:26 PM, Reinhard Meyer wrote: > Dear J. William Campbell, >> On 5/22/2011 1:15 AM, Reinhard Meyer wrote: >>> Dear J. William Campbell, >>> >>> please demonstrate for me (and others), by a practical example, >>> how _any_ arithmetic (even less with just shifts and multiplies) >>> can convert a free running 3.576 MHz (wild example) free running >>> 32 bit counter (maybe software extended to 64 bits) into a ms >>> value that will properly wrap from 2**32-1 to 0 ? >> Hi All >> I accept the challenge! I will present two ways to do this, one using >> a 32 bit by 16 bit divide, and one using only multiplies. >> This first method is "exact", in that there is no difference in >> performance from a "hardware" counter ticking at the 1 ms rate. This >> is accomplished by operating the 1 ms counter based on the delta time >> in the hardware time base. It is necessary to call this routine often >> enough that the hardware counter does not wrap more than once between >> calls. This is not really a problem, as this time is 1201 seconds or >> so. If the routine is not called for a long time, or at the first >> call, it will return a timer_in_ms value that will work for all >> subsequent calls that are within a hardware rollover interval. Since >> the timer in ms is a 32 bit number anyway. The same rollover issue >> will exist if you "software extend" the timer to 64 bits. You must >> assume 1 rollover. If it is more than 1, the timer is wrong. >> >> >> The variables in the gd are >> u32 prev_timer; >> u32 timer_in_ms; >> u16 timer_remainder; >> >> /* gd->timer remainder must be initialized to 0 (actually, an number >> less than 3576, but 0 is nice). Other two variables don't matter but >> can be initialized if desired */ >> >> u32 get_raw_ms() >> { >> u32 delta; >> u32 t_save; >> >> read(t_save); /* atomic read of the hardware 32 bit timer running at >> 3.576 MHz */ >> delta_t = (t_save - gd->prev_timer) ; >> >> gd->prev_timer = t_save; >> /* >> Hopefully, the following two lines only results in one hardware >> divide when optimized. If your CPU has no hardware divide, or if it >> slow, see second method . >> */ >> gd->timer_in_ms += delta_t / 3576; /* convert elapsed time to ms */ >> gd->timer_remainder += delta_t % 3576; /* add in remaining part not >> included above */ >> if (gd->timer_remainder >= 3576) /* a carry has been detected */ >> { >> ++gd->timer_in_ms; >> gd->timer_remainder -= 3576; /* fix remainder for the carry above */ >> } >> >> return(gd->timer_in_ms) >> } > > Thank you! Basically this is similar to a Bresenham Algorithm. Hi All, Yes, I think you are correct. I didn't know it by that name, but i think you are correct. It is a bit different use of the idea, but it is very similar. > >> >> This approach works well when the number of ticks per ms is an exact >> number representable as a small integer, as it is in this case. It is >> exact with a clock rate of 600 MHz, but is not exact for a clock rate >> of 666 MHz. 67 is not an exact estimate of ticks per ms, It is >> off by 0.5 % That should be acceptable for use as a timeout >> delay. The accumulated error in a 10 second delay should be less than >> 0.5 ms. > > I would think the non exact cases result in such a small error that > can be > tolerated. We are using the ms tick for timeouts, not for providing a > clock > or exact delays. We should just round up when calculating the divider. Yes, we should round off the divider value, so 666.66 MHz rounds to 67 ticks/Ms while 333.33 MHz rounds to 33 ticks/Ms. > > Hence the hick-ups that result when this is not called frequent enough to > prevent a multiple rollover of the raw value between calls do not matter > either (they should be just documented). Good, I am glad we agree on this also. > >> >> There is a way that the divide above can be approximated by >> multiplying by an appropriate fraction, taking the resulting delta t >> in ms, multiplying it by 3576, and subtracting the product from the >> original delta to get the remainder. This is the way to go if your >> CPU divides slowly or not at all. This approach is presented below. >> > [...] > > Optimizations would be up to the implementer of such a hardware and work > only if the divider is a compile time constant. Often the divider will be > run time determined (AT91 for example). Correct. I will provide a "generic" version that computes the constants at run time. If the clock rate is a constant, these routines can be overridden at compile/link time. This generic version should be available on Monday for further review. Best Regards, Bill Campbell > > Reinhard > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/22/2011 6:42 PM, Graeme Russ wrote: > OK, so in summary, we can (in theory) have: > - A timer API in /lib/ with a single u32 get_timer(u32 base) function > - A HAL with two functions > - u32 get_raw_ticks() > - u32 get_raw_tick_rate() which returns the tick rate in kHz (which > max's out at just after 4GHz) > - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks() > and get_tick_rate() to correctly maintain the ms counter used by > get_timer() - This function can be weak (so next point) > - If the hardware supports a native 32-bit ms counter, get_raw_ms() > can be overridden to simply return the hardware counter. In this case, > get_raw_ticks() would return 1 > - Calling of get_raw_ticks() regularly in the main loop (how ofter will > depend on the raw tick rate, but I image it will never be necessary > to call more often than once every few minutes) > - If the hardware implements a native 32-bit 1ms counter, no call in > the main loop is required > - An optional HAL function u32 get_raw_us() which can be used for > performance profiling (must wrap correctly) Hi All, Graeme, I think you have stated exactly what is the "best" approach to this problem. I will provide a version of "get_raw_ms" that is initialized using get_raw_tick_rate that will work for all "reasonable" values of raw_tick_rate. This will be the "generic" solution. Both the initialization function and the get_raw_ms function can be overridden if there is reason to do so, like "exact" clock rates. I will do the same with get_raw_us. This will be posted sometime on Monday for people to review, and to make sure I didn't get too far off base. Thank you to both Graeme and Reinhard for looking at/working on this.. Hopefully, this solution will put this timing issue to rest for all future ports as well as the presently existing ones. On a different note, the "graylisting" application is causing some (about half) of my replies to the list from showing up. The messages are going to the specified individuals correctly, but not always to the list. This is apparently because my ISP has so many different network address available that the probability of using the same one (or same subnet) is not very high. Is there anything that can be done about this? Best Regards, Bill Campbell > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/22/2011 5:02 PM, Graeme Russ wrote: > Dear Reinhard, > > On Sun, May 22, 2011 at 6:15 PM, Reinhard Meyer > wrote: >> Dear J. William Campbell, >> >> please demonstrate for me (and others), by a practical example, >> how _any_ arithmetic (even less with just shifts and multiplies) >> can convert a free running 3.576 MHz (wild example) free running >> 32 bit counter (maybe software extended to 64 bits) into a ms >> value that will properly wrap from 2**32-1 to 0 ? >> >> I fail to see how that will be possible... > These may help: > > http://blogs.msdn.com/b/devdev/archive/2005/12/12/502980.aspx > http://www.hackersdelight.org/divcMore.pdf > > of Google 'division of unsigned integer by a constant' > > Basically you will be dividing by a constant value of 3576 which can be > highly optimised Hi All, Yes, you can do it this way. I prefer not to use properties of the constant because it doesn't allow multiple clock rates easily. The code I posted just a moment ago is easy to change for different rates because it does not rely much on the exact bit structure of 3576. There are simple formulas for all the "magic numbers" in what I posted, so it is easy to change the input frequency. Best Regards, Bill Campbell > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/22/2011 1:15 AM, Reinhard Meyer wrote: Dear J. William Campbell, please demonstrate for me (and others), by a practical example, how _any_ arithmetic (even less with just shifts and multiplies) can convert a free running 3.576 MHz (wild example) free running 32 bit counter (maybe software extended to 64 bits) into a ms value that will properly wrap from 2**32-1 to 0 ? Hi All I accept the challenge! I will present two ways to do this, one using a 32 bit by 16 bit divide, and one using only multiplies. This first method is "exact", in that there is no difference in performance from a "hardware" counter ticking at the 1 ms rate. This is accomplished by operating the 1 ms counter based on the delta time in the hardware time base. It is necessary to call this routine often enough that the hardware counter does not wrap more than once between calls. This is not really a problem, as this time is 1201 seconds or so. If the routine is not called for a long time, or at the first call, it will return a timer_in_ms value that will work for all subsequent calls that are within a hardware rollover interval. Since the timer in ms is a 32 bit number anyway. The same rollover issue will exist if you "software extend" the timer to 64 bits. You must assume 1 rollover. If it is more than 1, the timer is wrong. The variables in the gd are u32 prev_timer; u32 timer_in_ms; u16 timer_remainder; /* gd->timer remainder must be initialized to 0 (actually, an number less than 3576, but 0 is nice). Other two variables don't matter but can be initialized if desired */ u32 get_raw_ms() { u32 delta; u32 t_save; read(t_save); /* atomic read of the hardware 32 bit timer running at 3.576 MHz */ delta_t = (t_save - gd->prev_timer) ; gd->prev_timer = t_save; /* Hopefully, the following two lines only results in one hardware divide when optimized. If your CPU has no hardware divide, or if it slow, see second method . */ gd->timer_in_ms += delta_t / 3576; /* convert elapsed time to ms */ gd->timer_remainder += delta_t % 3576; /* add in remaining part not included above */ if (gd->timer_remainder >= 3576) /* a carry has been detected */ { ++gd->timer_in_ms; gd->timer_remainder -= 3576; /* fix remainder for the carry above */ } return(gd->timer_in_ms) } This approach works well when the number of ticks per ms is an exact number representable as a small integer, as it is in this case. It is exact with a clock rate of 600 MHz, but is not exact for a clock rate of 666 MHz. 67 is not an exact estimate of ticks per ms, It is off by 0.5 % That should be acceptable for use as a timeout delay. The accumulated error in a 10 second delay should be less than 0.5 ms. There is a way that the divide above can be approximated by multiplying by an appropriate fraction, taking the resulting delta t in ms, multiplying it by 3576, and subtracting the product from the original delta to get the remainder. This is the way to go if your CPU divides slowly or not at all. This approach is presented below. the vaues in gd are as follows: u32 prev_timer; u32 timer_in_ms; /* One tick of the 3.576 MHz timer corresponds to 1/3.576/1000 ms, or 0.000279642 ms. Scale the fraction by 65536 (16 bit shift), you get 37532.9217 */ u32 get_raw_ms(void) { u32 t_save; u32 temp; /* read the hardware 32 bit timer running at 3.576 MHz */ read_timer_atomic(t_save); t_save -= gd->prev_timer; /* get "delta time" since last call */ gd->prev_timer += t_save; /* assume we will use all of the counts */ /* * This first while loop is entered for any delta time > about 18.3 ms. The * while loop will execute twice 2.734% of the time, otherwise once. */ while (t_save > 65535) { temp = t_save >> 16; /* extract msb */ temp = ((temp * 37532) + ((temp * 60404) >> 16)) >> 11; /* temp = (temp * 37532) >> 11; */ gd->timer_in_ms += temp; t_save -= temp * 3576; } /* * This second while loop is entered for 94.837% of all possible delta times, * 0 through 0X. The second while loop will execute twice 0.037% of * the time, otherwise once. */ while (t_save >= 3576) { temp = (t_save * 37532) >> (16 + 11); if (temp == 0) temp = 1; /* we know that 1 works for sure */ gd->timer_in_ms += temp; t_save -= temp * 3576; } gd->prev_timer -= t_save; /* restore any counts we didn't use this time */ return gd->timer_in_ms; } I have tested this code and it seems to work fine for me. I have attached a more readable copy as "example .c" for those who wish to play around with it. In my original post, I had a version of this code that could be used with d
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/21/2011 11:23 PM, Graeme Russ wrote: > On 22/05/11 14:26, Reinhard Meyer wrote: >> Dear Graeme Russ, >>> Hi All, >>> >>> I've just had a good look through the timer API code with the view of >>> rationalising it and fixing up some of the inconsistencies. What I found >>> was a little scarier than I expected! Anyway, here is a write-up of what I >>> found - Please comment >> We have been at this discussion a multiple of times :) but never reached a >> consent. >> >> However, at current master, I have reduced at91 architecture to only use >> get_timer(base), set_timer() never existed and reset_timer() has been >> removed. > Excellent > >> As it showed up recently, common cfi code still calls reset_timer() - which >> certainly >> can be fixed with little effort... > Yes, this is one of the easy fixes as all call sites already use the start > = get_timer(0), elapsed = get_timer(start) convention anyway - The > reset_timer() calls are 100% redundant (provided get_timer() behaves > correctly at the 32-bit rollover for all arches) > >>> The U-Boot timer API is not a 'callback' API and cannot 'trigger' a >>> function call after a pre-determined time. >> that would be too complex to implement and of little use in a single task >> system. u-boot can do fine with polling. > I am in no way suggesting this - I just want to clarify the API for anyone > who needs to use it > >>> NOTE: http://lists.denx.de/pipermail/u-boot/2010-June/073024.html appears >>> to imply the following implementation of get_timer() is wrong: >>> >>> ulong get_timer(ulong base) >>> { >>> return get_timer_masked() - base; >>> } >> Is is not wrong as long as get_timer_masked() returns the full 32 bit space >> of numbers and 0x is followed by 0x. Most implementations >> probably do NOT have this property. >>> U-Boot Timer API Details >>> >>> There are currently three functions in the U-Boot timer API: >>> ulong get_timer(ulong start_time) >> As you point out in the following, this is the only function required. >> However it REQUIRES that the internal timer value must exploit the full >> 32 bit range of 0x to 0x before it wraps back to 0x. > So this needs to be clearly spelt out in formal documentation > >>> Rationalising the API >>> = >>> Realistically, the value of the free running timer at the start of a timing >>> operation is irrelevant (even if the counter wraps during the timed period). >>> Moreover, using reset_timer() and set_timer() makes nested usage of the >>> timer API impossible. So in theory, the entire API could be reduced to >>> simply get_timer() >> Full ACK here !!! > I don't think there will be much resistance to this > >>> 3. Remove reset_timer_masked() >>> -- >>> This is only implemented in arm but has propagated outside arch/arm and >>> into board/ and drivers/ (bad!) >>> >>> regex "[\t ]*reset_timer_masked\s*\([^)]*\);" reveals 135 callers! >>> >>> A lot are in timer_init() and reset_timer(), but the list includes: >>>- arch/arm/cpu/arm920t/at91rm9200/spi.c:AT91F_SpiWrite() >>>- arch/arm/cpu/arm926ejs/omap/timer.c:__udelay() >>>- arch/arm/cpu/arm926ejs/versatile/timer.c:__udelay() >>>- arch/arm/armv7/s5p-common/timer.c:__udelay() >>>- arch/arm/lh7a40x/timer.c:__udelay() >>>- A whole bunch of board specific flash drivers >>>- board/mx1ads/syncflash.c:flash_erase() >>>- board/trab/cmd_trab.c:do_burn_in() >>>- board/trab/cmd_trab.c:led_blink() >>>- board/trab/cmd_trab.c:do_temp_log() >>>- drivers/mtd/spi/eeprom_m95xxx.c:spi_write() >>>- drivers/net/netarm_eth.c:na_mii_poll_busy() >>>- drivers/net/netarm_eth.c:reset_eth() >>>- drivers/spi/atmel_dataflash_spi.c/AT91F_SpiWrite() >> Fixed in current master. > Excellent. I have not pulled master for a little while, guess I should > >>>- If hardware supports microsecond resolution counters, get_timer() could >>> simply use get_usec_timer() / 1000 >> That is wrong. Dividing 32 bits by any number will result in a result that >> does not >> exploit the full 32 bit range, i.e. wrap from (0x/1000) to >> 0x, >> which makes time differences go wrong when they span across such a wrap! >> > Yes, this has already been pointer out - 42 bits are needed as a bare > minimum. However, we can get away with 32-bits provided get_timer() is > called at least every 71 minutes > > P.S. Can we use the main loop to kick the timer? > >>>- get_usec_timer_64() could offer a longer period (584942 years!) >> Correct. And a "must be" when one uses such division. > Unless we can rely on get_timer() to be called at least every 71 minutes in > which case we can handle the msb's without error in software > >> But you have to realize that most hardware does not provide a simple means to >> implement a timer that runs in either exact microseconds or exact >> milliseconds. > This is where t
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/21/2011 9:26 PM, Reinhard Meyer wrote: > Dear Graeme Russ, >> Hi All, >> >> I've just had a good look through the timer API code with the view of >> rationalising it and fixing up some of the inconsistencies. What I found >> was a little scarier than I expected! Anyway, here is a write-up of what I >> found - Please comment > We have been at this discussion a multiple of times :) but never reached a > consent. > > However, at current master, I have reduced at91 architecture to only use > get_timer(base), set_timer() never existed and reset_timer() has been removed. > > As it showed up recently, common cfi code still calls reset_timer() - which > certainly > can be fixed with little effort... > >> At the lowest level, the U-Boot timer API consists of a unsigned 32-bit >> free running timestamp which increments every millisecond (wraps around >> every 4294967 seconds, or about every 49.7 days). The U-Boot timer API >> allows: >>- Time deltas to be measured between arbitrary code execution points >> ulong start_time; >> ulong elapsed_time; >> >> start_time = get_timer(0); >> ... >> elapsed_time = get_timer(start_time); >> >>- Repetition of code for a specified duration >> ulong start_time; >> >> start_time = get_timer(0); >> >> while (get_timer(start_time)< REPEAT_TIME) { >> ... >> } >> >>- Device timeout detection >> ulong start_time; >> >> send_command_to_device(); >> start = get_timer (0); >> while (device_is_busy()) { >> if (get_timer(start)> TIMEOUT) >> return ERR_TIMOUT; >> udelay(1); >> } >> return ERR_OK; > correct. > >> The U-Boot timer API is not a 'callback' API and cannot 'trigger' a >> function call after a pre-determined time. > that would be too complex to implement and of little use in a single task > system. u-boot can do fine with polling. > >> NOTE: http://lists.denx.de/pipermail/u-boot/2010-June/073024.html appears >> to imply the following implementation of get_timer() is wrong: >> >> ulong get_timer(ulong base) >> { >> return get_timer_masked() - base; >> } > Is is not wrong as long as get_timer_masked() returns the full 32 bit space > of numbers and 0x is followed by 0x. Most implementations > probably do NOT have this property. > >> U-Boot Timer API Details >> >> There are currently three functions in the U-Boot timer API: >> ulong get_timer(ulong start_time) > As you point out in the following, this is the only function required. > However it REQUIRES that the internal timer value must exploit the full > 32 bit range of 0x to 0x before it wraps back to 0x. > >> void set_timer(ulong preset) >> void reset_timer(void) >> >> get_timer() returns the number of milliseconds since 'start_time'. If >> 'start_time' is zero, therefore, it returns the current value of the >> free running counter which can be used as a reference for subsequent >> timing measurements. >> >> set_timer() sets the free running counter to the value of 'preset' >> reset_timer() sets the free running counter to the value of zero[1]. In >> theory, the following examples are all identical >> >> >> ulong start_time; >> ulong elapsed_time; >> >> start_time = get_timer(0); >> ... >> elapsed_time = get_timer(start_time); >> >> ulong elapsed_time; >> >> reset_timer(); >> ... >> elapsed_time = get_timer(0); >> >> ulong elapsed_time; >> >> set_timer(0); >> ... >> elapsed_time = get_timer(0); >> >> >> [1] arch/arm/cpu/arm926ejs/at91/ and arch/arm/cpu/arm926ejs/davinci/ are >> exceptions, they set the free running counter to get_ticks() instead > Not anymore on at91. >> Architecture Specific Peculiarities >> === >> ARM >>- Generally define get_timer_masked() and reset_timer_masked() >>- [get,reset]_timer_masked() are exposed outside arch\arm which is a bad >> idea as no other arches define these functions - build breakages are >> possible although the external files are most likely ARM specific (for >> now!) >>- Most CPUs define their own versions of the API get/set functions which >> are wrappers to the _masked equivalents. These all tend to be the same. >> The API functions could be moved into arch/arm/lib and made weak for >> the rare occasions where they need to be different >>- Implementations generally look sane[2] except for the following: >> - arm_intcm - No timer code (unused CPU arch?) >> - arm1136/mx35 - set_timer() is a NOP >> - arm926ejs/at91 - reset_timer() sets counter to g
Re: [U-Boot] [RFC] Review of U-Boot timer API
On 5/21/2011 5:06 PM, Graeme Russ wrote: > On 22/05/11 01:33, J. William Campbell wrote: >> On 5/21/2011 5:38 AM, Graeme Russ wrote: >>> Hi All, >>> 4. Implement microsecond API - get_usec_timer() >>> --- >>>- Useful for profiling >>>- A 32-bit microsecond counter wraps in 71 minutes - Probably OK for most >>> U-Boot usage scenarios >>>- By default could return get_timer() * 1000 if hardware does not support >>> microsecond accuracy - Beware of results> 32 bits! >> Hi All, >>I think the multiply overflowing an unsigned long is ok here as long >> as the timeout value you desire is less than 71 seconds. This assumes that >> the CPU returns the correct lower 32 bits when overflow occurs, but I think >> this is the "normal" behavior(?) > I think you mean 71 minutes - 2^32 = 4294967296 (usec) = 4294967.296 (msec) > = 4294.967296 s = 71.582788267 minutes > > So provided any millisecond timer using a 32-bit microsecond timer as a > time base is called every 71 minutes, a 32-bit microsecond timer should > suffice to keep the msb's accurately maintained by software > Hi Graeme, Yes, you are certainly correct, it is minutes, not seconds. >>>- If hardware supports microsecond resolution counters, get_timer() could >>> simply use get_usec_timer() / 1000 >> I think this actually is NOT equivalent to the "current" API in >> that the full 32 bits of the timer is not available and as a result the >> "wrapping" properties of a 32 bit subtract for delta times will not work >> properly. If a "larger" counter is available in hardware, then it is >> certainly possible to do a 64 by 32 bit divide in get_timer, but probably >> you don't want to do that either. As previously discussed, it is possible >> to extract a 32 bit monotonic counter of given resolution (microsecond or >> millisecond resolution) from a higher resolution counter using a shift to >> "approximately" the desired resolution followed by a couple of multiply/add >> functions of 32 bit resolution. To do this with a microsecond resolution, >> a 42 bit or larger timer is required. The "extra" bits can be provided in > Of course, how silly of me - To downscale the microsecond timer down to > milliseconds you need to have at least 1000 times more resolution > (9.965784285 bits) - It was late ;) > >> software as long as the get_timer/get_usec_timer routines are called more >> often than every 71/2 sec, so that a correct delta in microseconds can be >> obtained. Note that when the timer is not actively in use (not called >> often enough), the millisecond timer msb would stop updating, but that >> wouldn't matter. > Minutes - see above Correct. >> If the hardware supports sub-microsecond accuracy in a "longer" >> register, say 64 bits, you can just convert the 64 bit hardware timer to 32 >> bit microseconds or milliseconds by a shift and 32 bit multiplies > Yes > >>Good luck with this effort. I think getting the timer API and also >> the method of implementation of the interface to the hardware to be the >> same across all u-boot architectures is a very good idea, and it is >> possible. However, it is a quite a bit of work and I am glad you are brave >> enough to try! > It's all there already - it just needs a little bit of housekeeping :) Correct, if we do not worry too much about the "low level" details of get_timer. It looks to me like there is a lot of cruft there, depending on which architecture one looks at. Many implementations create a parallel universe of get_ticks or some similar timing system that is then used to support get_timer. However, the get_ticks routines are also used in timeouts elsewhere in the code. Searching on "get_timer" doesn't find these non-standard usages. It would be nice to fix these also, but every bit does help! Best Regards, Bill Campbell > Regards, > > Graeme > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] ARM timing code refactoring
On 1/24/2011 5:02 AM, Wolfgang Denk wrote: > Dear Albert ARIBAUD, > > In message<4d3d2942.4060...@free.fr> you wrote: >> That is assuming a 64-bit timebase, isn't it? for CPUs / SoCs that don't> >> have such a timebase but only a 32-bit timer, the bogo_ms/jiffy would> >> not go through the full 32-bit range, which would cause issues with the> >> timing loops on rollover -- and while a timeout of more than 65 sec may> >> not be too likely, a timeout starting near the wraparound value of> >> bogo_ms still could happen. Hi All, If you use my approach of shifting right by n bits (in order to get a counter that has "about" 1 ms resolution), zeros are shifted into the top. The counter would never reach more than (32-n) bits in magnitude. This is easily fixed by creating a "virtual" n msbs for the counter, as must be done if a 64 bit counter is to be simulated. Actually, I only need "n" bits, but in any case it requires detecting that the counter has "backed up" mod 32 bits and if so, increase the virtual counter by 1. If you really, really, really want a timer that "ticks" at a 1 ms rate, instead of bogo_ms/jiffies, I propose the following: u32 get_time() { u32 delta_t_bogo_ms; u32 t_save; #if defined(TIMER_IS_64_BITS) u64 tick; read(&tick); t_save = tick >> gd->timer_shift; delta_t_bogo_ms = t_bogo_ms - gd->prev_timer; #else read(t_save); delta_t_bogo_ms = (t_save - gd->prev_timer) >> gd->timer_shift; #endif gd->prev_timer = t_save; /* save previous counter */ if (delta_t_bogo_ms < gd->bogo_ms_per_65_sec) { gd->fract_timer+= delta_t_bogo_ms * gd->cvt_bogo_ms_to_ms; /* accumulate fraction of ms */ gd->timer_in_ms += gd->fract_timer >> 16; gd->fract_timer &= 0X; } else gd->fract_timer = 0; /* start accumulating from 0 fraction */ return(gd->timer_in_ms) } This routine will create a timer in ms, that will be accurate as long as this routine is called either once about ever 65 seconds or once per time base rollover, whichever is less. Nested timer use is ok. If the timer is not called frequently enough, it will return the same value twice, but after that it will start timing normally. This shouldn't matter, as the second returned value will be the start of a timing loop. Timeout values can be the full 32 bits, as long as you keep calling the routine frequently enough. No initialization is required. Note that you can (and probably should) use the bottom 32 bits of the hardware timebase as a 32 bit timebase unless the clock would overflow in 65 seconds (running faster than about 66 MHz), or if you want to relax the 65 seconds. If you want to save a word in the gd data, use 0X1 instead of bogo_ms_per_65_sec (or a more precise value if you know it). Note that gd->timer_shift and gd->cvt_bogo_ms_to_ms can also be replaced by constants if the clock speed is fixed. This is more expensive than using the bogo_ms timer, but does have the advantage that everything is in ms. FWIW, I think converting from ms to some other unit for loop control is fine, as long as we have a standard routine to do that that is "cheap". However, others may not agree. For sure, passing around 64 bit tick values to do this process is, IMHO, vast overkill and not a good general solution, as many processors really don't like to do 64 bit operations. Best Regards. Bill Campbell > Sorry, but I don't get it. What exactly is the problem with a 32 bit > counter, and why would it not go through the full 32-bit range? > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] ARM timing code refactoring
On 1/23/2011 2:57 PM, Wolfgang Denk wrote: > Dear Reinhard Meyer, > > In message<4d3c9bfc.1010...@emk-elektronik.de> you wrote: get_timer() returns a monotonous upward counting time stamp with a resolution of milliseconds. After reaching ULONG_MAX the timer wraps around to 0. >> Exactly that wrap makes the situation so complicated, since the simple code >> u32 get_timer(void) >> { >>return (ticks * 1000ULL) / tickspersec; >> } >> won't do that wrap. > Do you have a better suggestion? > The get_timer() implementation may be interrupt based and is only available after relocation. >> Currently it is used before relocation in some places, I think I have >> seen it in NAND drivers... That would have to be changed then. > Indeed. It is unreliable or even broken now. > >> This is already implemented functionally very closely (apart from factoring >> and the >> get_timer(void) change) to this in AT91, the only (academic) hitch is that >> it will >> burp a few billion years after each reset :) >> What bothers me is the need for 64 bit mul/div in each loop iteration, for >> CPUs without >> hardware for that this might slow down data transfer loops of the style >> >> u32 start_time = get_timer(); >> do { >> if ("data_ready") >> /* transfer a byte */ >> if (get_timer() - start_time> timeout) >> /* fail and exit loop */ >> } while (--"bytestodo"> 0); >> >> since get_timer() will be somewhat like: >> >> return (tick * 1000ULL) / tickspersec; >> >> As I stated before, tickspersec is a variable in, for example, AT91. So the >> expression cannot be optimized by the compiler. > I don't think this is the only way to implement this. How does Linux > derive time info from jiffies? Hi All, In order to avoid doing 64 bit math, we can define a "jiffie" or a "bogo_ms" that is the 64 bit timebase shifted right such that the lsb of the bottom 32 bits has a resolution of between 0.5 ms and 1 ms. It is then possible to convert the difference between two jiffie/bogo_ms values to a number of ms using a 32 bit multiply and a right shift of 16 bits, with essentially negligible error. get_bogo_ms() would return a 32 bit number in bogo_ms, thus the timing loop would be written. u32 start_time = get_bogo_ms(); do { if ("data_ready") /* transfer a byte */ if (bogo_ms_to_ms(get_timer() - start_time)> TIMEOUT_IN_MS) /* fail and exit loop */ } while (--"bytestodo"> 0); u32 get_bogo_ms() { u64 tick; read(tick); return (tick>> gd->timer_shift); } u32 bogo_ms_to_ms(u32 x) { /* this code assumes the resulting ms will be between 0 and 65535, or 65 seconds */ return ((x * gd->cvt_bogo_ms_to_ms)>> 16); /* cvt_bogo_ms_to_ms is a 16 bit binary fraction */ } All the above code assumes timeouts are 65 seconds or less, which I think is probably fair. Conversion of ms values up to 65 seconds to bogo_ms is also easy, and a 32 bit multiplied result is all that is required. What is not so easy is converting a 32 bit timer value to ms. It can be done if the CPU can do a 32 by 32 multiply to produce a 64 bit result, use the msb, and possibly correct the result by an add if bit 32,of the timer is set. You need a 33 bit counter in bogo_ms to get a monotonic, accurate 32 bit counter in ms. The powerpc can use a mulhw operation to do this, and any CPU that will produce a 64 bit product can do this. However, many CPUs do not produce 64 bit products easily. Using division to do these operations are even less appealing, as many CPUs do not provide hardware division at all. Since it is not necessary to do this conversion to easily use timeouts with 1 ms resolution and accuracy, I think the idea of not using a timer in ms but rather bogo_ms/jiffies is possibly better? Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] TIMER cleanup RFC, was: [PATCH 4/4] arm920t/at91/timer: replace bss variables by gd
On 11/30/2010 7:48 AM, Reinhard Meyer wrote: > Dear J. William Campbell, >> On 11/30/2010 1:14 AM, Reinhard Meyer wrote: >>> Dear Wolfgang Denk, >>> >>> what we really need is only a 32 bit monotonous free running tick that >>> increments >>> at a rate of at least 1 MHz. As someone pointed out a while ago, even >>> at 1GHz that would >>> last for four seconds before it rolls over. But a 1HGz counter could >>> be 64 bit internally >>> and always be returned as 32 bits when it is shifted right to bring it >>> into the "few" MHz >>> range. >>> >>> Any architecture should be able to provide such a 32 bit value. On >>> powerpc that would >>> simply be tbu|tbl shifted right a few bits. >>> >>> An architecture and SoC specific timer should export 3 functions: >>> >>> int timer_init(void); >>> u32 get_tick(void); /* return the current value of the internal free >>> running counter */ >>> u32 get_tbclk(void); /* return the rate at which that counter >>> increments (per second) */ >>> >>> A generic timer function common to *most* architectures and SoCs would >>> use those two >>> functions to provice udelay() and reset_timer() and get_timer(). >>> Any other timer functions should not be required in u-boot anymore. >>> >>> However get_timer() and reset_timer() are a bit of a functional problem: >>> >>> currently reset_timer() does either actually RESET the free running >>> timer (BAD!) or >>> remember its current value in another (gd-)static variable which later >>> is subtracted >>> when get_timer() is called. That precludes the use of several timers >>> concurrently. >>> >>> Also, since the 1000Hz base for that timer is usually derived from >>> get_tick() by >>> dividing it by some value, the resulting 1000Hz base is not exhausting >>> the 32 bits >>> before it wraps to zero. > (###) Hi All, A correct method to provide a full 32 bits of resolution at a 1 ms rate is simple. It requires maintaining a software timer in ms. This timer is updated by converting the difference in the hardware time base to ms and then adding the correct number of ms to the software timer. As previously discussed, this only requires calling get_timer once a second or so. The conversion can be made simply in most cases, especially if the clock rate is a "nice" number. Even in the worst case, it is not too hard to "adjust" the saved hardware timer value to contain any remainder ticks. Yes, if this is done incorrectly, the results are bad. >>> Therefore I propose two new functions that are to replace >>> reset_timer() and get_timer(): >>> >>> u32 init_timeout(u32 timeout_in_ms); /* return the 32 bit tick value >>> when the timeout will be */ >>> bool is_timeout(u32 reference); /* return true if reference is in the >>> past */ >>> >>> A timeout loop would therefore be like: >>> >>> u32 t_ref = timeout_init(3000);/* init a 3 second timeout */ >>> >>> do ... loop ... while (!is_timeout(t_ref)); >>> >>> coarse sketches of those functions: >>> >>> u32 init_timeout(u32 ms) >>> { >>> return get_ticks() + ((u64)get_tbclk() * (u64)ms) / (u64)1000; >>> } >>> >>> bool is_timeout(u32 reference) >>> { >>> return ((int)get_ticks() - (int)reference)> 0; >>> } >>> >>> Unfortunately this requires to "fix" all uses of get_timer() and >>> friends, but I see no other >>> long term solution to the current incoherencies. >>> >>> Comments welcome (and I would provide patches)... >> Hi All, >>The idea of changing the get_timer interface to the >> init_timeout/is_timeout pair has the advantage that it is only necessary >> to change the delay time in ms to an internal timebase once, and after >> that, only a 32-bit subtraction is required. > Exactly. >> I do not however like the >> idea of using 64 bit math to do so, as on many systems this is quite >> expensive. However, this is a feature that can be optimized for >> particular CPUs. > 64 Bit math is only necessary when get_tbclk() times the maximun anticipated > timeout (in ms) is larger than 32 bits. This happens easyly when tbclk is a > few MHz. What I think you mean to say here is that the delay time fits easily into 32 bits for all reasonable clock rates. For instance, a 1 GHz clock rate yields a
Re: [U-Boot] TIMER cleanup RFC, was: [PATCH 4/4] arm920t/at91/timer: replace bss variables by gd
On 11/30/2010 1:14 AM, Reinhard Meyer wrote: > Dear Wolfgang Denk, > > what we really need is only a 32 bit monotonous free running tick that > increments > at a rate of at least 1 MHz. As someone pointed out a while ago, even at 1GHz > that would > last for four seconds before it rolls over. But a 1HGz counter could be 64 > bit internally > and always be returned as 32 bits when it is shifted right to bring it into > the "few" MHz > range. > > Any architecture should be able to provide such a 32 bit value. On powerpc > that would > simply be tbu|tbl shifted right a few bits. > > An architecture and SoC specific timer should export 3 functions: > > int timer_init(void); > u32 get_tick(void); /* return the current value of the internal free running > counter */ > u32 get_tbclk(void); /* return the rate at which that counter increments (per > second) */ > > A generic timer function common to *most* architectures and SoCs would use > those two > functions to provice udelay() and reset_timer() and get_timer(). > Any other timer functions should not be required in u-boot anymore. > > However get_timer() and reset_timer() are a bit of a functional problem: > > currently reset_timer() does either actually RESET the free running timer > (BAD!) or > remember its current value in another (gd-)static variable which later is > subtracted > when get_timer() is called. That precludes the use of several timers > concurrently. > > Also, since the 1000Hz base for that timer is usually derived from get_tick() > by > dividing it by some value, the resulting 1000Hz base is not exhausting the 32 > bits > before it wraps to zero. > > Therefore I propose two new functions that are to replace reset_timer() and > get_timer(): > > u32 init_timeout(u32 timeout_in_ms); /* return the 32 bit tick value when the > timeout will be */ > bool is_timeout(u32 reference); /* return true if reference is in the past */ > > A timeout loop would therefore be like: > > u32 t_ref = timeout_init(3000); /* init a 3 second timeout */ > > do ... loop ... while (!is_timeout(t_ref)); > > coarse sketches of those functions: > > u32 init_timeout(u32 ms) > { > return get_ticks() + ((u64)get_tbclk() * (u64)ms) / (u64)1000; > } > > bool is_timeout(u32 reference) > { > return ((int)get_ticks() - (int)reference)> 0; > } > > Unfortunately this requires to "fix" all uses of get_timer() and friends, but > I see no other > long term solution to the current incoherencies. > > Comments welcome (and I would provide patches)... Hi All, The idea of changing the get_timer interface to the init_timeout/is_timeout pair has the advantage that it is only necessary to change the delay time in ms to an internal timebase once, and after that, only a 32-bit subtraction is required. I do not however like the idea of using 64 bit math to do so, as on many systems this is quite expensive. However, this is a feature that can be optimized for particular CPUs. I also REALLY don't like the idea of having a get_ticks function, because for sure people will use this instead of the desired interface to the timer because it is "better". Then we get back into a mess. Since in most cases get_ticks is one or two instructions, please, let us hide them in init_timeout/is_timeout. An alternate approach, which has the merit of being more like the originally intended interface, simply disallows reset_timer since it is totally unnecessary. The only dis-advantage of the original approach using just get_timer is that the conversion to ms must be considered at each call to get_timer, and will require at a minimum one 32 bit integer to remember the hardware timer value the last time get_timer was called (unless the hardware time can be trivially converted to a 32 bit value in ms, which is quite uncommon). This is not a high price to pay, and matches the current usage. This is probably for Mr. Denk to decide. If we were just starting now, the init_timeout/is_timeout is simpler, but since we are not, perhaps keeping the current approach has value. I would really like to help by providing some patches, but I am just way too busy at present. Best Regards, Bill Campbell > Reinhard > > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] Timer implementations
On 10/27/2010 11:02 PM, Reinhard Meyer wrote: > Dear J. William Campbell, >> Hi All, >> >> I am pretty sure the migration to 64 bits was caused by 1) people not >> understanding that the timer operating on time DIFFERENCES would work >> fine even if the underlying timer wrapped around (most probable >> problem) and possibly 2) broken timer functions causing bogus >> timeouts, improperly "fixed" by switching to 64 bits. >> >> I think u-boot could get along just fine with only 2 time related >> functions, uint_32 get_timer(uint_32 base) and udelay(uint 32 delay). >> udelay will only work on "small" values of delay, on the order of >> milliseconds. It is to be used when a "short" but "precise" delay in >> microsecond resolution is required. Users of get_timer must >> understand that it is only valid if it is called "often enough", i.e. >> at least once per period of the underlying timer. This is required >> because u-boot does not want to rely on interrupts as a timer update >> method. Therefore, all uses of get_timer must 1) call it once >> initially to get a start value, and 2) call get_timer at least once >> per period of the underlying hardware counter. This underlying period >> is guaranteed to be at least 4.29 seconds (32 bit counter at 4 GHz). >> Note that this does NOT mean that the total wait must be less than >> 4.29 seconds, only that the rate at which the elapsed time is TESTED >> must be adequate. >> >> In order to implement this functionality, at least one hardware timer >> of some kind is required. An additional software "timer" in 1 ms >> resolution may be useful in maintaining the software time. If the >> hardware timer update rate is programmable, u-boot MAY set the update >> rate on initialization On initialization, u-boot MAY reset the >> hardware timer and MAY reset any associated software timer. The >> hardware timer MAY be started on initialization. On each call to >> get_timer(), u-boot MUST start the hardware timer if it was not >> started already. On calls to get_timer, u-boot MUST NOT reset the >> hardware timer if it was already started. The software timer MAY be >> reset if u-boot can unambiguously determine that more than 4.29 >> seconds has elapsed since the last call to get_timer. >> >> The simplest case for implementing this scheme is if two programmable >> timers exist that can be set to 1ms and 1us. The timers are >> initialized at start-up, get_timer just returns the 32 bit 1 ms timer >> and udelay just waits for the number of ticks required on the second >> timer to elapse. The most common harder case is where there is only >> one timer available, it is running at 1 us per tick or faster, and we >> cannot control the rate. udelay is still easy, because we can convert >> the (small) delay in us to a delay in ticks by a 32 bit multiply that >> will not overflow 32 bits even if we have quite a few fractional bits >> in the tics per microsecond value. The elapsed ticks required is the >> (delay in us * us/per tick) >> # fractional bits in us/per tick. If >> that is not close enough for you, you can do it as (delay in us * >> (integer part of us/tick)) + ((delay in us * (fractional >> part)us/tick) >> # fraction bits). For "nice" numbers, like any >> integral number of MHz, there is no fractional > >> part. Only numbers like 66 MHz, or 1.666 GHz require messing with the >> fractional part. >> For get_timer, it is a bit harder. The program must keep two 32 bit >> global variables, the timer reading "last time" and the software >> timer in 1 ms resolution. Whenever get_timer is called, it must >> increase the software timer by the number of ms that have elapsed >> since the previous update and record the corresponding timer reading >> as the new "last time". Note that if the number of ms elapsed is not >> an integer (a common case), the value recorded as the "last time" >> must be decreased by the number of ticks not included in the 1 ms >> resolution software timer. There are many different ways to >> accomplish update, depending on what hardware math capabilities are >> available, and whether one thinks efficiency is important here. >> Conceptually, you convert the elapsed time in ticks into an >> equivalent number of ms, add that number to the software timer, store >> the current value of the hardware timer in last time, and subtract >> any "remainder" ticks from that value. If the
Re: [U-Boot] Timer implementations
On 10/26/2010 6:33 AM, Reinhard Meyer wrote: > Dear Wolfgang Denk, >>> Then the define CONFIG_SYS_HZ should not be in every.h since that >>> suggests that a board developer has some freedom there... >> Agreed - there are historical reasons this has ever been changable at >> all. >> >>> and MOST IMPORTANT that some implementations of udelay() might >>> call reset_timer() and therefore break an outer timeout loop !!! >> Such implementations are inherently broken and need to be fixed. > Found such in arm926ejs/omap... But then, that timer is multiple-broken: > relocation broken (uses static data), returns 32 a bit value in get_ticks(), > returns CONFIG_SYS_HZ in get_tbclk() instead of the rate get_ticks() > increments... > > PXA: > void udelay_masked (unsigned long usec) > { > unsigned long long tmp; > ulong tmo; > > tmo = us_to_tick(usec); > tmp = get_ticks() + tmo;/* get current timestamp */ > > while (get_ticks()< tmp) > /* loop till event _OR FOREVER is tmp happens to be> 32 bit_ */ >/*NOP*/; > > } > > unsigned long long get_ticks(void) > { > return readl(OSCR); > } > - not any better :( -- its the same code that AT91 had before I fixed it. > >>> It is also open if reset_timer() does actually reset the hardware timer >>> (e.g. tbu/tbl at PPC) - which would be messing up any time difference >>> calculation using get_ticks() - or does emulate that by remembering >>> the hardware value and subtracting it later in every subsequent >>> get_timer() call? >> This is an implementation detail. > IF we require that get_ticks() and get_timer() shall not interfere with > each other and IF both are based on the same hardware timer only the > second method is available (same if the hardware timer is not easyly > resettable). > >>> 2. get_ticks() and friends operate at a higher rate (tbu/tbl for PPC). >>> Since they are defined as having 64 bits they MUST not wrap at 32 bits, >>> i.e. if the hardware provides only 32 bits, the upper 32 bits must be >>> emulated by software. >> Right. >> >>> Otherwise we have to document that get_ticks() cannot be used to get >>> 64 bit time differences. >> No. Such an implementation is broken and needs fixing. > Original AT91 timer.c was like that, and I think other ARMs where this was > copied around should be looked at... I don't know when get_timer() became > 64 bits, but it seems that some implementations just did change the return > type: uint64 get_timer(void) {return (uint64)timer_val_32;} Hi All, I am pretty sure the migration to 64 bits was caused by 1) people not understanding that the timer operating on time DIFFERENCES would work fine even if the underlying timer wrapped around (most probable problem) and possibly 2) broken timer functions causing bogus timeouts, improperly "fixed" by switching to 64 bits. I think u-boot could get along just fine with only 2 time related functions, uint_32 get_timer(uint_32 base) and udelay(uint 32 delay). udelay will only work on "small" values of delay, on the order of milliseconds. It is to be used when a "short" but "precise" delay in microsecond resolution is required. Users of get_timer must understand that it is only valid if it is called "often enough", i.e. at least once per period of the underlying timer. This is required because u-boot does not want to rely on interrupts as a timer update method. Therefore, all uses of get_timer must 1) call it once initially to get a start value, and 2) call get_timer at least once per period of the underlying hardware counter. This underlying period is guaranteed to be at least 4.29 seconds (32 bit counter at 4 GHz). Note that this does NOT mean that the total wait must be less than 4.29 seconds, only that the rate at which the elapsed time is TESTED must be adequate. In order to implement this functionality, at least one hardware timer of some kind is required. An additional software "timer" in 1 ms resolution may be useful in maintaining the software time. If the hardware timer update rate is programmable, u-boot MAY set the update rate on initialization On initialization, u-boot MAY reset the hardware timer and MAY reset any associated software timer. The hardware timer MAY be started on initialization. On each call to get_timer(), u-boot MUST start the hardware timer if it was not started already. On calls to get_timer, u-boot MUST NOT reset the hardware timer if it was already started. The software timer MAY be reset if u-boot can unambiguously determine that more than 4.29 seconds has elapsed since the last call to get_timer. The simplest case for implementing this scheme is if two programmable timers exist that can be set to 1ms and 1us. The timers are initialized at start-up, get_timer just returns the 32 bit 1 ms timer and udelay just waits for the number of ticks required on the second timer to elapse. The most common harder case is where there is only one timer
Re: [U-Boot] [PATCH v2] mmc: omap: timeout counter fix
On 10/25/2010 11:01 PM, Reinhard Meyer wrote: > Dear Wolfgang Denk, >> Dear Reinhard Meyer, >> >> In message<4cc66a67.4000...@emk-elektronik.de> you wrote: It fails in case the timer wraps around. Assume 32 bit counters, start time = 0xFFF0, delay = 0x20. It will compute end = 0x10, the while codition is immediately false, and you don't have any delay at all, which most probably generates a false error condition. >>> I used and assumed a 64 bit counter, that will not wrap around while >>> our civilization still exists... >> >> The code is still wrong, and as a simple correct implementation exists >> there is no excuse for using such incorrect code. >> >> Please fix that! > Agreed here. People are invited to dig through u-boot and find all > those places. > >>> If get_ticks() is only 32 bits worth, both methods will misbehave >>> at a 32 bit wrap over. >> No. >> start = time(); while ((time() - start)>> True, provided the underlying timer is really 64 bits, otherwise >>> this fails, too... > > You are wrong. Try for example this: > > > > - snip --- > > #include > > > > int main(void) > > { > > unsigned int time = 0xFFF0; > > unsigned int delay = 0x20; > > unsigned int start; > > You are wrong here, because you take it out of context. > My "demo" is using the (declared as) 64 bit function get_ticks(). > I mentioned above that this function MUST be truly returning 64 > bits worth of (incrementing) value to make any version work. > If get_ticks() just returns a 32 bit counter value neither method will work > reliably. Just check all implementations that this function is implemented > correctly. Hi All, I have wondered for quite some time about the rush to make get_ticks() return a 64 bit value. For any "reasonable" purpose, like waiting a few seconds for something to complete, a 32 bit timebase is plenty adequate. If the number of ticks per second is 10, i.e. a 1 GHz clock rate, the clock wraps in a 32 bit word about every five seconds. The trick is that time always moves forward, so a current get_ticks() - a previous get_ticks() is ALWAYS a positive number. It is necessary to check the clock more often than (0X1 - your_timeout) times per second, but unless your timeout is very near the maximum time that fits into 32 bits, this won't be a problem. Most CPUs have a counter that count at a "reasonable" rate. Some CPUs also have a "cycle counter" that runs at the CPU clock rate. These counters are useful to determine exactly how many machine cycles a certain process took, and therefore they have high resolution. Timers for simple delays neither need nor want such resolution. If the only counter available on you CPU runs at several GHz, and is 64 bits long, just shift it right a few bits to reduce the resolution to a "reasonable" resolution and return a 32 bit value. There is no need for a bunch of long long variables and extra code running around to process simple timeouts. It may be that we need a different routine for precision timing measurements with high resolution, but it needn't, and probably shouldn't IMHO be get_ticks(). Best Regards, Bill Campbell > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH 2/2] ARM: fix relocation support for onenand device.
On 10/23/2010 1:56 PM, Wolfgang Denk wrote: > Dear Enric Balletbo i Serra, > > In message<1287479602-21721-3-git-send-email-eballe...@iseebcn.com> you > wrote: >> We also have to relocate the onenand command table manually, otherwise >> onenand command don't work. >> >> Signed-off-by: Enric Balletbo i Serra >> --- >> arch/arm/lib/board.c |3 +++ >> common/cmd_onenand.c |6 ++ >> 2 files changed, 9 insertions(+), 0 deletions(-) Is this patch still necessary? I thought the relocation change made it OBE. Best Regards, Bill Campbell > Applied, thanks. > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH] mpc83xx: Add -fpic relocation support
On 10/12/2010 11:30 PM, Albert ARIBAUD wrote: > Le 12/10/2010 23:00, Joakim Tjernlund a écrit : > >> Yes, but the difference isn't really the arch. It is the -mrelocatable >> flag that is the big difference. > Not only: obviously, implementing GOT relocation is not done the same on > both archs, and it simply is not beneficial on ARM wrt PPC in terms of > instructions. I did a pretty extensive run of tests with and without > -fPIC and -fPIE on ARM, and GOT relocation clearly makes code bigger, > whereas it does not PPC. > > This simply implies that -fPIC is a better choice for PPC (and hence > -mrelocatable) while -fpie is a better one for ARM. Hi All, In particular, the PPC takes two 32 bit instructions to load the known address of a variable into a register. If the GOT is used, a single 32 bit instruction can load the address of a variable from the GOT table (pointed to by a "fixed" register) into a register. In both cases, there are two memory cycles, but in the GOT case, only one instruction is required. This is why the GOT based code is smaller. However, the GOT cannot be used to address constants and some other items that are not "variables". I do think that -fPIC and -fpie are not mutually incompatible. On the PPC, the GOT references would be relocated in the loop that updates the GOT and the references to constants would be relocated by the ELF relocation code. That is how shared libraries are relocated. Best Regards, Bill Campbell > Amicalement, ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH] futile c relocation attempt
On 10/6/2010 2:43 AM, Graeme Russ wrote: > On 06/10/10 01:48, Reinhard Meyer wrote: >> --- >> arch/arm/cpu/arm926ejs/start.S |8 - >> arch/arm/lib/board.c | 57 >> +++- >> include/configs/top9000_9xe.h |1 + >> 3 files changed, 63 insertions(+), 3 deletions(-) >> > I had a quick look at this and nothing is jumping out at me. Of course I am > not familiar with ARM asm... > > I don't see any reason why this ultimately will not work eventually. You > may be having some issues with the transition from asm->C->asm through the > relocation - This was an especially painful thing for me involving an > intermediate trampoline which I have only recently figured out how to remove. > > Maybe some memory barriers are needed to stop the C optimiser mangling things? > > I am sure what you have is very close to the real solution :) > > I do think the main relocation fixup loop can be moved into a common > library in which case we can add additional case statements. The nice thing > is that x86 as all Type 8 which is specifically allocated to x86 so my "if Hi All, I think that type 8 IS NOT allocated to the 386. For instance, R_PPC_ADDR14_BRTAKEN also has a value of 8. So does R_ARM_ABS8. I think that there will be a lot of #ifdefs just to keep the references symbolic. Many of the different platform relocation types will come out to be the same code in the switch statement, but different symbolic names. There will also be some entries that are processor specific and have no equivalent on other processors. I think it would be a good idea to use the symbolic values of the Relocation types (as opposed to integer constants), as it will make the code clearer. There are sort of two ways to organize the code inside the switch statement. Since the code inside the switch statement is very short, it might be best for each architecture (ELF format) to be bunched together, even at the expense of repeating the same executable statements that some other formats may use, as follows: #ifdef PPC case R_PPC_ADDR32: /* S + A */ break; case R_PPC_RELATIVE: /* B + A */ break; #endif #ifdef I386 case R_386_32: /* S + A */ /* I think this is the other location type Graeme used, but I could be wrong */ break; case R_386_RELATIVE: /* B + A */ break; #endif #ifdef ARM case R_ARM_ABS32: /* S + A */ /* I don't remember the ARM relocation types used */ break; case R_ARM_REL32: /* B + A - P */ break; #endif Or we could group the various Relocation types by what they actually do: #ifdef PPC case R_PPC_ADDR32: /* S + A */ #endif #ifdef I386 case R_386_32: /* S + A */ /* I think this is the other location type Graeme used, but I could be wrong */ #endif #ifdef ARM case R_ARM_ABS32: /* S + A */ /* I don't remember the ARM relocation types used */ #endif break; #ifdef PPC case R_PPC_RELATIVE: /* B + A */ #endif #ifdef I386 case R_386_RELATIVE: /* B + A */ #endif break; #ifdef ARM case R_ARM_REL32: /* B + A - P */ break; #endif Note that the ARM_REL32 is defined differently than the PPC/I386 relative, FWIW. I also don't know what to use for the names of the binary formats. It would be nice if we could use something already in the header files? Thoughts on all this solicited! Best Regards, Bill Campbell >> TEXT_BASE" checks can be kept. For size freaks, we could litter the code > with #ifdefs to remove un-needed cases ;) > > Interestingly, ARM is adding gd->reloc_off while x86 is subtracting > gd->reloc_off. If this is correct, I need to change the calculation of > gd_reloc_off to be consistent Adding is the way the specifications define it. add B+A. Best Regards, Bill Campbell > Regards, > > Graeme > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] [PATCH] arm: arm926ejs: use ELF relocations
On 10/4/2010 10:30 PM, Wolfgang Denk wrote: > Dear Albert ARIBAUD, > > In message<4caa50aa.3000...@free.fr> you wrote: >> Remember: this patch only applies to boards which boot from NOR FLASH! >> You can test it on other types of boards (NAND-based, etc) for >> regression testing, but nothing more. > Assuming the NAND loder does not load U-Boot to it's final location > at the upper end of RAM, but - say - somewhere in lower memory, the > standard relocation preocess will be running, so I think there should > be no real difference between (such) NAND booting systems and NOR > booting ones - or am I missing something? FWIW I think you are right. If u-boot is linked for the address where the NAND loader put it, everything should work fine. It can size memory, move a copy of u-boot to the top of memory, and branch to the entry point that continues initialization. Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [RFC] [PATCH] arm: arm926ejs: use ELF relocations
On 10/4/2010 5:16 PM, Albert ARIBAUD wrote: > Le 05/10/2010 01:21, Graeme Russ a écrit : >> On Tue, Oct 5, 2010 at 9:57 AM, Albert ARIBAUD >> wrote: >>> Le 05/10/2010 00:22, Graeme Russ a écrit : On Tue, Oct 5, 2010 at 9:01 AM, Albert Aribaud wrote: >>> The output from MAKEALL is curiously calculated... If I look at objdumps of >>> the GOT and ELF binaries, I find that: >>> >>> - the GOT .text section is 118960 bytes and the ELF .text section only >>> 108112. This is due to the fact that GOT relocation requires additional >>> instruction for GOT indirection whereas ELF relocations work by patching the >>> code. >> It would be interesting to compare against the basline non-relocatable >> version > I #defined CONFIG_RELOC_FIXUP_WORKS and removed -pie from the ARM > config.mk. This puts the edminiv2 code in the non-reloc build case, and > produces identical .text and .data, and almost identical .rodata, as the > ELF case. > >>> - the .rodata section is 22416 for GOT, 22698 for ELF, whereas the .data >>> section is 2908 for GOT, 2627 for ELF. Some initialized data apparently >>> moved from non-const ton const for some reason, but basically, initialized >>> data remains constant. >>> >>> - the .bss section remains constant too, 16640 for GOT vs. 16636 for ELF. >>> I'm not going to track what causes the 4 byte difference. :) >>> >>> Many sections are output in the ELF file which do not appear in the GOT >>> file, such as .interp, .dynamic, .dynstr etc. They probably pollute >>> MAKEALL's figures. >> I now discard a few sections: >> >> /DISCARD/ : { *(.dynstr*) } >> /DISCARD/ : { *(.dynamic*) } >> /DISCARD/ : { *(.plt*) } >> /DISCARD/ : { *(.interp*) } >> /DISCARD/ : { *(.gnu*) } >> >> Not that it makes a huge difference - most of these are trivially small > Thanks. I'll add this to the .lds as a measure of clarity. > >>> That's roughly consistent with the numbers I get: about 19 KB of .rel.dyn >>> plus .dynsym, which we will be able to cut by half if we preprocess it. >> Which is not copied to RAM, so not as nasty as the .got related increase > True also. Note that we could probably shrink the table to 1/4 of its > current size by taking advantage from the fact that the few > non-program-base-relative relocations it has can easily be converted to > program-base-relative, and that two consecutive relocations are always > less than 64 KB away from each other. Of course that moves away from > using the ELF structures as-is, and requires additional build steps, but > people with small FLASH devices may want it. Hi All, This may be pushing it in the more general case. ARM has only a few relocation types. Other CPU types have more types, and therefore still may need a type field. You can certainly get 1/2 in all cases, and more if you are willing to get a bit more complex in the preprocessing. That said, I think this is best left to later when all CPUs are in the relocatable state. >> I'm also looking at moving the low-level intialisation and relocation code >> into a seperate section (outside .text) so I even less to relocate to RAM > As Wolfgang pointed out, there might be issues in that all the code that > runs in FLASH should be truly PI, which might not be a piece of cake. > ARM C code, for instance, tends to generate literals which need to be > relocated if you don't run the code where it was linked for. True, but the code WILL be running at the address it was linked for. It just won't be copied and relocated to the "new" address, as it would never be run again anyway. This goal is along the lines of the "two stage u-boot" that has been/is being considered, where all execute only once code can be concentrated into a segment that is not moved into ram. Bill Campbell >> Then I could even compress the relocatable section, but that is just being >> silly ;) > :) > >> Regards, >> >> Graeme > Amicalement, ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] ARM relocation, question to Heiko
On 10/4/2010 10:06 AM, Wolfgang Denk wrote: Dear "J. William Campbell", In message<4ca9f294.8080...@comcast.net> you wrote: Yes, I think Wolfgang is correct. This is not going to be easy to do in general. To run anywhere, the code must be true Position Independent code. If you intend to use any C code in the initialization, this will result in needing -fPIC for at least that code. I am not sure you can mix -fPIC and non -fPIC code in the same link, but I expect not. I am a bit surprised that it is possible to get even the initialization code to be Position Independent, but it appears that on at least some PPC it is possible/has been done. Not really. On PowerPC, only the first 20 or 30 lines of assembler code in start.S are position independent; then we compute the link (resp. execution) address and branch to it. From then, we run from the very address range we were linked for (starting at TEXT_BASE). Hi Wolfgang, You are of course correct. I was referring more to Jocke's (joakim.tjernl...@transmode.se) statements regarding: Yes, that is there today. I am talking about linking to any TEXT_BASE(say 0) but burn and run into another address. I impl. this quite some time ago for PPC(search for LINK_OFF) I understand from his comment that he had achieved total PIC for the initialization, that would run at any location regardless of TEXT_BASE. I think this code was not accepted into mainline, so it is not a problem at present. However, any relocation code added would have to be modified by Jocke if he wished to preserve that capability. I am amazed that he was able to get the rest of u-boot to work under the constraints you pointed out. It must have been quite tedious. I also wish to support Graeme's desire that the added relocation code at the end of the day be written in C. The routine to do the relocation does not require .bss and is not real long. The obvious advantage of this approach is that all architectures can use it. The ELF relocation codes will have to be changed to the architecture equivalents, and in some casesarchitecture specific relocation code processing added, but the theory will always be the same. This approach will make using relocation much easier/trivial for new architecture ports, thereby reducing resistance to doing it! Best Regards, Bill Campbell Albert is doing now, and Graeme did before, is the first option, creating a loader that understands ELF. This has the advantage that it will work on all architectures. However, once this understanding is in place, it would be easy to write a small post-processing program that would reduce the size of the relocation entries, much like -mrelocatable does. This may or may not be necessary, but it is certainly possible. Eventually we might even add -mrelocatable support for the other architectures to the tool chain. Best regards, Wolfgang Denk ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] ARM relocation, question to Heiko
On 10/4/2010 3:13 AM, Wolfgang Denk wrote: > Dear Albert ARIBAUD, > > In message<4ca999ee.5030...@free.fr> you wrote: >> Note however that linking for base address 0 is not mandatory for >> achieving true position independence. What is required is that the code >> which runs from power-up until relocation be able to run anywhere, i.e., >> this code should not require any relocation fixup. That can be achieved >> on ARM by using only relative branches and accessing data only relative >> to pc (e.g. literals) or truly absolute (e.g. HW registers etc). > That means you need to build all of U-Boot that way, because > significant parts of the code already run before relocation > (including all clocks and timers setup, console setup, printf and all > routines these pull in). > Yes, I think Wolfgang is correct. This is not going to be easy to do in general. To run anywhere, the code must be true Position Independent code. If you intend to use any C code in the initialization, this will result in needing -fPIC for at least that code. I am not sure you can mix -fPIC and non -fPIC code in the same link, but I expect not. I am a bit surprised that it is possible to get even the initialization code to be Position Independent, but it appears that on at least some PPC it is possible/has been done. On a related topic, I did find some information on the -mrelocatable history. Take a look at http://www.mail-archive.com/g...@gcc.gnu.org/msg02528.html. If you read both thread entries, it explains -mrelocatable as more or less the post-processor that re-formats the ELF relocation information into a smaller format and puts it in the text as another segment. What Albert is doing now, and Graeme did before, is the first option, creating a loader that understands ELF. This has the advantage that it will work on all architectures. However, once this understanding is in place, it would be easy to write a small post-processing program that would reduce the size of the relocation entries, much like -mrelocatable does. This may or may not be necessary, but it is certainly possible. Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] ARM relocation, question to Heiko
On 10/3/2010 11:29 AM, Wolfgang Denk wrote: > Dear Reinhard Meyer, > > In message<4ca79896.2010...@emk-elektronik.de> you wrote: >> I agree here. _If_ relocation, it should work without hand-adding >> fixup stuff to all functions using initialized data with pointers. >> Even Wolfgang forgot to fixup his 2nd level command table in >> cmd_nvedit.c ;) > I didn't forget it - at least not in the sensse that I think this is > something that needs to be done. > > This works fine on PPC with relocation, and we should make it work > the same on other arches. > >> And, for space concerns in flash, relocation should always be an >> option on a board by board basis... > NAK. > >> And as an idea, if position independent code is used, only pointers >> in initialized data need adjustment. Cannot the linker emit a table >> of addresses that need fixing? > It does. That's the GOT. I think this is actually a misunderstanding. The purpose of the GOT, at least from GCC's point of view, is to hold the absolute addresses of private data referenced by shared library code. That is what it was invented to do. This is similar to, but not identical with, relocating all data references. Initialized data in the library must have a copy created (and relocated as necessary if it contains pointers) by the runtime linker when the library is initialized in the address space of the process using the library. The code in the shared library is -fPIC, but it still needs the runtime linker to allocate a copy of the GOT for the current user AND to allocate and relocate any data that is required for the library that is private to the user. It is that second step where we have trouble. Best Regards, Bill Campbell > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] ARM relocation, question to Heiko
On 10/3/2010 11:14 AM, Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message<4ca75bfb.5030...@comcast.net> you wrote: >>>>> And I think there are more places of this type in u-boot... >>>> Yes, maybe. But relocation as I did for arm, also works >>>> on m68k, sparc, mips, avr32 and they must do also this >>>> fixups, so for common functions (except the new env handling, >>>> which I think got never tested on this architectures?) should >>>> work ... >>> This pointer problem is solved with the fixup relocs on ppc and >>> should work without manual relocation. I think this is a ppc >>> only extension but I might be wrong. >> You are correct that this is a ppc only extension. As such, it is >> not a good candidate for "general" use. > On contrary. > > If it works for PPC, then there should be ways to do the same on other > architectures. > Well, maybe so, but GCC won't do it now, and there has been no move by other architectures to adopt this capability. I suspect that it is extremely unlikley that this capability will ever be ported to other architectures since it has been available for so long on PPC without any movement to other systems. >> Using the full relocation scheme eliminates the need for all these >> "fixups" in u-boot C code. I think this is a very desirable result. >> It is also not clear to me that hard coding in the relocation as several >> C routines will produce a u-boot that is "smaller" than the one >> produced by using normal ELF relocation. However, using full relocation >> creates an environment that is true "C" and does not rely on people >> remembering that they may have to fix up some parts of their code. It is >> hard to see much downside in using the full relocation capability >> provided by Graeme's code. > Agreed. But if we take this path, we need to find an implementation > that looks clean and readable. Agreed. This should be possible to do now that there is a better understanding of the ELF format by the u-boot community. Perhaps the place to start would be trying to port what Graeme has done to ARM or perhaps better yet, PPC. Since lots of people on this list are PPC folks, we should have a lot of leverage there. Best Regards, Bill Campbell > > Best regards, > > Wolfgang Denk > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] ARM relocation, question to Heiko
On 10/3/2010 1:58 AM, Albert ARIBAUD wrote: > Le 03/10/2010 10:44, Graeme Russ a écrit : > >>> Bill just said that -pic (or, for ARM, -fPIC or -fPIE) was unnecessary >>> for relocation. You seem to imply it actually is... In my experience, >>> -fPIC and-fPIE do increase code by adding GOT relocation to symbols >>> that >>> need fixing, so they would indeed be redundant to any other relocation >>> mechanism -- I just did some test with basic code and this seems to >>> confirm, no -fPIx is needed to get relocation the way you do on ARM. >> >> Just to clarify -fpic is a compiler option, -pic is a linker option. x86 >> has no compile time relocation options (therefore no referencing .got >> etc). >> Using the link time pic option produces the relocation data table >> (.rel.dyn) which must be pre-processed before execution can begin at the >> relocated address > > Thanks for clarifying, Graeme. > > This is consistent with the ARM compile-time options -fPIC/-fPIE vs > link-time option -pie. So there may be at least an interest in > investigating ELF-style relocation on ARM and comparing it to > GOT-based relocation in terms of FLASH and RAM sizes and code speed. > Hi All, It is for sure that -fPIC/-fPIE programs will contain more executable instructions than programs compiled without these options. The program will also contain more data space for the got. If -fPIC actually produced a fully position-independent executable, the extra overhead would perhaps be tolerable. However, since it does not do this, (problems with initialized data etc.) there is really no advantage in using these compile-time options. The executable code and required data space for the program without these switches will "always" be smaller and faster than with them. In order to fix the remaining issues even when using -fPIC, a relocation loop must exist in the u-boot code, either one global one or a bunch of user written specific ones. Also, the -pie switch will be needed anyway at link time to build the relocation table for the remaining relocation requirements. Programs compiled without -fPIC will have a larger .rel.dyn table than those compiled with -fPIC. However, the table entries in the relocation table occupy about the same storage as the code generated by the compiler to relocate a reference to the symbol at run time. So this is probably a almost a wash. Also, the dynamic relocation data need not be copied into the run-time object, as it is no longer needed. So the likely outcome is that the "flash" image is about the same size/slightly larger than the one compiled by -fPIC, and that the ram footprint after relocation is slightly smaller. If one is REALLY pressed for space, the size of the dynamic relocation area can be reduced by a post-processor program that would re-format the relocation entries. This re-formatting is possible because 1) ELF is a very general format and we only need a small subset of it, and 2) u-boot code will never occupy say 16 MB of space, so each relocation can probably be compressed into a 32 bit word. I doubt anyone is that desperate, but it IS possible. It will be interesting to see what the results of this comparison are. For me, the no user awareness of relocation is worth a lot, and the fact that the difference/overhead of relocation will all be in exactly one place is very appealing. Best Regards, Bill Campbell >> Cheers, >> >> Graeme > > Amicalement, ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] ARM relocation, question to Heiko
On 10/2/2010 3:17 AM, Joakim Tjernlund wrote: Hello Reinhard, Reinhard Meyer wrote: Dear Albert ARIBAUD, I try to understand how the relocation process could handle pointers (to functions or other data) in const or data sections. Your code cannot know what is data and what is a pointer that needs adjustment? Best Regards, Reinhard Hi Reinhart, Short answer - the relocation process does not handle pointers inside data structures. And yes, this means the content arrays of pointers such as init_sequence is not relocated. Been there, done that, can give you one of the The init_sequence should not called anymore after relocation, as it is the init_sequence ... or? tee-shirts I got :) ATM I have not found a way to fix this, except making the code which uses the pointers aware that the are location-sensitive and fix them when using them. That means that things like this cannot work (with relocation), unless adding the relocation offset before using the pointer: Yep, you have to fix these pointers after relocation ... const struct { const u8 shift; const u8 idcode; struct spi_flash *(*probe) (struct spi_slave *spi, u8 *idcode); } flashes[] = { #ifdef CONFIG_SPI_FLASH_SPANSION { 0, 0x01, spi_flash_probe_spansion, }, #endif [...] #ifdef CONFIG_SPI_FRAM_RAMTRON { 6, 0xc2, spi_fram_probe_ramtron, }, # ifdef CONFIG_SPI_FRAM_RAMTRON_NON_JEDEC { 0, 0xff, spi_fram_probe_ramtron, }, # endif # undef IDBUF_LEN # define IDBUF_LEN 9 /* we need to read 6+3 bytes */ #endif }; And I think there are more places of this type in u-boot... Yes, maybe. But relocation as I did for arm, also works on m68k, sparc, mips, avr32 and they must do also this fixups, so for common functions (except the new env handling, which I think got never tested on this architectures?) should work ... This pointer problem is solved with the fixup relocs on ppc and should work without manual relocation. I think this is a ppc only extension but I might be wrong. Hi All, You are correct that this is a ppc only extension. As such, it is not a good candidate for "general" use. I believe that the other alternative is to do it as x86 does which I think is the general way which should work on any arch. Graem Russ would know better. Almost exactly a year ago, this was all pretty much presented by Graeme in the threads Relocation size penalty calculation (October 14, 2009) i386 Relocation (November 24, 2009) Using the full relocation scheme eliminates the need for all these "fixups" in u-boot C code. I think this is a very desirable result. It is also not clear to me that hard coding in the relocation as several C routines will produce a u-boot that is "smaller" than the one produced by using normal ELF relocation. However, using full relocation creates an environment that is true "C" and does not rely on people remembering that they may have to fix up some parts of their code. It is hard to see much downside in using the full relocation capability provided by Graeme's code. FWIW, the relocation code and data does not have to be moved into ram if space is at a premium. Best Regards, Bill Campbell Jocke ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH] flash.h: pull in common.h for types
Mike Frysinger wrote: > On Tuesday 17 November 2009 16:56:58 Wolfgang Denk wrote: > >> Scott Wood wrote: >> My question: is there a definitive position somewhere (for example for the Linux kernel; I'm sure we don't have one for U-Boot [yet]), whether system headers should be self-sufficient? >>> I'd say they should be self-sufficient, in that the inclusion of the >>> header itself should not fail if I haven't included some arbitrary other >>> header. I don't see what the argument would be for not doing this. >>> >> Well, Theo de Raadt says for example "... people would be able to >> include less files; indeed, almost be careless about what they >> include. But this would not increase portability in any way. And >> 'make build' would probably, if it was taken the nth degree, take >> twice as long. Therefore there is no benefit for the crazy rule you >> suggest..." - see >> http://www.mail-archive.com/t...@openbsd.org/msg00425.html >> > > i disagree with this using, ironically, the same base logic, but a different > conclusion: > http://sourceware.org/ml/libc-alpha/2006-08/msg00064.html > > also, i think a self contained system like u-boot which has full control at > the api level can be better at this than a user interface which really sits > on > top of an abi and has to deal with a lot of crap from user code. > > while i'm not asking for you or anyone else to audit header paths here as i > think that level of enforcement will bog things down, small patches from > people who choose to fix things should be merged. > > FWIW, I think one needs to be very careful with this reasoning. It is clear that experienced and capable programmers disagree on the correct approach to this problem. It is also true that the logical structure of the include chain is important. A "crap" interface is going to be hard to maintain no matter how you do it. The problem is, "crap" has a way of sneaking into well-designed interfaces in the form of small patches to "fix" things. (I am not saying this is one of them). I have observed that in any large program, as time goes on, more and more things get included in more and more places. Hiding this fact by doing the inclusion in other header files often obscures the drift towards a very polluted name space, where editing just about any include file requires the entire system to be rebuilt. Some would say, so what, it is easy to do. However, these situations often result in hard to find bugs. The question of "which modules can access this variable" becomes "all of them" because of (badly designed?) interfaces being included in other interfaces without the user being explicitly aware of this. Often I see in patch critiques the statement "This belongs in a header file", which may be true, but can lead to a bunch of un-related things being stuck together in a header file just to meet the requirement. Those bad choices then get included in other header files and we are off to the races. Since the users are not explicitly aware they are pulling in a bunch of stuff that has nothing to do with the main purpose of the module being written, there is often no incentive to clean up bad header designs until things get so bad it is almost impossible to untangle things. So when adding an include to an include file, think hard whether you are bringing along a lot of things that are "not required" and that one would not expect to be exposed in all files using the modified include file. If you are bringing along a lot of "baggage", perhaps the interface you are including could use some re-work. Best Regards, Bill Campbell >>> I don't know whether Linux has a specific policy on this, but I haven't >>> noticed many problems in this regard, and when I did find one in the >>> kernel a few years back I didn't get any argument when I submitted a >>> patch to fix it. >>> > > ive semi-frequently post fixes to linux headers so that you can include just > that header and have it work. i have yet to hear anyone complain; rather > every one has been merged (ignoring issues unrelated to the original purpose). > > >>> Which man pages are you looking at? >>> >> Well, for example: >> >> open(2): >> SYNOPSIS >> #include >> #include >> #include >> >> mknod(2): >> SYNOPSIS >> #include >> #include >> #include >> #include >> >> stat(2): >> SYNOPSIS >> #include >> #include >> #include >> >> Why do we need these lists of #includes? WHy doe - for example - >> not auto-include anything it might need? >> >> To me this seems to be an indication that there is no intention to >> make headers self-sufficent, but I am absolutely not sure. >> > > i'm pretty sure your man page example is an unrelated issue. the include > list > does not imply that any one of those headers ca
Re: [U-Boot] Relocation size penalty calculation
Graeme Russ wrote: > On Thu, Oct 15, 2009 at 3:45 AM, J. William Campbell > wrote: > >> Joakim Tjernlund wrote: >> > > Apologies if this is getting way off-topic for a simple boot loader, but > this is information I have gathered from far and wide over the net. I am > surprised that there isn't a web site out there on 'How to create a > relocatable boot loader'... > > OK, its all starting to come together now - It helps when you look at the > right files ;) > > Firstly, u-boot.map > > 0x380589a0__rel_dyn_start = . > > .rel.dyn0x380589a0 0x42b0 > *(.rel.dyn) > .rel.got 0x0x0 cpu/i386/start.o > .rel.plt 0x0x0 cpu/i386/start.o > .rel.text 0x380589a0 0x2e28 cpu/i386/start.o > .rel.start16 0x3805b7c8 0x10 cpu/i386/start.o > .rel.data 0x3805b7d8 0xc18 cpu/i386/start.o > .rel.rodata0x3805c3f0 0x360 cpu/i386/start.o > .rel.u_boot_cmd > 0x3805c750 0x500 cpu/i386/start.o > 0x3805cc50__rel_dyn_end = . > > > And the output of readelf... > > Section Headers: > [Nr] Name TypeAddr OffSize ES Flg Lk Inf > Al > [ 0] NULL 00 00 00 0 0 > 0 > [ 1] .text PROGBITS3804 001000 0118a4 00 AX 0 0 > 4 > [ 2] .rel.text REL 066c68 005d00 08 40 1 > 4 > [ 3] .rodata PROGBITS380518a4 0128a4 005da5 00 A 0 0 > 4 > [ 4] .rel.rodata REL 06c968 000360 08 40 3 > 4 > [ 5] .interp PROGBITS38057649 018649 13 00 A 0 0 > 1 > [ 6] .dynstr STRTAB 3805765c 01865c 0001ee 00 A 0 0 > 1 > [ 7] .hash HASH3805784c 01884c cc 04 A 11 0 > 4 > [ 8] .data PROGBITS38057918 018918 000a3c 00 WA 0 0 > 4 > [ 9] .rel.data REL 06ccc8 000c18 08 40 8 > 4 > [10] .got.plt PROGBITS38058354 019354 0c 04 WA 0 0 > 4 > [11] .dynsym DYNSYM 38058360 019360 000200 10 A 6 1 > 4 > [12] .dynamic DYNAMIC 38058560 019560 80 08 WA 6 0 > 4 > [13] .u_boot_cmd PROGBITS380585e0 0195e0 0003c0 00 WA 0 0 > 4 > [14] .rel.u_boot_cmd REL 06d8e0 000500 08 40 13 > 4 > [15] .bss NOBITS 3805cc50 01ec50 001a34 00 WA 0 0 > 4 > [16] .bios PROGBITS 01e000 00053e 00 AX 0 0 > 1 > [17] .rel.bios REL 06dde0 c0 08 40 16 > 4 > [18] .rel.dyn REL 380589a0 0199a0 0042b0 08 A 11 0 > 4 > [19] .start16 PROGBITSf800 01e800 000110 00 AX 0 0 > 1 > [20] .rel.start16 REL 06dea0 38 08 40 19 > 4 > [21] .resetvec PROGBITSfff0 01eff0 10 00 AX 0 0 > 1 > [22] .rel.resetvec REL 06ded8 08 08 40 21 > 4 > > ... > > Relocation section '.rel.text' at offset 0x66c68 contains 2976 entries: > Offset InfoTypeSym.Value Sym. Name > 38040010 0101 R_386_32 3804 .text > 3804001e 0101 R_386_32 3804 .text > 38040028 0101 R_386_32 3804 .text > 3804003f 0101 R_386_32 3804 .text > 38040051 0101 R_386_32 3804 .text > 38040075 0101 R_386_32 3804 .text > 38040085 0101 R_386_32 3804 .text > 3804009d 0003e602 R_386_PC32380403fa load_uboot > 380400a6 0101 R_386_32 3804 .text > 38040015 00029f02 R_386_PC323804bdd8 early_board_init > 38040023 0003f702 R_386_PC323804bdda show_boot_progress_asm > > ... > > Relocation section '.rel.rodata' at offset 0x6c968 contains 108 entries: > Offset InfoTypeSym.Value Sym. Name > 38051908 0201 R_386_32 380518a4 .rodata > 38051938 0201 R_386_32 380518a4 .rodata > 38051968 0201 R_386_32 380518a4 .rodata > 38051998 0201 R_386_32 380518a4 .rodata > 380519c8 0201 R_386_32 380518a4 .rodata > 380519f8 0201 R_386_32 380518a4 .rodata > > ... > > Relocation section '.rel.dyn' at offset 0x199a0 contains 2134 entries: > Offset
Re: [U-Boot] Relocation size penalty calculation
Joakim Tjernlund wrote: > "J. William Campbell" wrote on 14/10/2009 > 17:35:44: > >> Joakim Tjernlund wrote: >> >>> "J. William Campbell" wrote on 14/10/2009 >>> 01:48:52: >>> >>> >>>> Joakim Tjernlund wrote: >>>> >>>> >>>>> Graeme Russ wrote on 13/10/2009 22:06:56: >>>>> >>>>> >>>>> >>>>> >>>>>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Graeme Russ wrote on 13/10/2009 13:21:05: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund >>>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Graeme Russ wrote on 11/10/2009 12:47:19: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> [Massive Snip :)] >>>>>>>> > > [Yet another SNIP :)] > > >>>>>>> Evil idea, skip -fpic et. all and add the full reloc procedure >>>>>>> to relocate by rewriting directly in TEXT segment. Then you save space >>>>>>> but you need more relocation code. Something like dl_do_reloc from >>>>>>> uClibc. Wonder how much extra code that would be? Not too much I think. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> With the following flags >>>>>> >>>>>> PLATFORM_RELFLAGS += -fvisibility=hidden >>>>>> PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm >>>>>> PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions >>>>>> >>>>>> I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think >>>>>> this might mean I need the symbol table in the binary in order to resolve >>>>>> them >>>>>> >>>>>> >>>>>> >>> BTW, how many relocs do you get compared with -fPIC? I suspect you more >>> now but hopefully not that many more. >>> >>> >>> >>>>> Possibly, but I think you only need to add an offset to all those >>>>> relocs. >>>>> >>>>> >>>>> >>>> Almost right. The relocations specify a symbol value that needs to be >>>> added to the data in memory to relocate the reference. The symbol values >>>> involved should be the start of the text section for program references, >>>> the start of the uninitialized data section for bss references, and the >>>> start of the data section for initialized data and constants. So there >>>> are about four symbols whose value you need to keep. Take a look at >>>> http://refspecs.freestandards.org/elf/elf.pdf (which you have probably >>>> already looked at) and it tells you what to do with R_386_PC32 ad >>>> R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded >>>> will remove all the symbols you don't actually need, but I don't know >>>> that for sure. Note also that you can change the section flags of a >>>> section marked noload to load. >>>> >>>> >>> Still think you can get away with just ADDING an offset. The image is >>> linked to a >>> specific address and then you move the whole image to a new address. >>> Therefore >>> you should be able to read the current address, add offset, write back the >>> >> new address. >> >>> Normally one do what you describe but here we know that the whole img has >>> moved so >>> we don't have to do calculate the new address from scratch. >>> >>> >> If the addresses of the bss, text, and data segments change by the same >> value, I think you are correct. However, if the text and data/bss >> segments are moved by different offsets, naturally the relocations would >> be different. One reason to retain this capability would be to allow the >> u-
Re: [U-Boot] Relocation size penalty calculation
Joakim Tjernlund wrote: > Graeme Russ wrote on 14/10/2009 13:48:27: > >> On Wed, Oct 14, 2009 at 6:25 PM, Joakim Tjernlund >> wrote: >> >>> "J. William Campbell" wrote on 14/10/2009 >>> 01:48:52: >>> >>>> Joakim Tjernlund wrote: >>>> >>>>> Graeme Russ wrote on 13/10/2009 22:06:56: >>>>> >>>>> >>>>> >>>>>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund >>>>>> wrote: >>>>>> >>>>>> >>>>>>> Graeme Russ wrote on 13/10/2009 13:21:05: >>>>>>> >>>>>>> >>>>>>>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund >>>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>>> Graeme Russ wrote on 11/10/2009 12:47:19: >>>>>>>>> >>>>>>>>> >>>>>>>> [Massive Snip :)] >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> So, all that is left are .dynsym and .dynamic ... >>>>>>>>>> .dynsym >>>>>>>>>> - Contains 70 entries (16 bytes each, 1120 bytes) >>>>>>>>>> - 44 entries mimic those entries in .got which are not relocated >>>>>>>>>> - 21 entries are the remaining symbols exported from the linker >>>>>>>>>> script >>>>>>>>>> - 4 entries are labels defined in inline asm and used in C >>>>>>>>>> >>>>>>>>>> >>>>>>>>> Try adding proper asm declarations. Look at what gcc >>>>>>>>> generates for a function/variable and mimic these. >>>>>>>>> >>>>>>>>> >>>>>>>> Thanks - Now .dynsym contains only exports from the linker script >>>>>>>> >>>>>>>> >>>>>>> :) >>>>>>> >>>>>>> >>>>>>>>>> - 1 entry is a NULL entry >>>>>>>>>> >>>>>>>>>> .dynamic >>>>>>>>>> - 88 bytes >>>>>>>>>> - Array of Elf32_Dyn >>>>>>>>>> - typedef struct { >>>>>>>>>> Elf32_Sword d_tag; >>>>>>>>>> union { >>>>>>>>>> Elf32_Word d_val; >>>>>>>>>> Elf32_Addr d_ptr; >>>>>>>>>> } d_un; >>>>>>>>>> } Elf32_Dyn; >>>>>>>>>> - 0x11 entries >>>>>>>>>> [00] 0x0010, 0x DT_SYMBOLIC, (ignored) >>>>>>>>>> [01] 0x0004, 0x38059994 DT_HASH, points to .hash >>>>>>>>>> [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr >>>>>>>>>> [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym >>>>>>>>>> [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr >>>>>>>>>> [05] 0x000B, 0x0010 DT_SYMENT, ??? >>>>>>>>>> [06] 0x0015, 0x DT_DEBUG, ??? >>>>>>>>>> [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text >>>>>>>>>> [08] 0x0012, 0x14D8 DT_RELSZ, ??? >>>>>>>>>> >>>>>>>>>> >>>>>>>>> How big DT_REL is >>>>>>>>> >>>>>>>>> >>>>>>>>>> [09] 0x0013, 0x0008 DT_RELENT, ??? >>>>>>>>>> >>>>>>>>>> >>>>>>>>> hmm, cannot remeber :) >>>>>>>>> >>>>>>>>> >>>>>>>> How big an entry in DT_REL is >>>>>
Re: [U-Boot] Relocation size penalty calculation
Joakim Tjernlund wrote: > "J. William Campbell" wrote on 14/10/2009 > 01:48:52: > >> Joakim Tjernlund wrote: >> >>> Graeme Russ wrote on 13/10/2009 22:06:56: >>> >>> >>> >>>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund >>>> wrote: >>>> >>>> >>>>> Graeme Russ wrote on 13/10/2009 13:21:05: >>>>> >>>>> >>>>>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund >>>>>> wrote: >>>>>> >>>>>> >>>>>>> Graeme Russ wrote on 11/10/2009 12:47:19: >>>>>>> >>>>>>> >>>>>> [Massive Snip :)] >>>>>> >>>>>> >>>>>> >>>>>>>> So, all that is left are .dynsym and .dynamic ... >>>>>>>> .dynsym >>>>>>>> - Contains 70 entries (16 bytes each, 1120 bytes) >>>>>>>> - 44 entries mimic those entries in .got which are not relocated >>>>>>>> - 21 entries are the remaining symbols exported from the linker >>>>>>>> script >>>>>>>> - 4 entries are labels defined in inline asm and used in C >>>>>>>> >>>>>>>> >>>>>>> Try adding proper asm declarations. Look at what gcc >>>>>>> generates for a function/variable and mimic these. >>>>>>> >>>>>>> >>>>>> Thanks - Now .dynsym contains only exports from the linker script >>>>>> >>>>>> >>>>> :) >>>>> >>>>> >>>>>>>> - 1 entry is a NULL entry >>>>>>>> >>>>>>>> .dynamic >>>>>>>> - 88 bytes >>>>>>>> - Array of Elf32_Dyn >>>>>>>> - typedef struct { >>>>>>>> Elf32_Sword d_tag; >>>>>>>> union { >>>>>>>> Elf32_Word d_val; >>>>>>>> Elf32_Addr d_ptr; >>>>>>>> } d_un; >>>>>>>> } Elf32_Dyn; >>>>>>>> - 0x11 entries >>>>>>>> [00] 0x0010, 0x DT_SYMBOLIC, (ignored) >>>>>>>> [01] 0x0004, 0x38059994 DT_HASH, points to .hash >>>>>>>> [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr >>>>>>>> [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym >>>>>>>> [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr >>>>>>>> [05] 0x000B, 0x0010 DT_SYMENT, ??? >>>>>>>> [06] 0x0015, 0x DT_DEBUG, ??? >>>>>>>> [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text >>>>>>>> [08] 0x0012, 0x14D8 DT_RELSZ, ??? >>>>>>>> >>>>>>>> >>>>>>> How big DT_REL is >>>>>>> >>>>>>> >>>>>>>> [09] 0x0013, 0x0008 DT_RELENT, ??? >>>>>>>> >>>>>>>> >>>>>>> hmm, cannot remeber :) >>>>>>> >>>>>>> >>>>>> How big an entry in DT_REL is >>>>>> >>>>>> >>>>> Right, how could I forget :) >>>>> >>>>> >>>>>>>> [0a] 0x0016, 0x DT_TEXTREL, ??? >>>>>>>> >>>>>>>> >>>>>>> Oops, you got text relocations. This is generally a bad thing. >>>>>>> TEXTREL is commonly caused by asm code that arent truly pic so it needs >>>>>>> to modify the .text segment to adjust for relocation. >>>>>>> You should get rid of this one. Look for DT_TEXTREL in .o files to find >>>>>>> the culprit. >>>>>>> >>>>>>> >>>>>>> >>>>
Re: [U-Boot] Relocation size penalty calculation
Joakim Tjernlund wrote: > Graeme Russ wrote on 13/10/2009 22:06:56: > > >> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund >> wrote: >> >>> Graeme Russ wrote on 13/10/2009 13:21:05: >>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund wrote: > Graeme Russ wrote on 11/10/2009 12:47:19: > [Massive Snip :)] >> So, all that is left are .dynsym and .dynamic ... >> .dynsym >> - Contains 70 entries (16 bytes each, 1120 bytes) >> - 44 entries mimic those entries in .got which are not relocated >> - 21 entries are the remaining symbols exported from the linker >> script >> - 4 entries are labels defined in inline asm and used in C >> > Try adding proper asm declarations. Look at what gcc > generates for a function/variable and mimic these. > Thanks - Now .dynsym contains only exports from the linker script >>> :) >>> >> - 1 entry is a NULL entry >> >> .dynamic >> - 88 bytes >> - Array of Elf32_Dyn >> - typedef struct { >> Elf32_Sword d_tag; >> union { >> Elf32_Word d_val; >> Elf32_Addr d_ptr; >> } d_un; >> } Elf32_Dyn; >> - 0x11 entries >> [00] 0x0010, 0x DT_SYMBOLIC, (ignored) >> [01] 0x0004, 0x38059994 DT_HASH, points to .hash >> [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr >> [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym >> [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr >> [05] 0x000B, 0x0010 DT_SYMENT, ??? >> [06] 0x0015, 0x DT_DEBUG, ??? >> [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text >> [08] 0x0012, 0x14D8 DT_RELSZ, ??? >> > How big DT_REL is > >> [09] 0x0013, 0x0008 DT_RELENT, ??? >> > hmm, cannot remeber :) > How big an entry in DT_REL is >>> Right, how could I forget :) >>> >> [0a] 0x0016, 0x DT_TEXTREL, ??? >> > Oops, you got text relocations. This is generally a bad thing. > TEXTREL is commonly caused by asm code that arent truly pic so it needs > to modify the .text segment to adjust for relocation. > You should get rid of this one. Look for DT_TEXTREL in .o files to find > the culprit. > > Alas I cannot - The relocations are a result of loading a register with a return address when calling show_boot_progress in the very early stages of initialisation prior to the stack becoming available. The x86 does not allow direct access to the IP so the only way to find the 'current execution address' is to 'call' to the next instruction and pop the return address off the stack >>> hmm, same as ppc but that in it self should not cause a TEXREL, should it? >>> Ahh, the 'call' is absolute, not relative? I guess there is some way around >>> it >>> but it is not important ATM I guess. >>> >>> Evil idea, skip -fpic et. all and add the full reloc procedure >>> to relocate by rewriting directly in TEXT segment. Then you save space >>> but you need more relocation code. Something like dl_do_reloc from >>> uClibc. Wonder how much extra code that would be? Not too much I think. >>> >>> >> With the following flags >> >> PLATFORM_RELFLAGS += -fvisibility=hidden >> PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm >> PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions >> >> I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think >> this might mean I need the symbol table in the binary in order to resolve >> them >> > > Possibly, but I think you only need to add an offset to all those > relocs. > Almost right. The relocations specify a symbol value that needs to be added to the data in memory to relocate the reference. The symbol values involved should be the start of the text section for program references, the start of the uninitialized data section for bss references, and the start of the data section for initialized data and constants. So there are about four symbols whose value you need to keep. Take a look at http://refspecs.freestandards.org/elf/elf.pdf (which you have probably already looked at) and it tells you what to do with R_386_PC32 ad R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded will remove all the symbols you don't actually need, but I don't know that for sure. Note also that you can change the section flags of a section marked noload to load. Best Regards, Bill Campbell > Jokce > > ___ > U-Boot mailing list > U-Boot@lists.denx.de > ht
Re: [U-Boot] Relocation size penalty calculation
Joakim Tjernlund wrote: > Graeme Russ wrote on 13/10/2009 13:21:05: > >> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund >> wrote: >> >>> Graeme Russ wrote on 11/10/2009 12:47:19: >>> >> [Massive Snip :)] >> >> So, all that is left are .dynsym and .dynamic ... .dynsym - Contains 70 entries (16 bytes each, 1120 bytes) - 44 entries mimic those entries in .got which are not relocated - 21 entries are the remaining symbols exported from the linker script - 4 entries are labels defined in inline asm and used in C >>> Try adding proper asm declarations. Look at what gcc >>> generates for a function/variable and mimic these. >>> >> Thanks - Now .dynsym contains only exports from the linker script >> > :) > - 1 entry is a NULL entry .dynamic - 88 bytes - Array of Elf32_Dyn - typedef struct { Elf32_Sword d_tag; union { Elf32_Word d_val; Elf32_Addr d_ptr; } d_un; } Elf32_Dyn; - 0x11 entries [00] 0x0010, 0x DT_SYMBOLIC, (ignored) [01] 0x0004, 0x38059994 DT_HASH, points to .hash [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr [05] 0x000B, 0x0010 DT_SYMENT, ??? [06] 0x0015, 0x DT_DEBUG, ??? [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text [08] 0x0012, 0x14D8 DT_RELSZ, ??? >>> How big DT_REL is >>> [09] 0x0013, 0x0008 DT_RELENT, ??? >>> hmm, cannot remeber :) >>> >> How big an entry in DT_REL is >> > > Right, how could I forget :) > [0a] 0x0016, 0x DT_TEXTREL, ??? >>> Oops, you got text relocations. This is generally a bad thing. >>> TEXTREL is commonly caused by asm code that arent truly pic so it needs >>> to modify the .text segment to adjust for relocation. >>> You should get rid of this one. Look for DT_TEXTREL in .o files to find >>> the culprit. >>> >>> >> Alas I cannot - The relocations are a result of loading a register with a >> return address when calling show_boot_progress in the very early stages of >> initialisation prior to the stack becoming available. The x86 does not >> allow direct access to the IP so the only way to find the 'current >> execution address' is to 'call' to the next instruction and pop the return >> address off the stack >> > > hmm, same as ppc but that in it self should not cause a TEXREL, should it? > Ahh, the 'call' is absolute, not relative? I guess there is some way around it > but it is not important ATM I guess. > > Evil idea, skip -fpic et. all and add the full reloc procedure > to relocate by rewriting directly in TEXT segment. Then you save space > but you need more relocation code. Something like dl_do_reloc from > uClibc. Wonder how much extra code that would be? Not too much I think. > I think this approach will turn out to be a big win. At present, the problem with just using the relocs is that objcopy is stripping them out when u-boot.bin is created, as I understand it. It seems this can be solved by changing the command switches appropriately, like using --strip-unneeded. In any case, there is some combination of switches that will preserve the relocation data. The executable code will get smaller, there will be no .got, and the relocation data will be larger (than with -fpic). In total size, it probably will be slightly smaller, but that is a guess. The most important benefit of this approach is that it will work for all architectures, thereby solving the problem once and forever! Even if the result is a bit larger, the RAM footprint will be reduced by the smaller object code size (since the relocation data need not be copied into ram).Having this approach as an option would be real nice, since it would always "just work". Best Regards, Bill Campbell > >> This is not a problem because this is very low-level init that is not >> called once relocated into RAM - These relocations can be safely ignored >> > > [0b] 0x6FFA, 0x0236 ???, Entries in .rel.dyn [0c] 0x, 0x DT_NULL, End of Array [0d] 0x, 0x DT_NULL, End of Array [0e] 0x, 0x DT_NULL, End of Array [0f] 0x, 0x DT_NULL, End of Array [10] 0x, 0x DT_NULL, End of Array I think some more investigation into the need for .dynsym and .dynamic is still required... >> .dynsym may still be required if only for accessing the __u_boot_cmd >> structure. However, I may be able to
Re: [U-Boot] Relocation size penalty calculation
Joakim Tjernlund wrote: >> On Fri, Oct 9, 2009 at 9:27 AM, J. William Campbell >> wrote: >> >>> Graeme Russ wrote: >>> >>>> On Fri, Oct 9, 2009 at 2:58 AM, J. William Campbell >>>> wrote: >>>> >>>> >>>>> Graeme Russ wrote: >>>>> >>>>> >>>>>> Out of curiosity, I wanted to see just how much of a size penalty I am >>>>>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are >>>>>> the results (fixed width font will help - its space, not tab, >>>>>> formatted): >>>>>> >>>>>> Section non-reloc reloc >>>>>> --- >>>>>> .text000118c4 000137fc <- 0x1f38 bytes (~8kB) bigger >>>>>> .rodata 5bad 59d0 >>>>>> .interp n/a 0013 >>>>>> .dynstr n/a 0648 >>>>>> .hashn/a 0428 >>>>>> .eh_frame3268 34fc >>>>>> .data0a6c 01dc >>>>>> .data.reln/a 0098 >>>>>> .data.rel.ro.local n/a 0178 >>>>>> .data.rel.local n/a 07e4 >>>>>> .got 01f0 >>>>>> .got.plt n/a 000c >>>>>> .rel.got n/a 03e0 >>>>>> .rel.dyn n/a 1228 >>>>>> .dynsym n/a 0850 >>>>>> .dynamic n/a 0080 >>>>>> .u_boot_cmd 03c0 03c0 >>>>>> .bss 1a34 1a34 >>>>>> .realmode0166 0166 >>>>>> .bios053e 053e >>>>>> === >>>>>> Total0001d5dd 00022287 <- 0x4caa bytes (~19kB) bigger >>>>>> >>>>>> Its more than a 16% increase in size!!! >>>>>> >>>>>> .text accounts for a little under half of the total bloat, and of that, >>>>>> the crude dynamic loader accounts for only 341 bytes >>>>>> >>>>>> >>>>>> >>>>> Hi Graeme, >>>>> I would be interested in a third option (column), the x86 build with >>>>> just -mrelocateable but NOT -fpic. It will not be definitive because >>>>> there >>>>> will be extra code that references the GOT and missing code to do some of >>>>> the relocation, but it would still be interesting. >>>>> >>>>> >>>> x86 does not have -mrelocatable. This is a PPC only option :( >>>> >>>> >>> Hi Graeme, >>> You are unfortunately correct. However, I wonder if we can get >>> essentially the same result by executing the final ld step with the >>> --emit-relocs switch included. This may also include some "extra" sections >>> that we would want to strip out, but if it works, it could give all >>> ELF-based systems a way to a relocatable u-boot. >>> >>> >> I don't think --emit-relocs is necessary with -pic. I haven't gone through >> all the permutations to see if there is a smaller option, but gcc -fpic and >> ld -pie creates enough information to perform relocation on the x86 >> platform >> > > It is true that --emit-relocs is not required when -pic and -pie are used instead. However, pic and pie are designed to allow shared code (libraries) to appear at different logical addresses in several programs without altering the text. This is grand overkill for what we need, which is the ability to relocate the code. The -pic and -pie code will be larger than the code without pic and pie. How much larger is a good question. On the PPC, it is larger but not much larger, because there are lots of registers available and one is almost for sure got (no pun intended) the magic relocation constant(s) in it. On the 386 with many fewer registers, pic and pie will cause the code to be percentage-wise larger than on the PPC. Thus avoiding pic and pie is a Good Thing in most cases. > Try -fvisibility=hidden > I assume the -fvisibility=hidden is suggested in order to reduce (eliminate) the symbol table from the output, which we don't need because there are assumed to be no undefined symbols in our final ld. If that works, great! I was assuming we might need a custom "strip" program to delete any sections that we don't need, but this sounds easier if it gets them all. Best Regards, Bill Campbell > Jocke > > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] Relocation size penalty calculation
Graeme Russ wrote: > On Fri, Oct 9, 2009 at 2:58 AM, J. William Campbell > wrote: > >> Graeme Russ wrote: >> >>> Out of curiosity, I wanted to see just how much of a size penalty I am >>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are >>> the results (fixed width font will help - its space, not tab, formatted): >>> >>> Section non-reloc reloc >>> --- >>> .text000118c4 000137fc <- 0x1f38 bytes (~8kB) bigger >>> .rodata 5bad 59d0 >>> .interp n/a 0013 >>> .dynstr n/a 0648 >>> .hashn/a 0428 >>> .eh_frame3268 34fc >>> .data0a6c 01dc >>> .data.reln/a 0098 >>> .data.rel.ro.local n/a 0178 >>> .data.rel.local n/a 07e4 >>> .got 01f0 >>> .got.plt n/a 000c >>> .rel.got n/a 03e0 >>> .rel.dyn n/a 1228 >>> .dynsym n/a 0850 >>> .dynamic n/a 0080 >>> .u_boot_cmd 03c0 03c0 >>> .bss 1a34 1a34 >>> .realmode0166 0166 >>> .bios053e 053e >>> === >>> Total0001d5dd 00022287 <- 0x4caa bytes (~19kB) bigger >>> >>> Its more than a 16% increase in size!!! >>> >>> .text accounts for a little under half of the total bloat, and of that, >>> the crude dynamic loader accounts for only 341 bytes >>> >>> >> Hi Graeme, >> I would be interested in a third option (column), the x86 build with >> just -mrelocateable but NOT -fpic. It will not be definitive because there >> will be extra code that references the GOT and missing code to do some of >> the relocation, but it would still be interesting. >> > > x86 does not have -mrelocatable. This is a PPC only option :( > Hi Graeme, You are unfortunately correct. However, I wonder if we can get essentially the same result by executing the final ld step with the --emit-relocs switch included. This may also include some "extra" sections that we would want to strip out, but if it works, it could give all ELF-based systems a way to a relocatable u-boot. Best Regards, Bill Campbell ** > > >> Best Regards, >> Bill Campbell >> >>> Have any metrics been done for PPC? >>> >>> Regards, >>> >>> Graeme >>> > > Once the reloc branch has been merged, how many arches are left which do > not support relocation? > > Regards, > > Graeme > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] Relocation size penalty calculation
Peter Tyser wrote: > On Thu, 2009-10-08 at 08:53 -0700, J. William Campbell wrote: > >> Peter Tyser wrote: >> >>> On Thu, 2009-10-08 at 22:54 +1100, Graeme Russ wrote: >>> >>> >>>> Out of curiosity, I wanted to see just how much of a size penalty I am >>>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are >>>> the results (fixed width font will help - its space, not tab, formatted): >>>> >>>> Section non-reloc reloc >>>> --- >>>> .text000118c4 000137fc <- 0x1f38 bytes (~8kB) bigger >>>> .rodata 5bad 59d0 >>>> .interp n/a 0013 >>>> .dynstr n/a 0648 >>>> .hashn/a 0428 >>>> .eh_frame3268 34fc >>>> .data0a6c 01dc >>>> .data.reln/a 0098 >>>> .data.rel.ro.local n/a 0178 >>>> .data.rel.local n/a 07e4 >>>> .got 01f0 >>>> .got.plt n/a 000c >>>> .rel.got n/a 03e0 >>>> .rel.dyn n/a 1228 >>>> .dynsym n/a 0850 >>>> .dynamic n/a 0080 >>>> .u_boot_cmd 03c0 03c0 >>>> .bss 1a34 1a34 >>>> .realmode0166 0166 >>>> .bios053e 053e >>>> === >>>> Total0001d5dd 00022287 <- 0x4caa bytes (~19kB) bigger >>>> >>>> Its more than a 16% increase in size!!! >>>> >>>> .text accounts for a little under half of the total bloat, and of that, >>>> the crude dynamic loader accounts for only 341 bytes >>>> >>>> Have any metrics been done for PPC? >>>> >>>> >>> Things actually improve a little bit when we use -mrelocatable and get >>> rid of all the manual "+= gd->reloc_off" fixups: >>> >>> 1) Top of mainline on XPedite5370: >>>textdata bss dec hex filename >>> 308612 24488 33172 366272 596c0 u-boot >>> >>> 2) Top of "reloc" branch on XPedite5370 (ie -mrelocatable): >>>textdata bss dec hex filename >>> 303704 28644 33156 365504 593c0 u-boot >>> >>> >>> >> Hi Peter, >> Just to be clear, the total text+data length of u-boot with the >> "manual" relocations (#1) is LARGER than the text+data length of u-boot >> with the "manual" relocations removed and the necessary centralized >> relocation code added, along with any additional data sections required >> by -mrelocateable (#2), by 768 (dec) bytes? >> > > Hi Bill, > Doah, looks like I chose a bad board as an example. The XPedite5370 > already had -mrelocatable defined in its own > board/xes/xpedite5370/config.mk in mainline, so the above comparison > should be ignored as both builds used -mrelocatable. > > Here's some *real* results from the MPC8548CDS: > 1) Top of mainline: >text data bss dec hex filename > 219968 17052 22992 260012 3f7ac u-boot > > 2) Top of "reloc" branch (ie -mrelocatable) >text data bss dec hex filename > 219192 20640 22980 262812 4029c u-boot > > So the reloc branch is 2.7K bigger for the MPC8548CDS. > Hi Peter, OK, that's more like it! A 1.2 % size increase in ROM seems like a very small price to pay for a truly relocatable u-boot image that will run on any size memory without the programmer having to actively worry about what may need relocating as code is written. . Also, it should be noted that the size increase in 2) is mostly in relocation segments that do not need to be copied into ram, so the ram footprint should be smaller for 2) than 1). The relocation code itself could also be placed is a segment that is not copied into ram, although that may be more trouble than it is worth. I am looking forward to Graeme's results with the 386. I expect that it will not be quite so favorable, perhaps a 4 or 5% size increase for -mrelocatable over an absolute build. However, -mrelocatable vs. -fpic may be comparable, with -mrelocatable actually winning. But then again, I could be totally wrong! Best Regards, Bill Campbell > Best, > Peter > > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] Relocation size penalty calculation
Graeme Russ wrote: > Out of curiosity, I wanted to see just how much of a size penalty I am > incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are > the results (fixed width font will help - its space, not tab, formatted): > > Section non-reloc reloc > --- > .text000118c4 000137fc <- 0x1f38 bytes (~8kB) bigger > .rodata 5bad 59d0 > .interp n/a 0013 > .dynstr n/a 0648 > .hashn/a 0428 > .eh_frame3268 34fc > .data0a6c 01dc > .data.reln/a 0098 > .data.rel.ro.local n/a 0178 > .data.rel.local n/a 07e4 > .got 01f0 > .got.plt n/a 000c > .rel.got n/a 03e0 > .rel.dyn n/a 1228 > .dynsym n/a 0850 > .dynamic n/a 0080 > .u_boot_cmd 03c0 03c0 > .bss 1a34 1a34 > .realmode0166 0166 > .bios053e 053e > === > Total0001d5dd 00022287 <- 0x4caa bytes (~19kB) bigger > > Its more than a 16% increase in size!!! > > .text accounts for a little under half of the total bloat, and of that, > the crude dynamic loader accounts for only 341 bytes > Hi Graeme, I would be interested in a third option (column), the x86 build with just -mrelocateable but NOT -fpic. It will not be definitive because there will be extra code that references the GOT and missing code to do some of the relocation, but it would still be interesting. Best Regards, Bill Campbell > Have any metrics been done for PPC? > > Regards, > > Graeme > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] Relocation size penalty calculation
Peter Tyser wrote: > On Thu, 2009-10-08 at 22:54 +1100, Graeme Russ wrote: > >> Out of curiosity, I wanted to see just how much of a size penalty I am >> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are >> the results (fixed width font will help - its space, not tab, formatted): >> >> Section non-reloc reloc >> --- >> .text000118c4 000137fc <- 0x1f38 bytes (~8kB) bigger >> .rodata 5bad 59d0 >> .interp n/a 0013 >> .dynstr n/a 0648 >> .hashn/a 0428 >> .eh_frame3268 34fc >> .data0a6c 01dc >> .data.reln/a 0098 >> .data.rel.ro.local n/a 0178 >> .data.rel.local n/a 07e4 >> .got 01f0 >> .got.plt n/a 000c >> .rel.got n/a 03e0 >> .rel.dyn n/a 1228 >> .dynsym n/a 0850 >> .dynamic n/a 0080 >> .u_boot_cmd 03c0 03c0 >> .bss 1a34 1a34 >> .realmode0166 0166 >> .bios053e 053e >> === >> Total0001d5dd 00022287 <- 0x4caa bytes (~19kB) bigger >> >> Its more than a 16% increase in size!!! >> >> .text accounts for a little under half of the total bloat, and of that, >> the crude dynamic loader accounts for only 341 bytes >> >> Have any metrics been done for PPC? >> > > Things actually improve a little bit when we use -mrelocatable and get > rid of all the manual "+= gd->reloc_off" fixups: > > 1) Top of mainline on XPedite5370: >text data bss dec hex filename > 308612 24488 33172 366272 596c0 u-boot > > 2) Top of "reloc" branch on XPedite5370 (ie -mrelocatable): >text data bss dec hex filename > 303704 28644 33156 365504 593c0 u-boot > > Hi Peter, Just to be clear, the total text+data length of u-boot with the "manual" relocations (#1) is LARGER than the text+data length of u-boot with the "manual" relocations removed and the necessary centralized relocation code added, along with any additional data sections required by -mrelocateable (#2), by 768 (dec) bytes? And both cases (1 and 2) work equivalently? Best Regards, Bill Campbell. > For fun: > 3) #2 but with s/-mrelocatable/-fpic/ (probably doesn't boot): >text data bss dec hex filename > 303704 24472 33156 361332 58374 u-boot > > > There may be some other changes that affect the size between mainline > and "reloc", but their sizes are in the same general ballpark. > > Best, > Peter > > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0
Joakim Tjernlund wrote: >> On Wed, Oct 7, 2009 at 5:55 PM, Wolfgang Denk wrote: >> >>> Dear Graeme Russ, >>> >>> In message >>> you wrote: >>> I think that even the -mrelocatable / .fixup method may not be needed at all. -pie / -pic used by themselves creates enough information for an OS dynamic loader to relocate an executable, so why not U-Boot? Given that the type and location of each section is easily determined, a striped down dynamic loader can provide a platform-independent relocation scheme. >>> One reason for not using ELF images for the boot loader is size. The >>> ELF header alone is often more than we would be willing to accept, not >>> to mention the additional code. >>> >>> >> But the headers get stripped from the final binary. All we are left with in >> order to locate the ELF section data are the symbols exported from the >> linker script >> >> The extra code is only three very tight for-loops. I had them wrapped in >> functions to improve readability, but they are good inline candidates (only >> called once each) and I doubt they use much code space at all (I'll send >> through actual numbers soon) >> > > But how much space in the extra sections you link in? > > if size is comparable with fixup ptrs we should probably > consider using the same for ppc. Then we can use -fpic/-fpie and > that is significantly smaller then -fPIC on PPC. > > >> Question is, does -mrelocatable result in smaller .got (et al) are is the >> .fixup section adding extra size for the sake of ease of implementation? >> > > fixup section expands with lots of ptrs, then fixup is placed just after .got > > > There is also a trade-off in using the -mrelocatable / .fixup method that should be considered in the general case. -fPIC or even -fpic code is almost always larger than the code generated for -mrelocateable, so it it the final size of the object blob in prom that matters, not just the length of the fixup segment. This effect is architecture dependent, where PIC is much cheaper on a PPC than on an Intel 386. So the "best" combination may depend on the chip. I would also strongly encourage the -mrelocateable method be always available as an option. Historically, -fPIC has been buggy in initial releases of GCC for new architectures, probably because you can do useful work with just relocateable code and PIC can come "later". So it may be that u-boot would need an -mrelocateable approach for some period until -fPIC/-fpic worked. In any case, it would be a good fallback if one suspected a bug in PIC. Best Regards, Bill Campbell ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0
Peter Tyser wrote: > On Tue, 2009-10-06 at 13:34 -0700, J. William Campbell wrote: > >> Peter Tyser wrote: >> >>> On Tue, 2009-10-06 at 19:51 +0200, Wolfgang Denk wrote: >>> >>> >>>> Dear Peter Tyser, >>>> >>>> In message <1254843932.24664.2083.ca...@localhost.localdomain> you wrote: >>>> >>>> >>>>> I personally like the current implementation of putting the bss after >>>>> the entire U-Boot image. It keeps U-Boot's code, malloc pool, stack, >>>>> bss, etc all in the same general area which is nice, and has the side >>>>> benefit that the bootpg won't be overwritten. >>>>> >>>>> >>>> OK, if you think so... >>>> >>>> >>>> >>>>> I know ORing in 0x10 is a bit ugly, but what's the real downside of >>>>> doing it? >>>>> >>>>> >>>> Nothing. I just hate to allocate the bss at 0x0, because this is >>>> actually incorrect - it's the result of an address overflow / >>>> truncation, and pretty much misleading to someone trying to read and >>>> understand the code. For the linked image, it does not _look_ as if >>>> the bss was located _after_ the U-Boot image, it looks detached and >>>> allocated in low RAM. >>>> >>>> >>> Do you have a preference Kumar? You're probably going to be the first >>> in line to have to deal with any resulting confusion:) >>> >>> I personally would rank the options: >>> 1. OR in an offset to the bss address and leave some good comments in >>> the linker script and commit message >>> >>> 2. Make the bss the last section like other PPC boards which would >>> result in the bootpg sometimes being overwritten >>> >>> 3. Put the bss at an arbitrary address >>> >>> >> FWIW, I think an arbitrary address disjoint from the u-boot addresses is >> best. While u-boot is in ROM, you can't use the bss anyway. The bss will >> actually be located at an address selected by the u-boot code itself >> after memory is sized. All references to the bss will be re-located by >> subtracting the arbitrary start address and adding the run-time chosen >> start address. So the linked start address is not important, except that >> is cannot be NULL or it may confuse the relocation code that doesn't >> want to re-locate NULL pointers. Some of the confusion in this >> discussion probably stems from the fact that the linker scripts make the >> bss look like "part of u-boot", when it is really not. It is just a >> chunk of "zero'ed" ram, located anywhere the u-boot code decides to put >> it. An arbitrary strange address would make this more apparent. >> > > Hi Bill, > What's the advantage of having the bss not be located next to U-Boot? > The big disadvantage of picking an arbitrary address for the bss is that > there's now 1 more magical section of SDRAM that the user needs to know > shouldn't be used. I already field enough question from people that > corrupt their exception vectors or stack/malloc pool/u-boot code, I > don't want to add more bss questions:) > Hi Peter, The point is that the address chosen for the ld step is NOT the address in ram where the bss will reside anyway. This address can overlap the exception vectors, stack, or even the u-boot code itself and it wouldn't matter (other than possible confusion). The actual physical address where the bss and u-boot itself resides is COMPUTED by u-boot after it sizes memory. u-boot only needs to know how big the section is in order to allow enough room. All references to the bss will then be re-located correctly. Where the bss actually ends up is a function of u-boot code. It may be on some processors that the computation of bss start is done assuming the bss is adjacent to u-boot in the original memory map, but if so, it is an un-necessary restriction. All that is required is a "safe" chunk of ram, which is also what is needed for stack and malloc area and should be chosen in a similar manner. So at run time, bss probably ends up adjacent to u-boot in ram because that's how it was coded, but at ld time it shouldn't matter. Best Regards, Bill Campbell > Best, > Peter > > PS. please keep the original email recipients on CC > > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0
Peter Tyser wrote: > On Tue, 2009-10-06 at 19:51 +0200, Wolfgang Denk wrote: > >> Dear Peter Tyser, >> >> In message <1254843932.24664.2083.ca...@localhost.localdomain> you wrote: >> >>> I personally like the current implementation of putting the bss after >>> the entire U-Boot image. It keeps U-Boot's code, malloc pool, stack, >>> bss, etc all in the same general area which is nice, and has the side >>> benefit that the bootpg won't be overwritten. >>> >> OK, if you think so... >> >> >>> I know ORing in 0x10 is a bit ugly, but what's the real downside of >>> doing it? >>> >> Nothing. I just hate to allocate the bss at 0x0, because this is >> actually incorrect - it's the result of an address overflow / >> truncation, and pretty much misleading to someone trying to read and >> understand the code. For the linked image, it does not _look_ as if >> the bss was located _after_ the U-Boot image, it looks detached and >> allocated in low RAM. >> > > Do you have a preference Kumar? You're probably going to be the first > in line to have to deal with any resulting confusion:) > > I personally would rank the options: > 1. OR in an offset to the bss address and leave some good comments in > the linker script and commit message > > 2. Make the bss the last section like other PPC boards which would > result in the bootpg sometimes being overwritten > > 3. Put the bss at an arbitrary address > FWIW, I think an arbitrary address disjoint from the u-boot addresses is best. While u-boot is in ROM, you can't use the bss anyway. The bss will actually be located at an address selected by the u-boot code itself after memory is sized. All references to the bss will be re-located by subtracting the arbitrary start address and adding the run-time chosen start address. So the linked start address is not important, except that is cannot be NULL or it may confuse the relocation code that doesn't want to re-locate NULL pointers. Some of the confusion in this discussion probably stems from the fact that the linker scripts make the bss look like "part of u-boot", when it is really not. It is just a chunk of "zero'ed" ram, located anywhere the u-boot code decides to put it. An arbitrary strange address would make this more apparent. Best Regards, Bill Campbell > Best, > Peter > > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > > > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] [PATCH] CFI Driver: Reset watchdog timer after each flash operation
Mike Frysinger wrote: > On Friday 02 October 2009 08:30:51 Wolfgang Denk wrote: > >> Ingo van Lil wrote: >> >>> The CFI driver does not reset the device's watchdog, so long-running >>> flash operations will cause the watchdog timer to expire. A comment in >>> flash_status_check() suggests that udelay() is expected to reset the >>> watchdog, but I can't find any architecture where it does. >>> >> Please have a closer look, then. On PowerPC, udelay() >> ["lib_ppc/time.c"] calls wait_ticks(), which in turn >> ["lib_ppc/ticks.S"] calls WATCHDOG_RESET >> >> If this is missing in other architectures, it should be fixed at the >> root cause, i. e. in udelay() or in the respective support routines. >> > > Blackfin is missing it as well as i really had no idea it was supposed to be > there. certainly no doc states this requirement. perhaps it'd make sense to > break apart the common stuff to a common udelay() that does things like call > the watchdog and then call the implementation __udelay(). there should be at > least a doc/README.arch that includes these kind of details ... > -mike > > At this point it might be appropriate to ask if including such a reset in udelay() is a good idea. The way it is, no "infinite loop" in u-boot that contains a udelay() will ever allow the watchdog to time out. That restriction is somewhat non-obvious. I would argue that there are basically two classes of commands in u-boot, those that should execute so fast that there is no way the watchdog should be triggered, and commands that may take "a long time" during which the watchdog may erroneously be triggered if there is no provision made to reset it. In the second case, the resetting of the watchdog must be placed where "measurable progress" towards command completion is being made, i.e. there is a certainty that we are getting closer to a command complete. Just because the program invokes udelay, there is no assurance that measurable progress is being made. The udelay may be part of a process that will never complete. Therefore, having a watchdog reset in udelay seems like a less than optimal idea in general. Maybe now would be a good time to look at removing it? Bill Campbell > > > ___ > U-Boot mailing list > U-Boot@lists.denx.de > http://lists.denx.de/mailman/listinfo/u-boot > ___ U-Boot mailing list U-Boot@lists.denx.de http://lists.denx.de/mailman/listinfo/u-boot
Re: [U-Boot] Virtual addresses, u-boot, and the MMU
Becky Bruce wrote: > > On Sep 2, 2009, at 2:59 AM, Wolfgang Denk wrote: > >> Dear "J. William Campbell", >> >> In message <4a9d99b1.1010...@comcast.net> you wrote: >>> >> ... >>>> Becky then posted the summary of this discussion here: >>>> >>>> http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/50705 >> ... >>> In quick summary, for the next few years, we will require that all >>> "important" physical addresses have corresponding virtual addresses. >> >> ...unless there are good reasons to deviate from that rule, but these >> cases are expected to be rare exceptions from the rule. >> >> >>>> In any case, I think we should be careful not to mix things: what we >>>> are discussing here are address mappings. What we are not discussing >>>> is specific memory properties like being cached/uncached, guarded/ >>>> non-guarded, etc. >>>> >>>> Such properties are important, too, but they need to get handled >>>> through a separate interface. >>>> >>> Here is where I am quite sure you are going to have a problem. In very >>> many CPUs, cache control and memory management are joined at the hip. >>> Some systems have no easy way to enable and disable (D,I) cache >>> globally, it is only doable on a page or segment basis. The PPC >>> hardware >>> has a relatively low cost way to do so, but not all architectures do. >> >> I am well aware of this problem, which is one of the reasons that the >> majority of systems is running with data cache turned off. Even >> PowerPC has DC off (at least after relocation to RAM), ARM does not >> implement cache support yet, etc. >> >> When we state that U-Boot is a boot loader and thus should be kept >> simple, this more or less logically results in the consequence that >> if it's difficult to enable the DC (on some systems), then just don't >> do it, then. >> >> Nobody enforces you to enable caches when you find it hard to do. >> >> >>> Very True. I did forget about the read being just a memory >>> reference. So >>> if we desire the flash to be cached, it would have to "normally" be >>> cached for reads to take advantage of the operation. >> >> ACK. >> >>> Thanks for looking at this. It therefore seems to me that adding an >>> "uncache(virtual address)" operation (that may return a substitute >>> address for the actual write to the flash) followed by a >>> "restore_cache()" operation inside the flash driver write routine >>> should >>> work. The uncache routine would do nothing if the flash is not >>> cached to >> >> This is where Detlev's comment about using the chance to define a >> cache API comes into play. >> >> Yes, we probably should create a set of functions like >> >> enable_data_cache(address, size); >> and >> disable_data_cache(address, size); >> >> which would turn on resp. off the caching attributes on the given >> memory range. >> >>> begin with, would globally turn off data cache if that is easy to >>> do, or >>> would provide an alternate virtual address to be used in the write. >>> That >> >> This is where I disagree. >> >> I'm not really deep enough in the implementation details and thus >> would appreciate comments for example from Becky and Stefan. In my >> opinion, turning on or off the cache on an address range should be >> implemented as outlined above, i. e. as an operation changing the >> caching properties of the address range. > > This makes sense to me. The disable function would need to flush the > range from the cache, but that's the only difficulty I forsee. > However, I dug up some AVR32 docs, and it looks like the whole dual > cacheable/CI mapping thing may be architectural. The architecture > seems to specify a virtual memory map for privileged state, and part > of the VA range is not translated by the MMU, but has a default > translation. There appear to be segments in the untranslated VA space > that map to the same PA, one cacheable and the other not. Unfortunately, more than one architecture that has problems turning off cache in an arbitrary region of memory. If a system has a "global" cache disable, that would be the easiest option. Since u-boot is single threaded, we don't have to worry about the cache not being turned back on b
Re: [U-Boot] Virtual addresses, u-boot, and the MMU
Wolfgang Denk wrote: > Dear "J. William Campbell", > > In message <4a9d5ef2.4030...@comcast.net> you wrote: > >> I have followed the recent discussions about problems in the CFI >> driver caused by the need to change the attributes of the address at >> which the flash is mapped. This discussion has raised some questions in >> my mind regarding the assumptions u-boot makes regarding the behavior of >> the addresses used in the code. I am writing this message for comments >> that may correct any mis-understandings I have. First, the addresses >> used by the u-boot program to reference memory and code are all >> "virtual" addresses (VA) because they go through the MMU to become a >> physical address(PA). If there is no MMU, the virtual address and the >> physical address are identical. >> > > Even if there is a MMU, we keep it "switched off" on most systems, or > otherwise set it up in such a way that there is still a 1:1 mapping, > i. e. the virtual address and physical address are identical, too. > > There have been several discussions concerning this topic on IRC > (#u-boot at freenode); I'll try to summarize these here - not sure > though if I don't miss anything: please feel free to complement what > might be missing. > > > Becky then posted the summary of this discussion here: > > http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/50705 > > > Note that there was a general agreement among those who raised their > voices. > > In quick summary, for the next few years, we will require that all "important" physical addresses have corresponding virtual addresses. Some limited support for mapping in "other resources" may be provided at an operator interface level, but it will be quite limited. OK, seems reasonable to me. > >> The "normal", or legacy,policy for u-boot is to arrange a memory >> map such that all physical addresses are mapped to some virtual address. >> Originally, the mapping were such that the VA was actually == the PA, >> but today on some CPUs, this is not possible. When the size of the >> physical address space exceeds the size of the virtual address space, >> the VA may not =- the PA numerically, but there is a one-to-one >> correspondence. It MAY also acceptable to map the same PA so that it >> appears more than once in the address space (VA), but if this is done, >> > > This may or may not be possible. It may even make sense or be needed > on some systems, and it may be impossible to do on others. > > In any case, I think we should be careful not to mix things: what we > are discussing here are address mappings. What we are not discussing > is specific memory properties like being cached/uncached, guarded/ > non-guarded, etc. > > Such properties are important, too, but they need to get handled > through a separate interface. > Here is where I am quite sure you are going to have a problem. In very many CPUs, cache control and memory management are joined at the hip. Some systems have no easy way to enable and disable (D,I) cache globally, it is only doable on a page or segment basis. The PPC hardware has a relatively low cost way to do so, but not all architectures do. > >> Becky Bruce "re-wrote the driver to use VAs instead of PAs." I am >> not exactly sure what this means, but I assume it meant allowing the VA >> referencing the flash >> to be distinct from the PA where the flash "lives" (and may require 36 >> bits on a PPC system to represent the PA). Does the driver re-map >> > > I think the information provided above sheds more light on this. > Yes, it did. > >> portions of the flash in order to access them? If the flash is really >> large, I can certainly see a need to do so. However, I assume on "medium >> size" flashes, it does not need to remap. In that case, don't all >> references just go through the MMU and get translated? The VA != PA, but >> from the point of view of u-boot, the VA is the only address that >> matters. The AVR32 certainly does not map flash dynamically, so it would >> not matter on that CPU. >> > > OK. > > >> The issue with the CFI driver on the AVR32 is that it needs to >> disable cache on the address space of the flash memory when it is >> writing to the flash. This apparently is not trivial to do, but there is >> > > Actually this is not specific to the AVR32, and so far most systems > simply do not enable caches at all on the flash memor
Re: [U-Boot] Virtual addresses, u-boot, and the MMU
I have followed the recent discussions about problems in the CFI driver caused by the need to change the attributes of the address at which the flash is mapped. This discussion has raised some questions in my mind regarding the assumptions u-boot makes regarding the behavior of the addresses used in the code. I am writing this message for comments that may correct any mis-understandings I have. First, the addresses used by the u-boot program to reference memory and code are all "virtual" addresses (VA) because they go through the MMU to become a physical address(PA). If there is no MMU, the virtual address and the physical address are identical. The "normal", or legacy,policy for u-boot is to arrange a memory map such that all physical addresses are mapped to some virtual address. Originally, the mapping were such that the VA was actually == the PA, but today on some CPUs, this is not possible. When the size of the physical address space exceeds the size of the virtual address space, the VA may not =- the PA numerically, but there is a one-to-one correspondence. It MAY also acceptable to map the same PA so that it appears more than once in the address space (VA), but if this is done, any references inside u-boot (such as in drivers) must be consistent, i.e. the flash chip must exist at exactly one virtual address from u-boots viewpoint. This is required because the u-boot drivers maintain pointers into the flash, and it is important to be able to compare these pointers to input virtual addresses. It is also necessary that the VA of the flash chip not change after the driver has been set up, or else any saved pointers will be wrong. Becky Bruce "re-wrote the driver to use VAs instead of PAs." I am not exactly sure what this means, but I assume it meant allowing the VA referencing the flash to be distinct from the PA where the flash "lives" (and may require 36 bits on a PPC system to represent the PA). Does the driver re-map portions of the flash in order to access them? If the flash is really large, I can certainly see a need to do so. However, I assume on "medium size" flashes, it does not need to remap. In that case, don't all references just go through the MMU and get translated? The VA != PA, but from the point of view of u-boot, the VA is the only address that matters. The AVR32 certainly does not map flash dynamically, so it would not matter on that CPU. The issue with the CFI driver on the AVR32 is that it needs to disable cache on the address space of the flash memory when it is writing to the flash. This apparently is not trivial to do, but there is a second mapping that does have the cache off. Wolfgang has recommended the creation of a function to turn off the cache for the flash area, and also (presumably) one to turn the flash back on when the write is complete. Haavard has at present a function that returns an alternate address with the cache already off that addresses the same memory. This wouldn't cause a problem if the mapping happened immediately before the actual copy operation took place and was used for nothing else. However, if it happens early on in the driver, the address will not match the structure set up by the rest of the flash code using the non-translated address. Therefore, I have the following questions: If the map_physmem() macro is removed from the driver on the AVR32, does the driver work if it is told that the flash PA=VA = the un-cached address? If not, why not? Shouldn't this be just like any CFI on an un-cached PPC address? The driver will be somewhat slower reading but otherwise it should work. If/when it does work, couldn't a map_in_cache() macro be placed directly in front of the read code that copies data from flash to other buffers. The macro would return an address of the same data referenced through a cached address if it exists. This address would go nowhere else and never be stored anywhere. This would speed up the copy operation for situations where it matters, and is applicable to all platforms that can do such a thing. The most general solution would be a call to map_in_cache/map_in_not_cache for both reads and writes in the CFI driver. These routines would return a "substitute" address (or the same input one), and may actually add another mapping dynamically, use a pre-existing appropriate mapping, just turn on/turn off data cache globally, or do nothing). At the end of the copy, map_restore() would put the map back By default, the assumption would be that the flash is not cached and the macros do nothing. Sounds simple to me, what have I overlooked? Bill Campbell > Haavard Skinnemoen wrote: > > > > >> Right...I'm beginning to doubt that anyone is familiar enough with the >> u-boot code, since everyone seems to have their own opinion about how >> things are supposed to work. >> >> To summarize, here are the possible ways to fix the problem as I see it: