Re: [U-Boot] [PATCH] nvme: add more cache flushes

2019-10-17 Thread J. William Campbell

Hi All,
On 10/17/2019 12:12 AM, Patrick Wildt wrote:

On Thu, Oct 17, 2019 at 03:08:59PM +0800, Bin Meng wrote:

Hi Patrick,

On Thu, Oct 17, 2019 at 2:44 PM Patrick Wildt  wrote:

On Thu, Oct 17, 2019 at 10:55:11AM +0800, Bin Meng wrote:

Hi Patrick,

On Wed, Oct 16, 2019 at 11:35 PM Patrick Wildt  wrote:

On Wed, Oct 16, 2019 at 06:11:23PM +0800, Bin Meng wrote:

On Mon, Oct 14, 2019 at 7:11 PM Patrick Wildt  wrote:

On an i.MX8MQ our nvme driver doesn't completely work since we are
missing a few cache flushes.  One is the prp list, which is an extra
buffer that we need to flush before handing it to the hardware.  Also
the block read/write operations needs more cache flushes on this SoC.

Signed-off-by: Patrick Wildt 
---
  drivers/nvme/nvme.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c
index 2444e0270f..69d5e3eedc 100644
--- a/drivers/nvme/nvme.c
+++ b/drivers/nvme/nvme.c
@@ -123,6 +123,9 @@ static int nvme_setup_prps(struct nvme_dev *dev, u64 *prp2,
 }
 *prp2 = (ulong)dev->prp_pool;

+   flush_dcache_range((ulong)dev->prp_pool, (ulong)dev->prp_pool +
+  dev->prp_entry_num * sizeof(u64));
+
 return 0;
  }

@@ -717,9 +720,10 @@ static ulong nvme_blk_rw(struct udevice *udev, lbaint_t 
blknr,
 u16 lbas = 1 << (dev->max_transfer_shift - ns->lba_shift);
 u64 total_lbas = blkcnt;

-   if (!read)
-   flush_dcache_range((unsigned long)buffer,
-  (unsigned long)buffer + total_len);
+   flush_dcache_range((unsigned long)buffer,
+  (unsigned long)buffer + total_len);

Why we need this for read?

I'm no processor engineer, but I have read (and "seen") the following
on ARMs.  If I'm wrong. please correct me.

Since the buffer is usually allocated cache-aligned on the stack,
it is very possible that this buffer was previously still used
as it's supposed to be used: as stack.  Thus, the caches can still
be filled, and are not yet evicted/flushed to memory.  Now it is
possible that between the DMA access from the device and our cache
invalidation, the CPU cache writes back its contents to memory,
overwriting whatever the NVMe just gave us.

OK, this makes sense. So if we allocate the buffer from the heap, we
should only care about flush on write, right? Can you test this?

If you're talking about having a bounce buffer:  You'd need to flush
it once upon allocation, because that part of the heap could have also
be used earlier by someone else.


I was not talking about bounce buffer. I mean the buffer allocated
from help and use that directly for DMA.

Regards,
Bin

If you allocate a buffer from the heap, you still need to make sure
to flush it once, since there's still the chance that you have used
it shortly earlier.  But it's less of an issue as on the stack.


The "rules" for cache management of DMA buffers for non-cache-coherent 
CPUs are immutable. It doesn't matter where the memory came from 
(static, heap, or stack). It may be in cache, and it may be dirty. You 
can't know  it is for sure "clean". It is assumed that the DMA buffers 
are allocated to begin on a cache line boundary and are a multiple of a 
cache line in length. If this is not the case, references by the CPU to 
locations outside the buffer can make the cache state of the buffer 
dirty, which is an error. It is also required that there be no accesses 
to the DMA buffer by the CPU while DMA is in progress. This is normally 
true by default, and if it isn't true, it is an error. The rules are 
then as follows:


On write, before DMA is started. the cache corresponding to the DMA 
buffer addresses MUST be flushed to ensure the desired content is 
transferred from cache to RAM before write DMA begins.


On read, before DMA is started, the cache corresponding to the DMA 
buffer addresses MUST be either invalidated or flushed to ensure that no 
cache system initiated writes to RAM will occur while read DMA is in 
progress. Normally, invalidate is preferred because it is faster. 
However, for programming simplicity some drivers choose to flush before 
both read or write DMA is started. If that is the case, it is good 
practice to note that choice in a comment.


On write, after DMA is completed, no additional cache actions are required.

On read, after DMA is completed, the cache corresponding to the DMA 
buffer addresses MUST be invalidated. This is necessary to ensure that 
stale data in the cache will not be used instead of the new read data in 
RAM. If one elected to invalidate the cache before the read DMA started, 
one may wonder why a second invalidate is necessary.  Since the buffer 
is not allowed to be referenced by the program in the interim, the cache 
should not contain any data from the DMA buffer area. The reason is that 
modern CPUs, may load data from the DMA buffer into cache while DMA is 
in progress. This can be

Re: [U-Boot] [PATCH 1/3] spl: ram: Do not memcpy() identical buffers

2019-05-28 Thread J. William Campbell

On 5/28/2019 4:19 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 05:24:34AM +0200, Marek Vasut wrote:

On 5/28/19 5:04 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 04:44:52AM +0200, Marek Vasut wrote:

On 5/28/19 4:42 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 04:07:44AM +0200, Marek Vasut wrote:

On 5/28/19 4:06 AM, Tom Rini wrote:

On Tue, May 28, 2019 at 03:49:13AM +0200, Marek Vasut wrote:


If the source and destination buffer address is identical, there is
no need to memcpy() the content. Skip the memcpy() in such a case.

Signed-off-by: Marek Vasut 
Cc: Michal Simek 
Cc: Tom Rini 

Shouldn't memcpy catch that itself?


memcpy(3) says
The memcpy() function copies n bytes from memory area src to
memory area dest.  The memory areas must not overlap.  Use memmove(3) if
the memory areas do overlap.

OK, and shouldn't memcpy optimize that case?  Does it usually?

As the manpage says "The memory areas must not overlap." , I would
expect it does not have to ?

I guess I'm not being clear enough, sorry.  Go look at how this is
implemented in a few places please and report back to us.  Someone else,
or many someone else, have probably already figured out if optimizing
this case in general, in memcpy, is a good idea or not.  Thanks!

If even [1] says the behavior is undefined, then what does it matter
whether some implementation added an optimization for such undefined
behavior? We cannot depend on it, can we ?

[1] https://pubs.opengroup.org/onlinepubs/009695399/functions/memcpy.html

Yes, yes it would be worth seeing if other groups that have looked into
the performance impact here have also decided to optimize this undefined
behavior or not, thanks.


I don't think this is an optimization question, really. Calling memcpy 
with overlapping areas is an error. The result is explicitly undefined. 
It may well be that all the existing implementations effectively do 
nothing, either by explicit check or by actually copying the data over 
itself. However, to rely on that behavior is asking for trouble down the 
line. Undefined behavior is exactly that. Don't do it.





___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot



___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot


Re: [U-Boot] [PATCH] libfdt: Initialize the stack variable

2017-09-04 Thread J. William Campbell

On 9/4/2017 8:41 PM, Chee, Tien Fong wrote:

On Rab, 2017-08-30 at 06:31 -0700, J. William Campbell wrote:

On 8/29/2017 10:15 PM, tien.fong.c...@intel.com wrote:

From: Tien Fong Chee 

Report Coverity log:
The code uses a variable that has not
been initialized, leading to unpredictable
or unintended results.

Reported-by: Coverity (CID: 60519)
Signed-off-by: Tien Fong Chee 
---
   lib/libfdt/fdt_wip.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/libfdt/fdt_wip.c b/lib/libfdt/fdt_wip.c
index 45fb964..01adad0 100644
--- a/lib/libfdt/fdt_wip.c
+++ b/lib/libfdt/fdt_wip.c
@@ -115,7 +115,7 @@ int fdt_find_regions(const void *fdt, char *
const inc[], int inc_count,
     struct fdt_region region[], int max_regions,
     char *path, int path_len, int
add_string_tab)
   {
-   int stack[FDT_MAX_DEPTH];
+   int stack[FDT_MAX_DEPTH] = { 0 };

It seems to me that one of three things must be true. 1) Coverity
can't
correctly analyze the code and stack[] is not used in an un-
initialized
manner, 2) stack is used in an un-initialized manner but the result
is
not used in that case and is a "don't care" or 3) there is a bug in
the
code. It seems that just initializing the variable to 0 is a "Bad
Idea(tm)". If it is case 1 or 2, there should be a Coverity code
annotation comment added to that effect, and if it is case 3, it
should
be fixed. Initializing this variable makes the binary larger to no
purpose unless there is a bug already.

Best Regards,
J. William Campbell

Yeah, i agree with you, state machine design should ensure stack[] is
not used in a uninitialized manner. Hence, i need input from whom
familiar with this function, whether this warning fall in anyone of
these conditions. If we just direct init the stack[], and this solution
will make extra 128 bytes in binary, but having variable with default
value is also good pratice from software quality perspective.
Yes, if the default value has a rationale. On the surface, there is no 
way to know that 0 is a "good" initial value. There may be a reason that 
it is, but if we don't know for sure, it is just a "random" number. I 
hope whoever wrote this will speak up and say why the variable is never 
used before it is initialized. Thank you for being so diligent.


Best Regards,
Bill Campbell



    char *end;
    int nextoffset = 0;
    uint32_t tag;



___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot


Re: [U-Boot] [PATCH] libfdt: Initialize the stack variable

2017-08-30 Thread J. William Campbell

On 8/29/2017 10:15 PM, tien.fong.c...@intel.com wrote:

From: Tien Fong Chee 

Report Coverity log:
The code uses a variable that has not
been initialized, leading to unpredictable
or unintended results.

Reported-by: Coverity (CID: 60519)
Signed-off-by: Tien Fong Chee 
---
  lib/libfdt/fdt_wip.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/libfdt/fdt_wip.c b/lib/libfdt/fdt_wip.c
index 45fb964..01adad0 100644
--- a/lib/libfdt/fdt_wip.c
+++ b/lib/libfdt/fdt_wip.c
@@ -115,7 +115,7 @@ int fdt_find_regions(const void *fdt, char * const inc[], 
int inc_count,
 struct fdt_region region[], int max_regions,
 char *path, int path_len, int add_string_tab)
  {
-   int stack[FDT_MAX_DEPTH];
+   int stack[FDT_MAX_DEPTH] = { 0 };
It seems to me that one of three things must be true. 1) Coverity can't 
correctly analyze the code and stack[] is not used in an un-initialized 
manner, 2) stack is used in an un-initialized manner but the result is 
not used in that case and is a "don't care" or 3) there is a bug in the 
code. It seems that just initializing the variable to 0 is a "Bad 
Idea(tm)". If it is case 1 or 2, there should be a Coverity code 
annotation comment added to that effect, and if it is case 3, it should 
be fixed. Initializing this variable makes the binary larger to no 
purpose unless there is a bug already.


Best Regards,
J. William Campbell

char *end;
int nextoffset = 0;
uint32_t tag;



___
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot


Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and normal U-Boot

2017-07-03 Thread J. William Campbell

On 7/3/2017 8:12 PM, Peng Fan wrote:



-Original Message-
From: Tom Rini [mailto:tr...@konsulko.com]
Sent: Tuesday, July 04, 2017 10:47 AM
To: Peng Fan 
Cc: Simon Glass ; Philipp Tomsich
; albert.u.b...@aribaud.net; u-
b...@lists.denx.de
Subject: Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for SPL and
normal U-Boot

On Tue, Jul 04, 2017 at 01:09:36AM +, Peng Fan wrote:

Hi Tom,


-Original Message-
From: Tom Rini [mailto:tr...@konsulko.com]
Sent: Tuesday, July 04, 2017 12:17 AM
To: Peng Fan ; Simon Glass ;
Philipp Tomsich 
Cc: albert.u.b...@aribaud.net; u-boot@lists.denx.de
Subject: Re: [U-Boot] [PATCH 2/2] arm: config: enforce -fno-pic for
SPL and normal U-Boot

On Mon, Jul 03, 2017 at 09:14:08PM +0800, Peng Fan wrote:


If not pass -fno-pic to toolchains, some toolchains may generate
.got and .got.plt sections, but when generate binaries, we did not
take .got and .got.plt into consideration, then SPL or normal
U-Boot boot failure because image corrupted.

Need to pass -fno-pic to disable generating .got and .got.plt
sections.

Signed-off-by: Peng Fan 
Cc: Albert Aribaud 
Cc: Tom Rini 
---
  arch/arm/config.mk | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm/config.mk b/arch/arm/config.mk index
1a9..66ae403 100644
--- a/arch/arm/config.mk
+++ b/arch/arm/config.mk
@@ -130,9 +130,10 @@ ALL-y += checkarmreloc  # instruction.
Relocation is not supported for that case, so disable  # such
usage by requiring word relocations.
  PLATFORM_CPPFLAGS += $(call cc-option, -mword-relocations)
-PLATFORM_CPPFLAGS += $(call cc-option, -fno-pic)  endif

+PLATFORM_CPPFLAGS += $(call cc-option, -fno-pic)
+
  # limit ourselves to the sections we want in the .bin.
  ifdef CONFIG_ARM64
  OBJCOPYFLAGS += -j .text -j .secure_text -j .secure_data -j
.rodata -j .data \

Something is "up" here and I need you to dig harder and perhaps see
if we're missing something in the linker scripts?  The very next line of

context here is:

 -j .u_boot_list -j .rela.dyn -j .got -j .got.plt

Meaning that we intentionally copy .got / .got.plt into the
resulting binary.  And I see that we took in 397d7d5a1be1 from you
back in 2016 saying that we needed this in SPL.  But 5a942a152776
put the got/got.plt sections (for 32bit
ARM) in intentionally as some relocations do need it.  And in
4b0d506ed3b4 Philipp seems to have seen the same problem you have,
but fixed it with adding got/got.plt to the sections list we copy in (the above

hunk of context).

If pass -fno-pic to compiler, there will be no .got and .got.plt sections.
The .got and .got.plt is usually used for dynamic link, such as linux "*.so" 
file.
We need to pass -fno-pic to compiler to remove .got and .got.plt sections.

"Usually" isn't the same as "always" or "only".  And this reminded me that we

 From https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html#Code-Gen-Options,
when -fpic is used for ARM, there will be GOT. The dynamic loader resolves the 
GOT entries
when the program starts (the dynamic loader is not part of GCC; it is part of 
the
operating system).


had to deal with e391b1e64b0b because yes, we're not making a shared library
but we do have position independent code.  So, in the case of SPL, since we

For position independent code, but not making a shared library, we could use 
-fpie.


can get away with -fno-pic (and get some space
savings) that's just not true of U-Boot itself.  We're enforcing -fpic on other
architectures, so what exactly is going on with what you're seeing?  Where

If not passing -fno-pic to gcc, the android toolchain will generate .got and 
.got.plt
section. I think these sections are for dynamic link, not for static link.


would we need to be taking .got/.got.plt into consideration and why would it
be a bad thing to do so?  Thanks!

We could keep .got and .got.plt, but personally I prefer use -fno-pic.
I think fpic is for shared library and fpie is for position independt code for 
static link.

Thanks,
Peng.
To clarify things, as I understand it, -pie produces the same code and 
segments as -pic does, plus a loader to load the code correctly into 
memory. u-boot probably doesn't want that loader. The loader is required 
to make the executable work under the OS, but stand-alone probably 
doesn't work right in general.  For u-boot, the advantage of -pic code 
is that there are no TEXTREL relocations in the binary. The code is 
generated to take care of bss (and code segment when required) 
references that require relocation. Things in the .got may need 
relocating by where bss ends up, but not the code. So relocation is 
short and simple.  Non pic code requires relocation processing that may 
be more complex than pic code, and requires a larger relocation table. 
If you are loading the code and bss at the location they were linked 
for, neither type requires relocation. If the code and or bss is located 
at a different address, -pic requires a little relocation and no 

Re: [U-Boot] [PATCH v2] NS16550: buffer reads

2011-10-24 Thread J. William Campbell
On 10/24/2011 3:22 PM, Graeme Russ wrote:
> Hi Wolfgang,
>
> On Tue, Oct 25, 2011 at 8:59 AM, Wolfgang Denk  wrote:
>> Dear Graeme Russ,
>>
>> In message<4ea5cd39.2070...@gmail.com>  you wrote:
 Assume we have a simple device with a small Rx FIFO - say, 8 bytes
 only. Guess what the chances are that we will overrun this FIFO (and
 then lose characters) before the other side stops sending?
>>> Should be zero (if you disable the hardware Tx buffer) unless you have an
>>> incredibly slow Rx routine (in which case, your baud rate is too high). And
>>> even with a Tx buffer in play the solution is fairly simple - After sending
>>> an XOFF, flush the Rx buffer (and remote Tx buffer) into an internal buffer
>>> until tstc() returns false - use this buffer for getc() until it is empty.
>> Things are becoming more and more complicated, it seems.
>>
>> If you ask me: forget about all this, and stick with single line
>> input.  Do the processing on the sender's side (always wait for a
>> prompt before sending the next line).  This is MUCH easier.
> Oh, I wholehearedly agree that if we impose such a limitation, life will
> be much easier for us all ;)
>
> But realistically, the solution is actually very trivial and comprises
> just two parts:
>
> 1a) When the command line interpreter (which is just the really simple
>  function detecting new-line or command delimiter) dispatches a
>  command string to the command interpreter (which looks up the
>  command table and if it finds a command runs it) send an XOFF
> 1b) When the command interpreter returns control back to the command
>  line interpreter, send an XON
> 1c) If an XOFF has been recently sent (i.e. no XON since) send an XON
>  in getc()
>
> 2) After sending XOFF, flush the transmitter's buffer into a small
> (16 bytes chould sufffice) buffer. Grab characters from this buffer
> before grabing from the UART when getc() is called
Hi All,
   The problem with all this is that the transmitting CPU may send 
"several" characters after the XOFF is transmitted, where several will 
be at least two if the transmitting CPU has no output fifo, up to 
possibly the output fifo length + 2. Furthermore, this "overage" can 
theoretically increase with each x-on x-off sequence, as each command 
may be, in theory,  shorter than the output fifo length. So the input 
buffering is needed in all cases, i.e. the receive fifo must be "run 
dry" before executing a command. If the receive fifo is shorter than the 
transmit fifo +2, you are doomed. If the receive fifo is larger than 
this, you are ok, the transmitting CPU will stop before overrunning the 
input fifo and data being lost. However, a 16 byte buffer will not be 
enough if there are lots of "short" commands in the input stream.

Best Regards,
Bill Campbell
>
> The second part should not even be necessary if you have a half-sane
> serial port with a half-decent hardware buffer.
>
> It really would not be that hard (maybe a few dozen lines of code at most),
> would work for all UARTS and should (hopefully) resolve all 'dropped
> character' issues.
>
> Now 1c) makes the bold assumption that any command which calls getc()
> 'knows what it is doing' and will consume all input characters at a fast
> enough rate AND will not invoke a delay between receiving the last
> character and returning back to the command line interpreter. If the
> command knows it received characters and knows it will invoke a delay (some
> kind of post-processing) then in order to not lose characters, the command
> MUST issue it's own XOFF - nothing else aside from an interrupt driven
> serial driver will solve the problem in this case.
>
> Regards,
>
> Graeme
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Timer API (again!)

2011-09-24 Thread J. William Campbell
On 9/16/2011 4:53 AM, Graeme Russ wrote:
 > Hi All,
 >
 > Well, here we are again, a Timer API discussion
 >
 > All things considered, I don't think the Linux approach is right for 
U-Boot
 > - It is designed to cater for way more use-cases than U-Boot will 
ever need
 > to deal with (time queues and call-backs in particular)
Hi Graeme,
Glad you are tackling this again.
 >
 > To summarize, in a nutshell, what _I_ think U-Boot need from a Timer API
 >
 >   1. A reliable udelay() available as early as possible that is OK to 
delay
 >  longer than requested, but not shorter - Accuracy is not implied or
 >  assumed
I think you don't really mean this. SOME attempt at accuracy is to be 
expected. Not perfect is allowed, but to be totally bogus is not 
allowed. That is worse than nothing. Longer by 20% or so is possibly ok, 
2% isn't.
 >   2. A method of obtaining a number of 'time intervals' which have 
elapsed
 >  since some random epoch (more info below)
 >   3. A nice way of making A->B time checks simple within the code
 >
 > OK, some details:
 >
 > 1. I'm starting to thing this should be a function pointer or a flag 
in gd.
 > Why oh why would you do such a thing I hear you ask... udelay() is 
needed
 > _EARLY_ - It may well be needed for SDRAM or other hardware 
initialisation
 > prior to any reliable timer sub-system being available. But early 
udelay()
 > is often inaccurate and when the timer sub-system gets fully 
initialised,
 > it can be used for an accurate udelay(). x86 used to have an global data
 > flag that got set when the timer-subsystem got initialised. If the 
flag was
 > not set, udelay() would use one implementation, but if it was set, 
udelay()
 > would use a more accurate implementation. In the case of the eNET 
board, it
 > had an FPGA implementation of a microsecond timer which was even more
 > accurate that the CPU, so it had it's own implementation that should 
have
 > duplicated the 'if (gd->flags&  TIMER_INIT)' but never did (this was OK
 > because udelay() was never needed before then)
 >
 > I think a function pointer in gd would be a much neater way of 
handling the
 > fact that more accurate (and efficient) implementations of udelay() may
 > present themselves throughout the boot process
I think this won't be popular, but I am not against it on the face of it.
 >
 > An other option would be to have the gd flag and make udelay() a lib
 > function which performs the test - If the arch timer has better than 1us
 > resolution it can set the flag and udelay() will use the timer API
 >
 > 2a (random epoch) - The timer sub-system should not rely on a particular
 > epoch (1970, 1901, 0, 1, since boot, since timer init, etc...) - By 
using
 > the full range of an unsigned variable, the epoch does not matter
 >
 > 2b (time interval) - We need to pick a base resolution for the timer 
API -
 > Linux uses nano-seconds and I believe this is a good choice. Big Fat 
Note:
 > The underlying hardware DOES NOT need to have nano-second resolution. 
The
 > only issue with nano-seconds is that it mandates a 64-bit timer - A few
 > people are against this idea - lets discuss. A 32-bit microsecond timer
 > provides ~4300 seconds (~1.2 hours) which _might_ be long enough, but
 > stand-alone applications doing burn-in tests may need longer. Going
 > milli-seconds means we cannot piggy-back udelay() on the timer API.
Well, as you probably know, I am "one of those" against a 64 bit timer 
API to figure out if my disk is ready, especially if my hardware 
timebase does not support nanosecond resolution anyway. I think having a 
time base in milliseconds like we have now is perfectly adequate for 
"sane" bootloader usage. The requirement for the timer API would then be 
to provide the "current time" in millisecond resolution, and the time in 
as close to microsecond resolution as possible. This would be a "helper" 
function for udelay.  Udelay could then always be implemented using 
microseconds to ticks and getting the subtracting delta ticks from the 
target value.
If you can provide the first function, you can provide the second. They 
can usually be the same code with different constants involved. The gd-> 
call may actually be inside udelay, with the timer init replacing the 
pointer used by udelay to get the current time in "microseconds". The 
helper function for udelay should not be called by other parts of 
u-boot, to prevent abuse of this capability. Note that a 32 bit 
millisecond delay is like 49 days, so that should be plenty. I think it 
must be kept in mind that u-boot is used on lots of "small" systems. 
Using up lots of resources for 64 bit arithmetic that we don't need 
elsewhere is , IMHO, not a good idea. If it really did something that we 
need, all well and good, I am for it. But in this case, I don't think it 
really helps us any. But it does make things harder for some CPUs, and 
that is not a good thing.
 >
 > My preference is a 64-bit nano-second timer with ud

Re: [U-Boot] [PATCH v2] ARM926ejs: Add routines to invalidate D-Cache

2011-08-05 Thread J. William Campbell
On 8/5/2011 4:51 AM, Aneesh V wrote:
> Hi Albert,
>
> On Friday 05 August 2011 04:33 PM, Albert ARIBAUD wrote:
>> Hi Aneesh,
>>
>> On 05/08/2011 12:47, Aneesh V wrote:
>>> Hi Eric,
>>>
>>> On Friday 05 August 2011 04:03 PM, Hong Xu wrote:
 Hi Aneesh,
>>> [snip ..]
> IMHO, Hong's approach is correct. If the buffer that is invalidated is
> not aligned to cache-line, one cache-line at the respective boundary
> may have to be flushed to make sure the invalidation doesn't affect
> somebody else's memory.
>
> The solution is for drivers to ensure that any buffer that needs to be
> invalidated is aligned to cache-line boundary at both ends. The above
> approach puts this onus on the driver. I have documented the alignment
> requirement in my recent patch series for fixing arm cache problems.
 I have not noticed the patch series. ;-)
 If we put the alignment burden to the driver, I'm afraid many drivers
 which make use of something like a DMA controller have to modify the
 code heavily. This sounds not good. :)
>>> We have a fundamental problem when it comes to invalidating an
>>> un-aligned buffer. Either you flush the boundary lines and corrupt your
>>> buffer at boundaries OR you invalidate without flushing and corrupt
>>> memory around your buffer. Both are not good! The only real solution is
>>> to have aligned buffers, if you want to have D-cache enabled and do DMA
>>> at the same time.
>> Plus, there should not be *heavy* modifications; DMA engines tend to use
>> essentially two types of memory-resident objects: data buffers and
>> buffer descriptors. There's only a small handful of places in the driver
>> code to look at to find where these objects are allocated and how.
>>
>> So I stand by my opinion: since the cache invalidation routine should
>> only be called with cache-aligned objects, there is no requirement to
>> flush the first (resp. last) cache line in case of unaligned start
>> (resp.stop), and I don't want cache operations performed when they are
>> not required.
> IMHO, flushing is better, because the person who commits the
> mistake of invalidating the un-aligned buffer is the one who is
> affected and is likely to fix the issue soon. If we didn't flush, the
> resulting corruption will cause totally random errors that will be hard
> to debug. Doing an extra flush operation for a maximum of 2 lines
> doesn't cost us anything. This is the approach followed by the kernel
> too.

Hi All,
 I am interested in this last statement in particular, that 
Linux allows non-cache aligned buffers for DMA. In a previous discussion 
series, we demonstrated why it was IMPOSSIBLE for a non-cache aligned 
DMA buffer to work in conjunction with a "write back" type cache. To be 
clear, by write-back, we mean a cache system that has a single dirty bit 
per cache line, and that if there are stores to any addresses in that 
line, the ENTIRE LINE will be written back into memory, not just the 
changed data. I also seem to recall that the ARM V7 caches were defined 
as write back, but I am not an ARM person so I don't know for sure what 
kind of cache we are talking about here. If it is write-back, there is 
only one possible solution that always works. Write-through will work 
with un-aligned buffers if the correct flushes and invalidates are 
present. In that case, buffer alignment is not so important. However, if 
the same driver is to be used in both cases, it must use cache-aligned 
buffers only.

Best Regards,
Bill Campbell


> br,
> Aneesh
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-23 Thread J. William Campbell
On 7/23/2011 6:04 AM, Albert ARIBAUD wrote:
> Le 21/07/2011 08:48, David Jander a écrit :
>
>>> However, it is still correct that copying from an non-cached area is
>>> slower than from cached areas, because of burst reads vs. individual
>>> reads. However, I doubt that the u-boot user can tell the 
>>> difference, as
>>> the network latency will far exceed the difference in copy time.
>
> That's assuming cache is only for networking. There can be DMA engines 
> in a lot of other peripherals which do not have the same latency as 
> network (and then, even for networking, TFTP can be done from a very 
> nearby server, possibly even on the same Ethernet segment).
Hi All,
 Yes, there are other uses of DMA. On a network, unless you have 
a Gigabit network, your memory access speed is at least an order of 
magnitude faster than the network, probably more. Plus, there is a 
latency due to sending the ack and request for the next record that 
undoubtedly swamps out any reduction in memory speed due to the single 
copy that takes place. In the case of other devices, like for example 
disks, the percentage effect is probably greater, but since these 
devices are so fast anyway that the human-perceived speed reduction is 
essentially nil. If we were talking about a CPU running Linux and doing 
all kinds of I/O all day long, the reduction in throughput  performance 
might be 10% and that might matter. In a boot loader that does I/O 
mostly to read in a program to replace itself, I would argue that nobody 
will notice the difference between cached and un-cached buffers. 
Counter-examples welcome however.
>
>>> The
>>> question is, which is easier to do, and that is probably a matter of
>>> opinion. However, it is safe to say that so far a cached solution has
>>> eluded us. That may be changing, but it would still be nice to know how
>>> to allocate a section of un-cached RAM in the ARM processor, in so far
>>> as the question has a single answer! That would allow easy portability
>>> of drivers that do not know about caches, of which there seems to be 
>>> many.
>
> That is one approach, which I think prevents cache from being used 
> beyond caching pure CPU-used DRAM.
You are certainly correct there. However, I think the pure CPU-used ram 
case is the one that matters most. Uncompressing and checksumming of 
input data are typical u-boot functions that take significant time. The 
performance increase due to cache hits in these cases is huge, and 
easily perceptible by the user.
>
>> I agree. Unfortunately, my time is up for now, and I can't go on with 
>> trying
>> to fix this driver. Maybe I'll pick up after my vacation.
>> As for now I settled for the ugly solution of keeping dcache disabled 
>> while
>> ethernet is being used :-(
>
> Make sure you flush before disabling. :)
>
>> IMHO, doing cache maintenance all over the driver is not an easy or nice
>> solution. Implementing a non-cached memory pool in the MMU and a 
>> corresponding
>> dma_malloc() sounds like much more universally applicable to any driver.
>
> I think cache maintenance is feasible if one makes sure the cached 
> areas used by the driver are properly aligned, which simplifies things 
> a lot: you don't have to care for flush-invalidate or just-in-time 
> invalidate, you just have to flush before sending and invalidate 
> before reading.
I do agree it can be done. However, most (I think?) of the CPUs to which 
u-boot have been ported have cache-coherent DMA. As a result, cache 
issues for  these CPUs are not addressed in the driver at all. Often 
this means that cache support is done after the fact by somebody other 
than the original author who may not totally understand the original 
driver. If DMA buffers were always allocated from cache-coherent memory, 
either because the memory is un-cached or because the CPU is DMA cache 
coherent, no changes would be necessary to get the driver working 
correctly.  If performance ever became an issue in the un-cached case, 
then more work would be required, but in most cases, I expect nobody 
will notice.

Best Regards,
Bill Campbell
>
>> Best regards,
>
> Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-20 Thread J. William Campbell
On 7/20/2011 7:35 AM, Albert ARIBAUD wrote:
> Le 20/07/2011 16:01, J. William Campbell a écrit :
>> On 7/20/2011 6:02 AM, Albert ARIBAUD wrote:
>>> Le 19/07/2011 22:11, J. William Campbell a écrit :
>>>
>>>> If this is true, then it means that the cache is of type write-back 
>>>> (as
>>>> opposed to write-thru). From a (very brief) look at the arm7 
>>>> manuals, it
>>>> appears that both types of cache may be present in the cpu. Do you 
>>>> know
>>>> how this operates?
>>> Usually, copyback (rather than writeback) and writethough are modes of
>>> operation, not cache types.
>> Hi Albert,
>> One some CPUs both cache modes are available. On many other CPUs (I
>> would guess most), you have one fixed mode available, but not both. I
>> have always seen the two modes described as write-back and
>> write-through, but I am sure we are talking about the same things.
>
> We are. Copy-back is another name for write-back, not used by ARM but 
> by some others.
>
>> The
>> examples that have both modes that I am familiar with have the mode as a
>> "global" setting. It is not controlled by bits in the TLB or anything
>> like that. How does it work on ARM? Is it fixed, globally, globally
>> controlled, or controlled by memory management?
>
> Well, it's a bit complicated, because it depends on the architecture 
> version *and* implementation -- ARM themselves do not mandate things, 
> and it is up to the SoC designer to specify what cache they want and 
> what mode it supports, both at L1 and L2, in their specific instance 
> of ARM cores. And yes, you can have memory areas that are write-back 
> and others that are write-through in the same system.
>
>> If it is controlled by memory management, it looks to me like lots of
>> problems could be avoided by operating with input type buffers set as
>> write-through. One probably isn't going to be writing to input buffers
>> much under program control anyway, so the performance loss should be
>> minimal. This gets rid of the alignment restrictions on these buffers
>> but not the invalidate/flush requirements.
>
> There's not much you can do about alignment issues except align to 
> cache line boundaries.
>
>> However, if memory management
>> is required to set the cache mode, it might be best to operate with the
>> buffers and descriptors un-cached. That gets rid of the flush/invalidate
>> requirement at the expense of slowing down copying from read buffers.
>
> That makes 'best' a subjective choice, doesn't it? :)
Hi All,
 Yes,it probably depends on the usage.
>
>> Probably a reasonable price to pay for the associated simplicity.
>
> Others would say that spending some time setting up alignments and 
> flushes and invalidates is a reasonable price to pay for increased 
> performance... That's an open debate where no solution is The Right 
> One(tm).
>
> For instance, consider the TFTP image reading. People would like the 
> image to end up in cached memory because we'll do some checksumming on 
> it before we give it control, and having it cached makes this step 
> quite faster; but we'll lose that if we put it in non-cached memory 
> because it comes through the Ethernet controller's DMA; and it would 
> be worse to receive packets in non-cached memory only to move their 
> contents into cached memory later on.
>
> I think properly aligning descriptors and buffers is enough to avoid 
> the mixed flush/invalidate line issue, and wisely putting instruction 
> barriers should be enough to get the added performance of cache 
> without too much of the hassle of memory management.
I am pretty sure that all the drivers read the input data into 
intermediate buffers in all cases. There is no practical way to be sure 
the next packet received is the "right one" for the tftp. Plus there are 
headers involved, and furthermore there is no way to ensure that a tftp 
destination is located on a sector boundary. In short, you are going to 
copy from an input buffer to a destination.
However, it is still correct that copying from an non-cached area is 
slower than from cached areas, because of burst reads vs. individual 
reads. However, I doubt that the u-boot user can tell the difference, as 
the network latency will far exceed the difference in copy time. The 
question is, which is easier to do, and that is probably a matter of 
opinion. However, it is safe to say that so far a cached solution has 
eluded us. That may be changing, but it would still be nice to know how 
to allocate a section of un-cached RAM in the ARM processor, in so far 
as the question has a single answer! That would allow easy portability 
of drivers that do not know about caches, of which there seems to be many.

Best Regards,
Bill Campbell
>
>> Best Regards,
>> Bill Campbell
>
> Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-19 Thread J. William Campbell
On 7/19/2011 11:14 AM, Anton Staaf wrote:
> On Tue, Jul 19, 2011 at 7:36 AM, J. William Campbell
>   wrote:
>> On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
>>> Le 19/07/2011 10:43, Aneesh V a écrit :
>>>
>>>>>> You would have to flush (before sending packets / starting external
>>>>>> memory-to-device DMA) and invalidate (before reading received packets /
>>>>>> after external device-to-memory DMA is done); using MMU and mapping
>>>>>> cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
>>>>>> the xmit/receive buffers and descriptors.
>>>>> So, you say actually what I did while exploring the problem would have
>>>>> been a
>>>>> correct way of solving this problem?
>>>>>
>>>>> Like this:
>>>>>
>>>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>>>> This is what is needed assuming the below is initiating a memory to
>>>> peripheral DMA. Is your buffer only 4 bytes long?
>>> Generally:
>>>
>>> - for sending data through a device that has its own, external, DMA
>>> engine, you'll obviously need to flush the data buffer(s) but also any
>>> DMA descriptors used by the engine, before you start the engine;
>>>
>>> - for rceiving, if you have to set up receive descriptors, you must
>>> flush that before telling the device to enter receive mode (so that the
>>> device reads the descriptors as you wrote them), and you should
>>> invalidate the receive buffers at the latest when the device signals
>>> that data has been received,
>> Hi All,
>>
>>>or preferably long before (at the same time
>>> you flushed the read descriptor, so that cache-related actions are
>>> grouped in the same place in the code).
>> I think this last statement is incorrect. You should invalidate the
>> cache for the receive buffers just before you intend to reference them.
>> If you do it right after receive mode is entered, subsequent access to
>> items NEAR the receive buffer may reload the cache with receive buffer
>> data before the dma is done, re-creating the problem you are trying to
>> avoid. Also, I don't know if ARM cache is write-back or write -thru, but
>> if it is write-back, the only way to avoid problems is to allocate the
>> receive buffers on cache line boundaries, so no "nearby" writes can
>> cause something in the DMA buffer to be corrupted. If all receive
>> buffers are allocated on cache-line boundaries (both start and end of
>> each buffer), you can invalidate the cache "early" under the assumption
>> that there will be no read accesses to the read buffers themselves until
>> after DMA is complete. IMHO it is better, even in this case., to
>> invalidate cache after dma is done but before referencing the read data.
> This is a critical observation, and one I was going to make if I had
> made it to the end of the thread and no one had already pointed it
> out.  In fact, there is no way with the current implementation (that I
> can see) of the v7_dcache_inval_range function to correctly implement
> a cache enabled DMA driven driver without aligning buffers to cache
> line sizes.  Below is the commit message from a recent patch I made
> locally to fix the Tegra MMC driver.  I wanted to start a discussion
> on the list about forcing all buffers to be aligned to cache line
> sizes.  The problem is that many buffers are stack allocated or part
> of structs that are of unknown alignment.
>
>  mmc: Tegra2: Enable dcache support by bouncing unaligned requests.
>
>  Dcache support was disabled due to dcache/dma interactions with
>  unaligned buffers.  When an unaligned buffer is used for DMA the
>  first and last few bytes of the buffer will be clobbered by the
>  dcache invalidate call that is required to make the contents of
>  the buffer visible to the CPU post DMA.  The reason that these
>  bytes are clobbered is that the dcache invalidate code (which is
>  the same as the linux kernels) checks for unaligned invalidates
>  and first flushes the cache lines that are being partially
>  invalidated.  This is required because the cache lines may be
>  shared with other variables, and may be dirty.
Hi Anton,
>This flush however
>  writes over the first and last few bytes of the unaligned buffer
>  with whatever happened to be in the cache.
If this is true, then it means that the cache is of type write-back (as 
opposed to write-thru). From a (very brief) look at the arm

Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

2011-07-19 Thread J. William Campbell
On 7/19/2011 2:05 AM, Albert ARIBAUD wrote:
> Le 19/07/2011 10:43, Aneesh V a écrit :
>
 You would have to flush (before sending packets / starting external
 memory-to-device DMA) and invalidate (before reading received packets /
 after external device-to-memory DMA is done); using MMU and mapping
 cached/non-cached areas is IMO overkill, and will hurt CPU accesses to
 the xmit/receive buffers and descriptors.
>>> So, you say actually what I did while exploring the problem would have
>>> been a
>>> correct way of solving this problem?
>>>
>>> Like this:
>>>
>>> 587 flush_cache(&fec->tbd_base[fec->tbd_index], 4);
>> This is what is needed assuming the below is initiating a memory to
>> peripheral DMA. Is your buffer only 4 bytes long?
> Generally:
>
> - for sending data through a device that has its own, external, DMA
> engine, you'll obviously need to flush the data buffer(s) but also any
> DMA descriptors used by the engine, before you start the engine;
>
> - for rceiving, if you have to set up receive descriptors, you must
> flush that before telling the device to enter receive mode (so that the
> device reads the descriptors as you wrote them), and you should
> invalidate the receive buffers at the latest when the device signals
> that data has been received,
Hi All,

>   or preferably long before (at the same time
> you flushed the read descriptor, so that cache-related actions are
> grouped in the same place in the code).
I think this last statement is incorrect. You should invalidate the 
cache for the receive buffers just before you intend to reference them. 
If you do it right after receive mode is entered, subsequent access to 
items NEAR the receive buffer may reload the cache with receive buffer 
data before the dma is done, re-creating the problem you are trying to 
avoid. Also, I don't know if ARM cache is write-back or write -thru, but 
if it is write-back, the only way to avoid problems is to allocate the 
receive buffers on cache line boundaries, so no "nearby" writes can 
cause something in the DMA buffer to be corrupted. If all receive 
buffers are allocated on cache-line boundaries (both start and end of 
each buffer), you can invalidate the cache "early" under the assumption 
that there will be no read accesses to the read buffers themselves until 
after DMA is complete. IMHO it is better, even in this case., to 
invalidate cache after dma is done but before referencing the read data.

Best Regards,
Bill Campbell

>
> Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-15 Thread J. William Campbell
On 7/15/2011 11:34 AM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4e208227.6010...@comcast.net>  you wrote:
>> If the I2C protocol must be available before interrupts are
>> available, then udelay must be used. In the above examples, there are
>> some loops in i2c and spi that appear to be waiting a full second. I
>> assume they are using udelay because the get_timer feature is not yet
>> available to them. I also assume that the example in common/usb.c uses
> No, this is usually not the case.  This long delay is the error case,
> which most probably will never happen.  For the normal case, you want
> a tight spinning loop that introduces as little additional delay as
> possible.
Hi All,
 I understand why one would want to use udelay in those cases 
where there are other conditions that should cause the loop to exit. 
However, in the 16 examples you cited,  7 of those usages are for delays 
of multiple milliseconds with no tests or escape conditions included. I 
therefore pointed out that I assume that it was being done that way (as 
opposed to using  get_timer) because this code would be executed 
"early", before get_timer was available.
>> True, although I expect you will find the statement "on all the other
>> architectures" to be false. Many other architectures, yes, all, no.
>> These other architectures just don't have spi or I2C yet, or if they do,
>> they don't use it "early".
> Are you aware of another architecture that cannot provide
> sub-microsecond timer resolution?  Which is it?

  No, but I am aware of some that do not provide udelay at all until 
after interrupts are enabled and the get_timer interface is available. 
Several ARM implementations have this property. There are one or two 
others (SPARC) that are similar as I recall. There were also some that 
called reset_timer, which would mess up nested timing loops, but this 
has already been fixed. There also quite a few udelay implementations 
that do not actually provide 1 µs resolution.  For instance, the first 
thing that happens in arm720t/interrupts.c  __udelay is that the input 
microseconds is divided by 1000 and the remainder discarded. A 
udelay(10) won't work so well here. The lh7a40x/timer.c uses a 508 kHz 
clock, so delays are accurate to at best 2 µs. There are other examples. 
For instance, I am pretty sure some u-boot based CPUs use a 32kHz clock 
as the timebase for udelay. The resolution in that case is 30 µs, which 
would be a problem for your example code. I am sure further digging will 
reveal other resolution issues.

These udelay issues would need fixing if all u-boot options are to work 
"out of the box". In some cases of low resolution, the fix is not easy, 
or maybe not even possible. However, this problem has not prevented the 
use of u-boot on these CPUs. It does mean that some things do not work 
"out of the box", and these non-compliant udelays should be fixed 
wherever practical. In any case, this shouldn't stop progress on 
get_timer and the delay_at_least (or whatever it ends up being called) 
function being adopted. That is independent of the udelay issues. udelay 
must be implemented in a "compliant" manner, which may or may not use 
the same hardware as get_timer() does. It certainly can't use the same 
code (although some ARM version in effect do, resulting in a 
non-compliant, low resolution  udelay).

It might be interesting to utilize something like "jiffies" for the loop 
delays that are looking for error exits. In those cases, the error case 
should be infrequent. It may not matter that the total delay is quite a 
bit longer than the user requested, as long as it isn't vastly shorter. 
If a 10 µsec delay became a 30 µsec delay inside the loop examples where 
we expect to exit due to another cause, it probably wouldn't matter 
much. It would matter a lot though if the delay were setting a baud rate 
on a "bit-banged" output, so jiffies may not be worth it. I put it out 
there for consideration, as it is easier to produce than a udelay that 
has a guaranteed low absolute error.

Best Regards,
Bill Campbell
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-15 Thread J. William Campbell
On 7/15/2011 12:17 AM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4e1f8127.8030...@comcast.net>  you wrote:
>> I am pretty sure that is long enough for someone to notice. . I would be
>> interested in seeing an example of such code as you refer to. Could you
>> point me to one, because it seems to me that the only reason to code
>> such a delay is that for some reason the user didn't want to keep
>> looking at the termination condition so often. I think that that
> arch/powerpc/cpu/mpc512x/i2c.c:
>
>   80 while (timeout--&&  (status&  I2C_BB)) {
>   ...
>   88 udelay (1000);
>   89 status = mpc_reg_in (®s->msr);
>   90 }
>
> 103 while (timeout--&&  !(*status&  I2C_IF)) {
> 104 udelay (1000);
> 105 *status = mpc_reg_in (®s->msr);
> 106 }
> arch/powerpc/cpu/mpc512x/pci.c:
>
>   87 /* We need to wait at least a 1sec based on PCI specs */
>   88 for (i = 0; i<  1000; i++)
>   89 udelay(1000);
>
> arch/powerpc/cpu/mpc5xx/spi.c:
>
> 258 for (i = 0; i<  1000; i++) {
> ...
> 265 udelay(1000);
> 266 }
>
> 390 for (tm=0; tm<1000; ++tm) {
> ...
> 394 udelay (1000);
> 395 }
>
> arch/powerpc/cpu/mpc5xxx/i2c.c:
>
> 102 while (timeout--&&  (status&  I2C_BB)) {
> ...
> 111 udelay(15);
> 112 status = mpc_reg_in(®s->msr);
> 113 }
>
> 125 while (timeout--&&  !(*status&  I2C_IF)) {
> 126 udelay(15);
> 127 *status = mpc_reg_in(®s->msr);
> 128 }
>
> And just that you don't think this is in a single CPU only:
>
> arch/powerpc/cpu/mpc8260/pci.c:
>
> 343 for (i = 0; i<  1000; ++i)
> ...
> 345 udelay (1000);
>
> Or in board code:
>
> board/altera/common/cfide.c:
>
>   29 /* wait 500 ms for power to stabilize */
>   30 for (i = 0; i<  500; i++)
>   31 udelay(1000);
>
> board/amcc/bamboo/bamboo.c:
>
> 1019 for (i=0; i<500; i++)
> 1020 udelay(1000);
>
> or common code:
>
> common/cmd_fdc.c:
>
> 213 while((read_fdc_reg(FDC_SRA)&0x80)==0) {
> 214 timeout--;
> 215 udelay(10);
> 216 if(timeout==0) /* timeout occured */
> 217 return FALSE;
> 218 }
>
> 228 while((read_fdc_reg(FDC_MSR)&0xC0)!=0xC0) {
> 229 /* direction out and ready */
> 230 udelay(10);
> 231 timeout--;
> 232 if(timeout==0) /* timeout occured */
> 233 return -1;
> 234 }
> etc.etc.
>
> 375 for(val=0;val<255;val++)
> 376 udelay(500); /* wait some time to start motor */
>
> 418 for(i=0;i<100;i++)
> 419 udelay(500); /* wait 500usec for fifo overrun */
>
> 600 for(i=0; i<255; i++) /* then we wait some time */
> 601 udelay(500);
>
>
> common/usb.c:
>
>93 inline void wait_ms(unsigned long ms)
>94 {
>95 while (ms-->  0)
>96 udelay(1000);
>97 }
>
> etc. etc.  Note this last example which "spreads" the effect in a not
> so nice way.
>
> Note that there are even places where udelay() is part of the protocol
> timing implementation - like in the soft-I2C and soft-SPI drivers.
>
Hi All,
   Thanks for the pointers Wolfgang. From your examples, it seems we 
are talking exclusively about udelay. As I said in my previous post, on 
NIOS2, udelay needs to be mechanized on NIOS2 by a timed loop, not by 
using the same mechanism as the timer, as there is no way one can 
approximate microsecond resolution by the hardware timer. It is my 
opinion that there are many CPUs on which the udelay implementation must 
not be/should not be based on the same timer code as the get_time() 
operation, because in many cases the get_time() operation requires 
interrupts to be functional. As you have also pointed out, udelay needs 
to be available "early", before interrupts are available. Therefore, the 
udelay and get_timer must use the hardware  at least somewhat 
differently. The must also not interfere with each other, which on some 
existing boards they did.
   If the I2C protocol must be available before interrupts are 
available, then udelay must be used. In the above examples, t

Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-14 Thread J. William Campbell
On 7/14/2011 12:41 PM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4e1cf2e0.1030...@comcast.net>  you wrote:
>> Yes, this is true. However, the time_elapsed_since routine can do
>> this dynamically (i.e. add twice the timer resolution) . I think you had
> That would IMHO be a very bad idea.  We have a number of places where
> we have to deal with pretty long timeouts (usually because of protocol
> specifications that require this - often in the order of several
> seconds), where the normal path is very fast.  The typical approach is
> to break the timeout into a large number of very short loops.
> Sometimes we use udelay() for this, other places use get_timer().
>
> So assume we have a timeout of 5 seconds, and implement this as 50,000
> loops of 100 microseconds.  If you silently turn each of these into 20
> milliseconds on NIOS, the timeout would become 1,000 seconds instead
> of 5 - users would return boards as broken and report "it just
> freezes" because nobody expects that it will wake up again after some
> 16 minutes.
Hi All,
   If such a condition existed, that is indeed what would 
happen. However, at present, such code is not being used on NIOS. We 
know this because the current "work-around" of resetting the timer at 
the start of any timeout operation extends the timeout to a minimum of 
10 milliseconds. So we would be waiting 8 minutes currently, not 16, and 
I am pretty sure that is long enough for someone to notice. . I would be 
interested in seeing an example of such code as you refer to. Could you 
point me to one, because it seems to me that the only reason to code 
such a delay is that for some reason the user didn't want to keep 
looking at the termination condition so often. I think that that 
equivalent  operation can be produced by a pretty simple re-coding of 
the loop. In any case, NIOS does not have the problem at present, so the 
suggested new work-around would be no worse than the present situation.
  It is also true that the hardware timer cannot be used in a 
reasonable version of udelay, as most of the desired delays may be very 
short relative to the timebase. A calibrated timing loop would be the 
best approach to the udelay problem.
>> another function name (at_least) involved, but you can define
>> time_elapsed_since as always compensating for the resolution. That will
>> fix any resolution questions in a processor-specific way. It is either
>> that or the ifdefs. One way or another, the resolution must be
>> addressed. Up to now, the implicit resolution has been 1 ms, but we now
>> know that is not general enough.
> It's not as simple as this.  You have to change a lot of code to make
> this work for such slow clock systems.
In general, that is true. There may be a few cases where a delay of less 
than the resolution is essential to make something work. There are 
probably lots of other cases where we can easily remove the restriction 
on the resolution. We cannot fix the first kind, no matter what we do, 
to work on a lower resolution timer. The second kind we can and probably 
should fix, because they are coded in an overly-restrictive manner. In 
any case, we don't absolutely have to fix them until somebody decides to 
use the code on a CPU with a low resolution timer. Practically speaking, 
the suggested solution will therefore work on all extant cases.

Best Regards,
Bill Campbell
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-12 Thread J. William Campbell
On 7/12/2011 5:33 PM, Graeme Russ wrote:
> Hi Reinhard,
>
> On 13/07/11 02:08, Reinhard Meyer wrote:
>> Dear J. William Campbell, All
> [snip]
>
>> Lets just keep the current functions udelay(us) and u32 get_timer(), the
>> latter maybe without parameter. Remove all *masked() and *reset() functions
> This is happening and has Wolfgang's 100% support. Patches are done, just
> need to be rebased and resent
>
>> and lets change the timeout values used in code that also NIOS uses to
>> be 20ms more than the hardware requires. It does not really matter if a
>> *timeout* of 20ms is blown up to 40ms.
> Can't do that - That would mean winding out common code (CFI, Network,
> Serial, FPGA etc, etc) timeouts to global worst-case scenario unless you
> want to add a #define and start littering the code with macro's - Urgh!
Hi All,
   Yes, this is true. However, the time_elapsed_since routine can do 
this dynamically (i.e. add twice the timer resolution) . I think you had 
another function name (at_least) involved, but you can define 
time_elapsed_since as always compensating for the resolution. That will 
fix any resolution questions in a processor-specific way. It is either 
that or the ifdefs. One way or another, the resolution must be 
addressed. Up to now, the implicit resolution has been 1 ms, but we now 
know that is not general enough.

Best Regards,
Bill Campbell

>
> Regards,
>
> Graeme
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-12 Thread J. William Campbell
On 7/12/2011 8:23 AM, Scott McNutt wrote:
> Dear Wolfgang
>
> Wolfgang Denk wrote:
>
>> What exactly is the reason that we cannot have better timer
>> resolutions in NIOS?
> You _can_ have better timer resolutions in Nios. However, there
> are legacy systems that implement timer(s) with a fixed period
> of 10 msec. The use of such implementations is very common.
> How do I know? Because my customers have such implementations.
> Why not upgrade? Because most sane engineers are loathe to
> change their rtl just so they can fix a simple software bug
> or add a simple custom feature.
>
> My obvious preference is to continue to use the latest u-boot in
> such systems ... without having to create a custom branch for
> customers with older Nios implementations, where I use the old
> timer interfaces ... simply because we "improved" the API.
>
>> Scott, maybe you can comment here?
> I have only one comment: Out of pure frustration, I have been
> avoiding any discussions regarding this timer re-write effort.
>
>> At the moment I'm trying to understand if we really have a problem on
>> NIOS2 that cannot be fixed in a way that is compatible with our
>> current plans.
> This is where my frustration begins. I've said this several times
> before, and I don't know how else to say this ... but I'll give it
> one more try:
>
>   This is not a Nios problem.
>
> If I can't use a 10 msec timer interrupt to detect a simple timeout
> condition when I'm programming a flash memory, then the timer API
> design is garbage.
Hi All,
   I agree with the idea that mandating a particular timer 
resolution is an un-desired limitation. You may need it for some cases, 
but not for all. Proof of this concept is the number of legacy UNIX 
systems running on 10 ms timer ticks, or for that matter the number of 
systems running on line cycle interrupts (16.666 ms in the US). It is 
also however true that one cannot ignore the timer resolution. That is 
what produced the problem in the first place. Asking for a delay that is 
less than the resolution isn't going to work as expected. This problem 
can be easily fixed by always adding twice the timer resolution to any 
requested delay, to ensure you waited at least two ticks , or by having 
a routine that does this for you (like the suggested API does). Right 
now, there are many places in the u-boot code that wait a certain small 
number of milliseconds under the assumption that the timer resolution is 
either 1) not an issue at all, or 2) is 1 ms. Both cases are wrong, the 
first case more so than the second. I wonder how many delays in the 
program that say "wait  5 ms" are aware that you could exit the loop 
after 4ms plus epsilon every so often, even with 1 ms resolution? 
Perhaps the tolerance is built into the number 5, perhaps not. However, 
it is surely not if the timer resolution is 16.666 ms.
>   And quite frankly, the enormous amount of
> discussion in this matter reminds me of an undergraduate science
> project.
Or any government project. But when you will effect lots of people, you 
often must kill a few trees to make sure nobody (such as yourself and 
other NIOS users) somehow gets a bad deal.

I am now returning to "real work" and will be offline for a while. Good 
Luck All.

Best Regards,
Bill Campbell
>> We also have to deal with decrementing counters, but this is just aan
>> unimportant detail.  And it appears that we actually can have this,
>> even on NIOS.
> Wow! Even on Nios! Arrggghh!
>
>> The only architecture I remember that seemed prolematic was NIOS
> Really? There have been occasional issues that have required a patch
> here and there. But I consider most of them far from "problematic."
>
> This is exhausting. As I recall, this entire fuss was born out
> of an issue with nested calls to get_timer() ... and that begat
> removing reset_timer ... and that begat ... and so on. And we're
> still spilling lots of intellectual seed here. And now we have
> what? A jack-of-all-trades API that can do everything with
> exacting precision ... other than handling a 10 msec interrupt?
>
> Best Regards,
> --Scott
>
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH v1 (WIP) 00/16] [Timer]API Rewrite

2011-07-12 Thread J. William Campbell
On 7/12/2011 6:10 AM, Wolfgang Denk wrote:
> Dear Graeme Russ,
>

> Do we?  What exactly is the needed resolution of the underlying
> hardware timer?  So far, it appears sufficient to have it ticking with
> 1000 Hz or more.  Are there really systems that cannot provide that?
> The only architecture I remember that seemed prolematic was NIOS - but
> then the NIOS documentation suggests that this might actually be
Hi All,
   The NIOS2 "systems" are a FPGA based architecture ("soft" core). 
They CAN have all different kinds of timers. However, there are many 
already out in the world that were built (generated) with a timer that 
has 10 ms resolution. We can't retro-actively change these. We can 
declare them not supported in future releases, but we can't fix the 
resolution problem retroactively.
   I have two comments regarding this discussion so far. First, I 
think using the "time" function name at all is a VERY BAD idea. People 
will confuse it with the "normal" c library function that returns the 
time of day since the epoch. One may say that they should RTFM, but it 
is lots easier to just use another name and avoid all such confusion. 
Second, I think forcing all systems to use 64 bit time stamps is not a 
good idea. The main purpose of the timer API is to provide hardware 
timeouts. 64 bits is vast overkill for such a purpose. Mandating 64 bit 
capability for all u-boots is I think counter-productive. A 32 bit 
timestamp in millisecond resolution plus udelay covers all practical 
cases. The 1 ms resolution is probably inadequate for performance 
measurement, but for every u-boot system that is sometimes used for 
performance measurement, there are 100 (probably even more, like 1000, 
but I don't really have the figures) systems that are just used to boot 
Linux or another stand alone program. So we may need a different API for 
performance measurement that uses higher resolution. On most "larger" 
CPUs, there are hardware timers that can already do this with minimal 
work. On some less-capable CPUs, it may be lots harder, but if you 
aren't doing performance measurement, you don't need it. So I suggest 
that mandating a 64 bit API for some of the processors (68K comes to 
mind) is not really a good idea. It seems counter to the elegant  
simplicity of the rest of u-boot to stick a 64 bit requirement on 
timestamps used to determine if the boot delay is expired.

Best Regards,
J. William Campbell

>   Best regards,
>
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-28 Thread J. William Campbell
On 5/27/2011 10:53 PM, Graeme Russ wrote:
> Hi Bill,
>
> On 28/05/11 00:23, J. William Campbell wrote:
>> On 5/27/2011 12:35 AM, Graeme Russ wrote:
>>> Hi Wolfgang,
>>>
>>> On 27/05/11 17:13, Wolfgang Denk wrote:
>>>> Dear Graeme Russ,
>>>>
>>>> In message   you wrote:
>>>>> I think we will need to define get_timer() weak - Nios will have to
>>>>> override the default implementation to cater for it's (Nios') limitations
>>>> Please don't  - isn't the purpose of this whole discussion to use
>>>> common code for this ?
>>>>
>>> Yes, but Nios is particularly bad - It has a 10ms tick counter :(
>> Hi All,
>>   And a hardware timer that you can't read to subdivide the 10 ms. Note
>> that this is not necessarily a problem with all NIOS implementations. The
>> timer characteristics can be controlled when you generate the bitstream for
>> the FPGA. You can make the counter both faster and readable if you want. It
>> just uses a bit more silicon. Sad to say, it probably will require  per
>> board get_ticks routine. For the "old" nios2 timers however, overriding
>> get_timer with a /board routine is probably the only way to go.
>>
> Actually, using the logic you presented in the "Remove calls to
> [get,reset]_timer outside arch/" thread, we could have a common get_timer()
> implementation which deals with all timer resolution issues:
>
> #if !defined(CONFIG_MIN_TIMER_RESOLUTION)
> #define CONFIG_MIN_TIMER_RESOLUTION 1
> #endif
>
> u32 get_timer(u32 base)
> {
>   if (base != 0) {
>   if (timer - base<  (CONFIG_MIN_TIMER_RESOLUTION * 2))
>   return 0;
>   else
>   return timer - base - CONFIG_MIN_TIMER_RESOLUTION;
>   } else {
>   return timer;
>   }
> }
>
> This would also guarantee that timeout loops run for _at least_ the timeout
> period
>
Hi Graeme,
   You are almost correct. The problem is that if you are calling 
get_timer early on in start-up, the timer can actually be 0. The loop 
initializing get_timer(0) will then return 0. If you can initialize the 
timer to a number !=0, then the above code works. If the timer is 
totally hardware derived, such an initialization may not be easy. For 
nios2, it is easy since the counter is software interrupt driven.  It 
can just be initialized to 1. The 10 ms updates will then run like 
normal. It also won't work if the user is doing his own timer arithmetic 
by subtracting get_timer(0) values, but there is nothing we can do about 
that case anyway so that case is moot. Good job of spotting this approach.

Best Regards,
Bill Campbell
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 8:13 AM, Simon Glass wrote:
> On Fri, May 27, 2011 at 8:00 AM, J. William Campbell
>   wrote:
> [snip]
>> Hi All,
>>  A more precise statement of the problem is that all timer delays
>> may be shortened by the timer resolution. So this means that if you have
>> a timeout of 1 ms in your get_time(0) {   } while ( ...<  1), then your
>> actual delay may be anywhere between 0 and 1 ms. The problem arises when
>> some piece of common code uses a delay of say 8 millisec, expecting the
>> actual delay to be between 7 and 8. If the resolution is 10 ms, the
>> delay will be between 0 and 10 ms, 0 being particularly bad. This can be
>> fixed in get_timer, making the 8 ms delay  become a minimum of 10 ms at
>> the expense of it becoming up to 20 ms sometimes. Since these delays are
>> used mostly for error conditions, making them longer will probably be
>> ok, and doesn't require changing any of the common code. It probably
>> will not make things slower either, because the error timeouts should
>> not be reached. The reset of the hardware timer would cause all "short"
>> delays to become 10 ms. This reset approach is bad in that it prevents
>> proper nesting of timing loops. However, in this case it isn't so bad,
>> in that the nested loops are just extended, not shortened. Note that if
>> the reset is only resetting the HARDWARE interrupt generator, not the
>> actual timestamp itself, we are just extending all existing timeouts by
>> 0 to 10 ms.. So this just lengthens all pending timeouts. The other fix
>> is in my opinion nicer, because it affects the nest loops less. If the
>> inner loop is executed 100 times, with the reset, the outer loop timeout
>> is extended by up to 1000 ms.
>>
>> Best Regards,
>> Bill Campbell
> Hi Bill,
>
> Yes I agree that this is ugly - I didn't realize that this is what
> reset_timer() does, but I think these 10ms platforms should have to
> live with the fact that timeouts will be 0-10ms longer than hoped.
> Perhaps reset_timer() should become a non-standard board thing that is
> deprecated. Really if you have a 10ms timer and are asking for a 10ms
> timeout you are being a bit hopeful.
Hi All,
 Yes, but the person writing the driver was writing "common" code. 
He probably didn't even know there was a timer whose resolution was not 
1 ms.
> But perhaps this argues for a function to check timeouts - at the
> moment get_timer() returns the time since an event and it is used at
> the start of the loop and the end. Perhaps we should have:
>
> #define TIMEOUTMS 2000
>
> stop_time = get_future_time(TIMEOUT_MS);  // Returns current time +
> TIMEOUT_MS + (resolution of timer)
> while (get_timer(stop_time)<  0) // (I would much prefer while
> (!timed_out(stop_time))
> wait for something
> }
>
> Regards,
> Simon
In the existing system, you can get the same result by running the while 
loop  with a condition of (get_timer(base) < TIMEOUTMS + TIMER_RESOLUTION).
We could just make TIMER_RESOLUTION a mandatory define for all u-boots. 
Then common code would be wrong if the TIMER_RESOLUTION were omitted. 
For all I know, there may be such a define already. Anybody know of one?

Best Regards,
Bill Campbell

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 8:44 AM, Scott McNutt wrote:
> J. William Campbell wrote:
>> On 5/27/2011 6:07 AM, Scott McNutt wrote:
>>> Graeme Russ wrote:
>>>> Hi Wolfgang
>>>>
>>>> On Friday, May 27, 2011, Wolfgang Denk  wrote:
>>>>> Dear Graeme Russ,
>>>>>
>>>>> In message  
>>>>> you wrote:
>>>>>> Besides, Nios can return an increment of 10 (presumably ms) between
>>>>>> two immediately consecutive calls. This causes early timeouts in CFI
>>>>>> driver
>>>>> Now this in turn is a bug in the timer implementation that needs 
>>>>> to be
>>>>> fixed.
>>>
>>> And this is what reset_timer() corrected.
>>>
>>>> Agreed, but that is not something I can achieve - I don't want to hold
>>>> up this whole show that we have all put so much effort into for the
>>>> sake of one weak function
>>>
>>> And I don't want to see something that currently works become broken
>>> because we "improved" a feature ... simply because the resolution of
>>> the timestamp is 10 msec rather than 1 msec.
>>>
>>> And just to be clear. This is not a Nios issue. Currently, if the
>>> timestamp is incremented via a fixed period interrupt, and the period
>>> of the interrupt is longer that 1 msec, calls to get_timer() may
>>> produce early timeouts ... regardless of platform.
> 
>> This can be fixed in get_timer, making the 8 ms delay  become a 
>> minimum of 10 ms at the expense of it becoming up to 20 ms sometimes.
>
> Ok. Now I get it. Thanks.
>
> 
>> This reset approach is bad in that it prevents proper nesting of 
>> timing loops.
>
> In my particular case, because reset_timer() resets the timestamp
> to zero rather than simply restarting the timer. I believe leaving
> the timestamp alone would solve the nesting problem.
>
> 
>> The other fix is in my opinion nicer, because it affects the nest 
>> loops less. If the inner loop is executed 100 times, with the reset, 
>> the outer loop timeout is extended by up to 1000 ms.
>
> Bill, thank you for explaining -- probably for the nth time -- but it
> did finally sink in.
Hi Scott,
  Glad to help, I finally think I understand it myself in looking 
into it further! I think we have a good way ahead that should keep 
everything working. We will get you an alpha copy of whatever we do as 
soon as possible so you can verify we didn't break nios2!
Best Regards,
Bill Campbell
>
> Regards,
> --Scott
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/26/2011 11:54 PM, Graeme Russ wrote:
> On Fri, May 27, 2011 at 4:33 PM, J. William Campbell
>   wrote:
>> On 5/26/2011 9:33 PM, Graeme Russ wrote:
>>> Hi Bill,
>>>
>> 
> [massive snip]
>
> OK, you have my ears pricked - Can you give me code samples for:
>
>   - get_ticks()
>   - sync_timbase() (no need to implement the whole lot if that is too
> much effort right now)
>   - timer_isr()
>
> that handle the following hardware tick counter scenarios:
>
> a) 64-bit up counter
> b) 64-bit down counter
Hi Graeme,
> c) 32-bit up counter, wraps at 65000
Do you mean 32 bits or 16 bits? doesn't make much difference, but just 
checking.
> d) 16-bit microsecond up counter 0-999 which wraps and triggers a 16-bit
> millisecond up counter. Reading milliseconds latched microseconds and
> clears milliseconds (look in arch/x86/cpu/sc520/timer.c for example)
> e) 24-bit counter occupying bits 2-25 of a 32-bit word (just to be
> difficult)
> f) Any other option anyone cares to throw ;)
>
> All of these options must be covered using:
>   - Minimal global data (we would like it to work before relocation, but
> not mandatory - GD footprint would be nice)
>   - All use the same sync_timebase function
>   - Demonstrate using an ISR NOT synced to the hardware tick counter and
> an ISR that is
>   - Parameters to get_ticks() and sync_timer() are permitted, but not for
> timer_isr() (naturally)
>

 OK! Once again, I accept the challenge. One Caveat. My real work 
has been sliding due to the time I have been spending on this. I am 
flying from San Francisco to Sydney tonight , (to work, not to play),  
so I will be off-grid for 14 hours+. You will not get this code for a 
few days, like probably 3 days. I have looked at the requirements, and I 
see no real problems that I don't know how to solve.
>> I don't' see any reason to push this down to a lower level. It is just one
>> more thing to get messed up across implementations.
> Agreed
>
>>>> detection in the non-interrupt case to sync_timebase as well.
>>>> Sync_timebase
>>>> can also invert the down-counting counters, removing that from get_ticks.
>>>> The wrap detection code can be #ifdef out if one is using interrupts and
>>> Urghh - Why are you adding unnecessary ugliness - #ifdef in the middle of
>>> code is usually a sign you are doing something wrong
>> As I said, this is an optional optimization. I do not agree that an #ifdef
>> in the middle of code indicates you have a bad design. Lots and Lots of
>> ifdefs certainly indicates a bad design. An ifdef to eliminate code if some
>> option is not selected is hardly such a strange thing, especially only a
>> single #ifdef. However, feel free to not have it if you so desire.
> OK, I'll let this slide for the moment - please include in above example
Will Do.
>>>> offended by it's presence. Thanks for pointing this out and compelling me
>>>> to
>>>> reduce the number of cases! Making get_ticks more lightweight is a good
>>>> idea
>>>> in my opinion.
>>>>> Lets say you have a platform with a 32-bit tick counter running at a
>>>>> reasonably long rollover time so you decide not to use interrupts. Then
>>>>> you create a new platform with the same tick counter, but it runs much
>>>>> faster and you realise you need to implement the interrupt routine to
>>>>> make get_timer() work for long enough periods - Fine, you add an ISR
>>>>> for the new platform that calls sync_timebase - No other changes are
>>>>> required.
>>>>>
>>>>> The last thing we want is for the 64-bit tick counter to be conceptually
>>>>> different across platforms
>>>>>
>>>>> I just realised - the ISR _does not need to call the sync_timebase at
>>>>> all_
>>>>> The ISR only needs to call get_ticks(), so it will be fast anyway
>>>> The problem is that the way we previously detected wrapping does not work
>>>> if
>>>> the interrupt rate is == to the counter wrap time, which it essentially
>>>> always is. If get_ticks is trying to update the wrap count when an
>>>> interrupt
>>> Is it, always, on every platform?
>> Yes, pretty much. You get a terminal count/counter underflow interrupt and
>> that is it.
> Not on sc520 - The micro/millisecond counter cannot be used to driver an
> interrupt - you need to setup a seperate timer. I think the x86 internal
> performance counters are the same
What is true, as 

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 6:07 AM, Scott McNutt wrote:
> Graeme Russ wrote:
>> Hi Wolfgang
>>
>> On Friday, May 27, 2011, Wolfgang Denk  wrote:
>>> Dear Graeme Russ,
>>>
>>> In message  you 
>>> wrote:
 Besides, Nios can return an increment of 10 (presumably ms) between
 two immediately consecutive calls. This causes early timeouts in CFI
 driver
>>> Now this in turn is a bug in the timer implementation that needs to be
>>> fixed.
>
> And this is what reset_timer() corrected.
>
>> Agreed, but that is not something I can achieve - I don't want to hold
>> up this whole show that we have all put so much effort into for the
>> sake of one weak function
>
> And I don't want to see something that currently works become broken
> because we "improved" a feature ... simply because the resolution of
> the timestamp is 10 msec rather than 1 msec.
>
> And just to be clear. This is not a Nios issue. Currently, if the
> timestamp is incremented via a fixed period interrupt, and the period
> of the interrupt is longer that 1 msec, calls to get_timer() may
> produce early timeouts ... regardless of platform.
Hi All,
 A more precise statement of the problem is that all timer delays 
may be shortened by the timer resolution. So this means that if you have 
a timeout of 1 ms in your get_time(0) {   } while ( ... < 1), then your 
actual delay may be anywhere between 0 and 1 ms. The problem arises when 
some piece of common code uses a delay of say 8 millisec, expecting the 
actual delay to be between 7 and 8. If the resolution is 10 ms, the 
delay will be between 0 and 10 ms, 0 being particularly bad. This can be 
fixed in get_timer, making the 8 ms delay  become a minimum of 10 ms at 
the expense of it becoming up to 20 ms sometimes. Since these delays are 
used mostly for error conditions, making them longer will probably be 
ok, and doesn't require changing any of the common code. It probably 
will not make things slower either, because the error timeouts should 
not be reached. The reset of the hardware timer would cause all "short" 
delays to become 10 ms. This reset approach is bad in that it prevents 
proper nesting of timing loops. However, in this case it isn't so bad, 
in that the nested loops are just extended, not shortened. Note that if 
the reset is only resetting the HARDWARE interrupt generator, not the 
actual timestamp itself, we are just extending all existing timeouts by 
0 to 10 ms.. So this just lengthens all pending timeouts. The other fix 
is in my opinion nicer, because it affects the nest loops less. If the 
inner loop is executed 100 times, with the reset, the outer loop timeout 
is extended by up to 1000 ms.

Best Regards,
Bill Campbell
>
> --Scott
>
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 12:35 AM, Graeme Russ wrote:
> Hi Wolfgang,
>
> On 27/05/11 17:13, Wolfgang Denk wrote:
>> Dear Graeme Russ,
>>
>> In message  you wrote:
>>> I think we will need to define get_timer() weak - Nios will have to
>>> override the default implementation to cater for it's (Nios') limitations
>> Please don't  - isn't the purpose of this whole discussion to use
>> common code for this ?
>>
> Yes, but Nios is particularly bad - It has a 10ms tick counter :(
Hi All,
  And a hardware timer that you can't read to subdivide the 10 ms. 
Note that this is not necessarily a problem with all NIOS 
implementations. The timer characteristics can be controlled when you 
generate the bitstream for the FPGA. You can make the counter both 
faster and readable if you want. It just uses a bit more silicon. Sad to 
say, it probably will require  per board get_ticks routine. For the 
"old" nios2 timers however, overriding get_timer with a /board routine 
is probably the only way to go.

l
> I don't see reason for hamstring other platforms when a simply weak
> function can get around it
Agree.
Best Regards,
Bill Campbel
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 12:33 AM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4ddf2072.5090...@comcast.net>  you wrote:
> ...
>> The problem is that the way we previously detected wrapping does not
>> work if the interrupt rate is == to the counter wrap time, which it
>> essentially always is. If get_ticks is trying to update the wrap count
> You ignore the fact that this is only ever a problem when the rollover
> cannot signal through an interrupt or similar.  Also, some processors
> allow daisy-chaning of timers, etc.
>
> Again, I would really like to know about how many exotic systems we
> are talking that fulfil your worst-case expectations.  I bet the
> overwhelming majority behaves absolutely harmless.
Hi Wolfgang,
 I think that in fact the opposite is true. The problems occur if 
both the main program and the interrupt routine are trying to update the 
timer msb using the same code, as we were originally talking about. 
There is no problem if only the interrupt routine detects the rollover. 
That is the correct way to go if your interrupts work. There was nothing 
particularly exotic required. It was the "normal" case. Take a look at 
what would happen on the PPC is the main program was reading the 
decrementer, detecting wraps and increasing the timestamp while the 
interrupt routine was also incrementing the timestamp. Every so often 
you get a double increment. Why were we doing this? Because I was trying 
to re-use exactly the same code in the interrupt case and the 
non-interrupt case. Not a good idea, in fact a bad idea as it turns out.

Best Regards,
Bill Campbell
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-27 Thread J. William Campbell
On 5/27/2011 12:28 AM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4ddefdbc.7050...@comcast.net>  you wrote:
>> I really STRONGLY disagree with this statement. If you actually needed
>> 64 bit variables, fine use them. But as I have already shown, you do not
>> need them in general.  We are computing a 32 bit result. There is some
>> entropy argument that says you shouldn't need 64 bits to do so. Another
>> way to look at it is that converting the top 32 bit word and the bottom
>> 32 bit word to ms separately can be easier than doing them both together
>> at once.  However, as we will see below, I do agree we need two 32 bit
>> words to make this process go smoothly. I just don't agree that they
>> should/will constitute a 64 bit binary word. See below.
> And I disagree with this.
>
Hi Wolfgang,
 OK, I hear you.
>> Yes, that is the problem. I have come to the view that  two 32 bit words
>> are the best approach. Note that the lsb may actually not fill the full
>> 32 bits. The top 32 bits are the rollover count and the bottom 32 bits
>> are the current counter. If the counter is a full 32 bits, so much the
>> better. Again, one could put them together inside the interrupt routine
>> , but it is easier to check for a changed value if you don't do this.
> It's even easier if you use a single 64 bit variable, because then you
> can simply use ==.
>

 In general, no you can't, or at least you probably don't want to. . 
If you are reading a 64 bit performance counter, it is quite likely that 
you cannot read it twice without the "clock" having "ticked". If the CPU 
executes 1 instruction (or fewer(if an SPR/memory reference  is 
involved?) per performance counter tick, which is the goal of the 
performance counter, == is an infinite loop A similar condition 
exists if you are combining a software counter with a fairly fast 
hardware counter. It might require flipping the hardware counter (if it 
is a down counter) and a 64 bit multiply add, which must be done in 
software/a subroutine if the cpu has no 64 by 64 multiply. By the time 
that is done, the timer LSB may have ticked. Consider the m68K case.
>> Otherwise, you have to check both words. It also makes the isr faster.
> Drop any thoughts about "make FOO faster" for now, please.  Especially
> at this stage it is much more important to have a simple and clean
> design.  If split in two variables, even a simple read access will
> turn into code like
>
>   do {
>   upper  = timebase_upper;
>   lower  = timebase_lower;
>   } while (upper != timebase_upper);
>
> This is not exactly as simple as you claimed.

 True, but if you look at a lot of 64 bit performance counters, that 
is EXACTLY what the handbook book recommends on how to read them. There 
is no atomic way to read them both at once, and reading one half doesn't 
freeze the other half. This code is also required if timebase_upper is 
altered in the interrupt routine.  YMMV, but in a lot, dare I say most, 
cases this is required anyway.  And while the code is more complex than 
a simple assignment statement, it is not very complex.

Best Regards,
Bill Campbell

> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 9:33 PM, Graeme Russ wrote:
> Hi Bill,
>

> get_ticks() does not care about the clock rate - It simply looks at the
> current value of the hardware tick counter and the value of the hardware
> tick counter the last time get_ticks() was called, calculates the difference
> and adds that to the 64-bit software tick-counter
> I don't see how it being a down counter makes that any more difficult
>
>> This is neither simple nor maintainable. Further, it is un-necessary, as the
>> sync_timer routine is just going to convert the number from whatever radix
>> we converted it to into millisec.  If we leave the two numbers as split, all
>> that complexity is removed from get_ticks and sent upward to the common
>> routine that converts the answer into ms anyway. This makes the system more
>> maintainable by placing the minimum requirement on get_ticks. The "tick"
>> should be opaque to anybody but sync_timebase anyway.
> But how is the common routine going to know if it is a split timer, up
> timer, down timer, little endian, big endian, etc, etc.
>
> get_ticks() abstracts the hardware implementation of the hardware timer
> from sync_timer()
Hi All,
 I understand your point. I prefer a higher level of abstraction. 
You are correct that there are some aspects of the tick counter that are 
very hardware quirky, and these attributes are hard to abstract. If the 
timer is embedded into a bit field with other variables, it is 
reasonable to expect get_ticks to extract the bit field and right 
justify the number. If there are endian issues, the get_ticks routine 
must return the number in the "natural" endianness of the platform. 
However, after that point, the values are extremely "regular". The fact 
that a counter is a down counter can be expressed in a data structure as 
a boolean. The high limit of the hardware counter is a number. The 
number of ticks per millsecond is obtainable from usec2ticks(1000), or 
1 if we want to avoid some roundoff. From these values, sync_timer 
can take the two part ticks value and convert it to millisec. Trust me 
on this. I have the routines to do it. This puts as much of the 
abstraction of the two numbers into ONE COMMON ROUTINE, sync_timer. Now 
it is clearly possible to move some of the abstraction down a level into 
sync_timer. For instance you could move inverting the counter down to 
that level, and then multiply the msb by the maximum value of the lsb 
counter and add in the msb. It is clearly possible to move ALL of 
sync_timer down into get_ticks, if one wanted to. It is clearly possible 
to replace general values in gd-> with platform specific constant 
values. However, if you do that, you end up with a lot of duplicate, or 
almost duplicate, code running around. That has proven to be error 
prone, and it has left new ports of u-boot to sort of fend for 
themselves in figuring out how things should work. I prefer to abstract 
that all up into sync_timer. That way, all the math is in one place, and 
is table driven so it is easy to change.
 The top 32 bits are the rollover count and the bottom 32 bits are the
 current counter. If the counter is a full 32 bits, so much the better.
>>> Ah - Lets keep it that way
>>>
 Again, one could put them together inside the interrupt routine , but it
 is
 easier to check for a changed value if you don't do this. Otherwise, you
 have to check both words. It also makes the isr faster. It is just an
>>> As I said before - Simple First, Fast Later
>> I am in favor of simple. That is why I want get_ticks to do as little as
>> possible. It should just read the hardware register and the overflow counter
>> if it is separate. Make sure the overflow didn't change while we were
>> reading. This is redundant if we are not using interrupts but we can leave
>> the code in. It just won't do anything.  We can also move the rollover
>> detection to sync_timebase. It will be redundant if we are using interrupts,
>> because time will never "back up". But we can do it this way. This
>> centralizes the overflow detection, which is a good thing.
> That does not sound simple to me. This, however, does:
>
> u64 get_ticks(void)
> {
>   static u64 last_hw_tick_count;
>   static u64 last_sw_tick_count;
>
>   /* Now for the platform specific stuff - Read hardware tick counter */
>   u64 current_hw_tick_count = /* Read hw registers etc */
>
>   /* Deal with hardware weirdness - errata, stupid hw engineers etc */
>
>   u64 elapsed_ticks = current_hw_tick_count - last_hw_tick_count;
>   last_sw_tick_count += elapsed_ticks;
>
>   return last_sw_tick_count;
> }
>
> The '/* Read hw registers etc */' bit will always be the same, no matter
> what way you do it
Agree.
> The '/* Deal with hardware weirdness - errata, stupid hw engineers etc */'
> bit is where we are truly abstracting the hardware away - This is the
> bit you propose to leave mangled and deal with in sync_time?
Not totally. The get_

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 6:51 PM, Graeme Russ wrote:
> Hi Bill,
>
> On Fri, May 27, 2011 at 11:26 AM, J. William Campbell
>   wrote:
>> On 5/26/2011 4:28 PM, Graeme Russ wrote:
>>> Why mess around with bit shifting (which you would then have to cludge
>>> into
>>> your platform code) when carting around a 64-bit value is relatively
>>> cheap,
>>> transparent and poratble (all all supported up-to-date tool chains)
>>>
>> I really STRONGLY disagree with this statement. If you actually needed 64
>> bit variables, fine use them. But as I have already shown, you do not need
>> them in general.  We are computing a 32 bit result. There is some entropy
>> argument that says you shouldn't need 64 bits to do so. Another way to look
>> at it is that converting the top 32 bit word and the bottom 32 bit word to
>> ms separately can be easier than doing them both together at once.  However,
>> as we will see below, I do agree we need two 32 bit words to make this
>> process go smoothly. I just don't agree that they should/will constitute a
>> 64 bit binary word. See below.
>>>>>   - void timer_isr()
>>>>>  - Optional (particularly if tick counter rollover period is
>>>>>exceptionally log which is usually the case for native 64-bit tick
>>>>>counters)
>>>>>  - Simply calls sync_timebase()
>>>>>  - For platforms without any tick counter, this can implement one
>>>>>(but accuracy will be harmed due to usage of disable_interrupts()
>>>>> and
>>>>>enable_interrupts() in U-Boot
>>>>>
>>>>> So to get the new API up and running, only two functions are mandatory:
>>>>>
>>>>> get_ticks() which reads the hardware tick counter and deals with any
>>>>> 'funny
>>>>> stuff' including rollovers, short timers (12-bit for example), composite
>>>>> counters (16-bit 0-999 microsecond + 16-bit millisecond) and maintains a
>>>>> 'clean' 64-bit tick counter which rolls over from all 1's to all 0's.
>>>>> The
>>>> I think it is the task of get_ticks to return the hardware tick counter
>>>> as
>>>> an  increasing counter, period.  The counter may wrap at some final count
>>>> that is not all ones. That is ok. Sync_timebase deals with the rollovers
>>>> if
>>> The hardware tick counter may, the 64-bit software tick counter maintained
>>> by get_ticks() may not
>>>> necessary. get_ticks is very lightweight. get_ticks should deal with
>>>> decrementing counters by returning the complement of the counter.  The
>>>> sc520
>>>> case is a bit more complex if you intend to use the 0-999 and 16 bit
>>>> millisec registers, in that you do need to add them to the previous value
>>>> to
>>> As I mentioned in another post, this is a problem for the platform
>>> maintainer and is abstracted away throught the platform specific
>>> implementation of get_ticks()
>>>
>>>> make an increasing counter. Sync_timebase "likes" short counters in that
>>>> they are easy to convert to millisec and tick remainders.
>>> The compiler should handle using 64-bit rather than 32-bit transparently
>> True enough.  But you don't need 64 bit variables at this point two 32 bit
>> ones work just fine, in fact better in most cases.
> Remember, we are not dealing with a high performance OS here. The primary
> goal is portability - Performance optimisations (which do not break
> portability) can be performed later
>
>>>>> 64-bit tick counter does not need to be reset to zero ever (even on
>>>>> startup
>>>>> - sync_timebase tacks care of all the details)
>>>> True, but sync_timebase does have to be initialized (as does the timer
>>>> itself in most cases, so this is not a restriction).
>>> This can be done in timer_init() via a call to sync_timebase() after the
>>> timer has been configured. This should bring everything into line
>>>
>>>>> ticks_per_millisecond() simply return the number of ticks in a
>>>>> millisecond
>>>>> - This may as simple as:
>>>>>
>>>>> inline u64 ticks_per_millisecond(void)
>>>>> {
>>>>> return CONFIG_SYS_TICK_PER_MS;
>>>>> }
>>>>>
>>>>> But it may be trickier if you have a programmable tick frequ

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 4:28 PM, Graeme Russ wrote:
> Hi Bill,
>
> On Fri, May 27, 2011 at 2:56 AM, J. William Campbell
>   wrote:
>> On 5/26/2011 6:27 AM, Graeme Russ wrote:
>>> Hello Everyone,
>>>
>>> OK - Starting a new thread to discuss implementation details. This is a
>>> heads-up for arch/platform maintainers - Once this is a bit more stable, I
>>> will put it on the wiki
>>>
>>> Assumed Capabilities of the Platform
>>>   - Has a 'tick counter' that does not rely on software to increment
>> Hi All,
>>The nios2 with the most basic timer does not meet this requirement. It
>> will not count at all without the 10 ms interrupt. I don't think this
>> requirement matters anyway. We need a 'tick counter' that 'ticks'. If it
>> takes software to make it tick, we don't much care. There may be problems
>> with early use of udelay in that case, but that is a different issue.
> I think we will need to define get_timer() weak - Nios will have to
> override the default implementation to cater for it's (Nios') limitations
Hi All,
 Yes, that will probably be required here.
>>>   - tick interval may by a fixed constant which cannot be controlled
>>> via software, or it could be programmable (PIT)
>>>
>>> API Functions (/lib/timer.c)
>>>   - u32 get_timer(u32 start)
>>>  - Returns the number of elapsed milliseconds since 'start'
>>>
>>> API Functions (/arch/...)
>>>   - void udelay(u32 delay)
>>>  - Used for 'short' delays (generally up to several seconds)
>>>  - Can use the tick counter if it is fast enough
>>>  - MUST NOT RESET THE TICK COUNTER
>> There is a requirement that udelay be available before relocation and before
>> the BSS is available. One can use the tick counter to provide udelay as long
>> as sync_timebase is not called OR sync timebase does not use BSS. It appears
>> many implementations ignore this requirement at present. We should try to
>> fix this, but is should not be a requirement.
> If you really wanted to, sync_timebase() could use global data (it doesn't
> have many static variables) in which case all timer functions would be
> available before relocation
Yes, my implementation of the sync_timebase routine was written that 
way, using gd-> for the required variables.
>>> 'Helper' Functions (/lib/timer.c)
>> I think this function should be weak, so that it is possible for people to
>> override it with a "custom" function. The fully general sync_timebase has
>> lots of code in it that can be simplified in special cases. We want and need
>> a fully general function to be available, but other users who are real tight
>> on space may want a cut down version. We should make that easily possible.
> Agree
>
>>>   - void sync_timebase(void)
>>>  - Updates the millisecond timer
>>>  - Utilises HAL functions to access the platform's tick counter
>>>  - Must be called more often than the rollover period of the
>>>platform's tick counter
>>>  - Does not need to be called with a regular frequency (immune
>>>to interrupt skew)
>>>  - Is always called by get_timer()
>>>  - For platforms with short tick counter rollovers it should
>>>be called by an ISR
>>>  - Bill Campbell wrote a good example which proved this can be common
>>>and arbitrary (and optionally free of divides and capable of
>>>maintaining accuracy even if the tick frequency is not an even
>>>division of 1ms)
>>>
>>> HAL Functions (/arch/... or /board/...)
>>>   - u64 get_ticks(void)
>> For what it's worth, I would like to propose using a (gasp!) typedef here.
>> It seems to me there are a whole lot of cases where the max number of ticks
>> is a u32 or less. In those cases, the wrap at 32 bits helps things a lot. If
>> the tick counter is really 64 bits, the function of sync_timebase  is simply
>> to convert the tick value  to millisec, and that is it. Otherwise, if it is
>> 32 bits or less then some other actions will be required. I think this is
>> one of those times where a typedef would help, We could define a type called
>> timer_tick_t to describe this quantity. That would allow a pure 32 bit
>> implementation where appropriate.
>>
>> Another suggestion is that perhaps we want a u32 get_ticks_lsb(void) as well
>> as a regular get_ticks. The lsb version would be used for udelay and could
>&

Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 1:27 PM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4ddeafe0.8060...@comcast.net>  you wrote:
>> I certainly agree using 64 bits for all calculations is vast overkill.
>> In fact, I think using 64 bit calculations on systems that have only a
>> 32 bit or less timer register is probably overkill. :-) However, to
>> date,AFAIK,  no processor has exceeded the u32 in ticks per second. As I
> Not yet. But it makes no sense to start a new design with settings
> already in critical range, especially since there is zero problem
> with breaking it down by a factor of 1000 or 1e6.
>
>> pointed out in a previous e-mail, if they ever do this, we can just drop
>> one or 2 bits off the 64 bit counter and in millisecond resolution,
>> nobody will ever know.  Also as previously pointed out, usec2ticks is
> No. I will not accept a design that is so close on the edge of
> breaking.
>
> What is your exact problem with the existing interfaces ticks2usec()
> and usec2ticks() ?
>
>> not present yet in lots of implementations. Also, if the fundamental
>> clock frequency is 32 kHz  (or anything less than 1 MHz), usec2ticks is
>> 0! This probably rules out using it to get ticks per millisec or ticks
>> per sec.
> The statement "usec2ticks is 0" makes absolutely no sense as long as
> you don't say which argument you pass in.  You get a return value of
> 0 even for a tick rate in the GHz range if you pass 0 as argument.
>
> Hoewver, usec2ticks(1000) or maybe usec2ticks(10) will probably
> return pretty useful values.
>
> [Note that by passing properly scaled arguments you can also avoid a
> number of rounding errors.]
Hi Wolfgang,
   Yes, you are correct. I was thinking usec2ticks(1), which is 
certainly not the way to do it. I am happy with usec2ticks and 
ticks2usec. That works for me. Sorry for the noise.

How about the first part of my response? Are you still thinking about it 
or is it just too bad for words :-) ?

Best Regards,
Bill Campbell

>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 12:16 PM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4ddea165.9010...@comcast.net>  you wrote:
>>>> I think it is the task of get_ticks to return the hardware tick counter
>>>> as an  increasing counter, period.  The counter may wrap at some final
>>>> count that is not all ones. That is ok. Sync_timebase deals with the
>>> NO!  We want to be able to compute time differences using simple
>>> unsigned arithmentics, even after a rollover of the counter.  For this
>>> it is mandatory that the counter always gets only incremented until it
>>> wraps around at te end of it's number range, and never gets reset
>>I agree that that is what must happen, but it should happen inside
>> sync_timebase. Sync_timebase does what is needed to convert the
>> less-than-fully capable counters into a fully capable one. You could
> I think you also want this behaviour for get_ticks().
Hi Wolfgang,
 I understand why that might be nice. But to do that with common 
code would require get_ticks to call a generic routine (say sync_ticks) 
that would expand the counter to 64 bits. Note that this is not always 
totally trivial, as the timer may roll over at 10 ms or some other 
not-so-nice number. Then sync_timer would convert the 64 bit number to 
milliseconds. That approach will work. However, I think that is 
overkill, as we really want the result in milliseconds. If you look at 
the prototype sync_timer routine, you can see an example of how this is 
possible without resorting to 64 bit math. I think that avoiding the 64 
bit math on processors that don't have a 64 bit tick counter (and are 
therefore probably less capable) is worthwhile. I also think that the 
purpose of the get_time routine abstracting the time into milliseconds 
is to avoid dealing with ticks anywhere except in the timer routines. 
Presumably, nobody but sync_timer would ever have reason to call 
get_ticks. If that is not your thinking,  fine, we just disagree on that 
point, and having a sync_ticks to expand the tick counter certainly can 
be done.
>> To date, it has been shown conclusively that this process cannot be
>> relied upon, or we wouldn't be having this discussion.  If we put that
>> functionality inside sync_timebase, it is in one place and debuggable
>> once. All sync_timebase requires to do this is ticks per second and
>> maximum tick value. I do request that counters that decrement be negated
>> in the get_ticks routine, but beyond that, it should be a simple read of
>> the tick register(s).
> I think using ticks per second is not a good idea. It may exceed
> ULONG_MAX, and having to use 64 bit for all calculations is probably
> overkill.  The existing ticks2usec/usec2ticks work fine so far.
I certainly agree using 64 bits for all calculations is vast overkill. 
In fact, I think using 64 bit calculations on systems that have only a 
32 bit or less timer register is probably overkill. :-) However, to 
date,AFAIK,  no processor has exceeded the u32 in ticks per second. As I 
pointed out in a previous e-mail, if they ever do this, we can just drop 
one or 2 bits off the 64 bit counter and in millisecond resolution, 
nobody will ever know.  Also as previously pointed out, usec2ticks is 
not present yet in lots of implementations. Also, if the fundamental 
clock frequency is 32 kHz  (or anything less than 1 MHz), usec2ticks is 
0! This probably rules out using it to get ticks per millisec or ticks 
per sec.

Best Regards,
Bill Campbell

> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 10:53 AM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4dde8639.3090...@comcast.net>  you wrote:
>
>> I think it is the task of get_ticks to return the hardware tick counter
>> as an  increasing counter, period.  The counter may wrap at some final
>> count that is not all ones. That is ok. Sync_timebase deals with the
> NO!  We want to be able to compute time differences using simple
> unsigned arithmentics, even after a rollover of the counter.  For this
> it is mandatory that the counter always gets only incremented until it
> wraps around at te end of it's number range, and never gets reset
Hi All,
  I agree that that is what must happen, but it should happen inside 
sync_timebase. Sync_timebase does what is needed to convert the 
less-than-fully capable counters into a fully capable one. You could 
move that functionality down into get_ticks, but if you do, you end up 
much like the present scheme where the multiple different get_ticks 
routines expected to cope with expanding the hardware counter properly.  
To date, it has been shown conclusively that this process cannot be 
relied upon, or we wouldn't be having this discussion.  If we put that 
functionality inside sync_timebase, it is in one place and debuggable 
once. All sync_timebase requires to do this is ticks per second and 
maximum tick value. I do request that counters that decrement be negated 
in the get_ticks routine, but beyond that, it should be a simple read of 
the tick register(s).

Best Regards,
Bill Campbell

> before.
>
>> You will have to call the routine that initializes sync_timebase. This
>> routine should have a name, like void init_sync_timebase(void)?
> init_timebase().
>
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC][Timer API] Revised Specification - Implementation details

2011-05-26 Thread J. William Campbell
On 5/26/2011 6:27 AM, Graeme Russ wrote:
> Hello Everyone,
>
> OK - Starting a new thread to discuss implementation details. This is a
> heads-up for arch/platform maintainers - Once this is a bit more stable, I
> will put it on the wiki
>
> Assumed Capabilities of the Platform
>   - Has a 'tick counter' that does not rely on software to increment
Hi All,
The nios2 with the most basic timer does not meet this 
requirement. It will not count at all without the 10 ms interrupt. I 
don't think this requirement matters anyway. We need a 'tick counter' 
that 'ticks'. If it takes software to make it tick, we don't much care. 
There may be problems with early use of udelay in that case, but that is 
a different issue.
>   - tick interval may by a fixed constant which cannot be controlled
> via software, or it could be programmable (PIT)
>
> API Functions (/lib/timer.c)
>   - u32 get_timer(u32 start)
>  - Returns the number of elapsed milliseconds since 'start'
>
> API Functions (/arch/...)
>   - void udelay(u32 delay)
>  - Used for 'short' delays (generally up to several seconds)
>  - Can use the tick counter if it is fast enough
>  - MUST NOT RESET THE TICK COUNTER
There is a requirement that udelay be available before relocation and 
before the BSS is available. One can use the tick counter to provide 
udelay as long as sync_timebase is not called OR sync timebase does not 
use BSS. It appears many implementations ignore this requirement at 
present. We should try to fix this, but is should not be a requirement.
> 'Helper' Functions (/lib/timer.c)
I think this function should be weak, so that it is possible for people 
to override it with a "custom" function. The fully general sync_timebase 
has lots of code in it that can be simplified in special cases. We want 
and need a fully general function to be available, but other users who 
are real tight on space may want a cut down version. We should make that 
easily possible.
>   - void sync_timebase(void)
>  - Updates the millisecond timer
>  - Utilises HAL functions to access the platform's tick counter
>  - Must be called more often than the rollover period of the
>platform's tick counter
>  - Does not need to be called with a regular frequency (immune
>to interrupt skew)
>  - Is always called by get_timer()
>  - For platforms with short tick counter rollovers it should
>be called by an ISR
>  - Bill Campbell wrote a good example which proved this can be common
>and arbitrary (and optionally free of divides and capable of
>maintaining accuracy even if the tick frequency is not an even
>division of 1ms)
>
> HAL Functions (/arch/... or /board/...)
>   - u64 get_ticks(void)
For what it's worth, I would like to propose using a (gasp!) typedef 
here. It seems to me there are a whole lot of cases where the max number 
of ticks is a u32 or less. In those cases, the wrap at 32 bits helps 
things a lot. If the tick counter is really 64 bits, the function of 
sync_timebase  is simply to convert the tick value  to millisec, and 
that is it. Otherwise, if it is 32 bits or less then some other actions 
will be required. I think this is one of those times where a typedef 
would help, We could define a type called timer_tick_t to describe this 
quantity. That would allow a pure 32 bit implementation where appropriate.

Another suggestion is that perhaps we want a u32 get_ticks_lsb(void) as 
well as a regular get_ticks. The lsb version would be used for udelay 
and could possibly come from another timer if that was 
necessary/desirable. See the requirement for early udelay early 
availability.
>  - Returns a tick count as an unsigned 64-bit integer
>  - Abstracts the implementation of the platform tick counter
>(platform may by 32-bit 3MHz counter, might be a 16-bit
>0-999 microsecond plus 16-bit 0-65535 millisecond etc)
>   - u64 ticks_per_millisecond()
>  - Returns the number of ticks (as returned by get_ticks()) per
>millisecond
I think ticks_per_sec would be a better choice. First, such a function 
already exists in almost all u-boots. Second, if one wants the best 
accuracy for things like udelay, you need better accuracy than  
millisec. Since this function is used only infrequently, when things are 
initialized, a divide to get ticks_per_millsec (if that is what you 
really want) is no big deal. Lastly, I think this function can remain 
u32. Yes, there is a 4 GHz limit on the clock rate. If this ever becomes 
an issue, we can change the type to timer_tick_t. When the CPU clock 
rate gets quite high, it is an advantage to divide it down for 
performance measurement anyway. The AMD/Intel chips already do this. If 
the hardware doesn't do it, shift the timer value right two bits. I 
doubt you will miss the 0.4 nanoseconds resolution loss from your 10 GHz 
timestamp.
>   - void timer_isr()
>  - Optional (particularly if tick counter

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 9:40 PM, Graeme Russ wrote:
> On Thu, May 26, 2011 at 2:19 PM, Reinhard Meyer
>   wrote:
>> Dear Graeme Russ,
>>> On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2)
>>> do. at91 is odd - It looks like it uses interrupts, but get_timer() and
>>> udelay() both end up calling get_timer_raw() (with udelay only having
>>> millisecond resolution it seems). Some others can be configured to
>>> increment the timer using an interrupt. ARM is, quite frankly, a complete
>>> mess - It has a mass of *_timer_masked() functions which the core timer
>>> functions are 'wafer thin' wrapper around, udelay() silently resets
>>> the timebase trashing get_timer() loops etc.
>> Please look at current master for at91.
>>
>> http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/arm926ejs/at91/timer.c;h=a0876879d3907af553d832bea187a062a22b9bd4;hb=5d1ee00b1fe1180503f6dfc10e87a6c6e74778f3
>>
>> AT91 uses a 32 bit hardware register that by means of a prescaler is made
>> to increment at a rate in the low megahertz range.
> Yes, I see that now
>
>> This results in a wrap approximately every 1000 seconds.
>> Actually this would be sufficient for all known uses of udelay() and 
>> get_timer()
>> timeout loops. However, this hardware register is extended to 64 bits by 
>> software
>> every time it is read (by detecting rollovers).
> Which makes it 100% compatible with my proposed solution - The software
> prescaler will trigger the 64-bit extension and rollover detection
>
>> Since a wrap of that 64 bit "tick" would occur after the earth has ended,
>> it is simple to obtain milliseconds from it by doing a 64 bit division.
> Which would be done in the common prescaler in /lib/
>
> Currently, most ARM specific utilisations of get_timer() enforce a reset
> of the tick counter by calling reset_timer() - Subsequent calls to
> get_timer() then assume a start time of zero. Provided the internal timer
> rolls over currectly, the initial call of get_timer(0) will reset the ms
> timer and remove and 'glitch' present due to not calling the 'extender'
> function between 32-bit rollovers which makes the reset_timer() call
> unneccessary - I believe at91 behaves correctly in this regard.
>
> In any case, the underlying assumption made by the ARM timer interface
> (call reset_timer() first always) is inherently broken as not all users
> of the timer API do this - They assume a sane behaviour of:
>
>   start = get_timer(0);
>   elapsed_time = get_timer(start);
>
> Add to this udelay() resetting the timer make the following very broken:
>
>   start = get_timer(0);
>   while(condition) {
>   udelay(delay);
>   }
>   elapsed_time = get_timer(start);
>
> NOTE: In this case, if udelay() also calls the prescaler then no interrupt
> triggered every 1000s would be required in the above example to get
> correct elapsed_time even if the loop ran for several hours (provided
> udelay() is called at least every 1000s
>
> However, to allow timing of independent events with no intervening
> udelay() or get_timer() calls, an 1000s interrupt to kick the prescaler is
> all that is needed to make this particular implementation behave correctly.
Hi All,
   True, if the processor supports timer interrupts. The problem is 
that the existing u-boots in many cases do not. I think that is really 
the crux of the problem. So what are we going to do? I am open to ideas 
here.

Best Regards,
Bill Campbell

> Of course disabling interruts and not calling get_timer() or udelay() will
> break the timer - But there is nothing that can be done about that)
>
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 9:19 PM, Reinhard Meyer wrote:
> Dear Graeme Russ,
>> On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2)
>> do. at91 is odd - It looks like it uses interrupts, but get_timer() and
>> udelay() both end up calling get_timer_raw() (with udelay only having
>> millisecond resolution it seems). Some others can be configured to
>> increment the timer using an interrupt. ARM is, quite frankly, a 
>> complete
>> mess - It has a mass of *_timer_masked() functions which the core timer
>> functions are 'wafer thin' wrapper around, udelay() silently resets
>> the timebase trashing get_timer() loops etc.
>
> Please look at current master for at91.
>
> http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/arm926ejs/at91/timer.c;h=a0876879d3907af553d832bea187a062a22b9bd4;hb=5d1ee00b1fe1180503f6dfc10e87a6c6e74778f3
>  
>
>
> AT91 uses a 32 bit hardware register that by means of a prescaler is made
> to increment at a rate in the low megahertz range.
>
> This results in a wrap approximately every 1000 seconds.
>
> Actually this would be sufficient for all known uses of udelay() and 
> get_timer()
Hi All
   Yes, you are correct. It would be sufficient for all known uses 
of udelay and get_timer(). However, Wolfgang has indicated a strong 
desire that the performance of the new API work properly over the full 
32 bit range of the millisecond delay time. That has been the basic 
issue for some time here.
> timeout loops. However, this hardware register is extended to 64 bits 
> by software
> every time it is read (by detecting rollovers).
Yes, but this extension ONLY happens if get_ticks is called via udelay 
or get_timer. It doesn't happen if you are sitting at the command prompt 
or off executing a downloaded stand alone program. You might ask "who 
cares", and I would answer that Wolfgang cares, at least to some level. 
If the timer overflow triggered an interrupt, we could call get_ticks to 
update the extended time inside the interrupt routine. But, as far as I 
know, it doesn't. There are some other ARM processors that have a 32 bit 
clock derived from a 32 kHz crystal, The will work much as you example 
does up to 134217 seconds, in fact much longer than your AT91 example 
does. However, that doesn't touch the 4294967 seconds that the PPC can 
manage. Without interrupts, the 32 bit (or smaller) counters will NEVER 
get to the PPC standard if their tick rate exceeds 1 msec. It may be 
that we need a lower standard, perhaps saying 1000 seconds is enough. 
But that is not my decision to make.
>
> Since a wrap of that 64 bit "tick" would occur after the earth has ended,
> it is simple to obtain milliseconds from it by doing a 64 bit division.
True, but moot because of the above.
Best Regards,
Bill Campbell
>
> Best Regards,
> Reinhard
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 4:13 PM, Graeme Russ wrote:
> Hi Wolfgang
>
> On Thu, May 26, 2011 at 7:16 AM, Wolfgang Denk  wrote:
>> Dear Graeme Russ,
>>
>> In message<4ddd7066.4000...@gmail.com>  you wrote:
 No, not at all. And I already answered this. For example on PPC, just
 reading the timebase would be perfectly sufficient, and simpler and
 more reliable than the current interrupt based approach.
>>> I assume by 'timebase' you mean the 64-bit tick counter. If so, that is
>> By timebase I mean the timebase register, implemented as two 32 bit
>> registers tbu and tbl, holding the upper and the lower 32 bits of the
>> free-running 64 bit counter, respective.
> And remember, not all platforms have this implementation. The AMD sc520
> for example has a microsecond register which counts 0-999 that ticks a
> 16-bit millisecond register and resets to zero. And the millisecond
> register latches the value of the microsecond register and resets
> (the millisecond register) back to zero.
>
> The thing is, this can all be abstracted away via get_tick() which
> (provided it is called every 65 seconds or so) can maintain a software
> version of the timebase register. So, every 65 seconds, the prescaler
> needs to be kicked. Now, if all we want to use get_timer() for is to
> monitor a timeout (which I think might be every single use in U-Boot
> to date) then the while (get_timer(start)<  timeout) loop will work. If
> get_timer() is needed to measure time between two arbitrary events (which
> I 100% agree it should be able to do) then the prescaler will need to be
> kicked (typically by an interrupt)
>
>>> _exactly_ what I am suggesting we do (and what does already happen on ARM).
>> I don't think so.
Hi All,
   Just to be clear, while ARMv7 has a 64 bit performance counter, 
it is not presently used by get_time. This is a change we want to make 
correct?
> On closer inspection, some do, some don't. All ARMv7 (OMAP, S5P, Tegra2)
> do. at91 is odd - It looks like it uses interrupts, but get_timer() and
> udelay() both end up calling get_timer_raw() (with udelay only having
> millisecond resolution it seems).
I am not sure why you say at91 appears to use interrupts. There is a 
comment in arch/arm/cpu/arm930t/at91/timer.c that says "timer without 
interrupts" (line 73). There is the same comment in 
arch/arm/cpu/arm930t/at91rm9200/timer.c Nothing in either routine refers 
to interrupts, so I would say the timer doesn't use them. I could be 
wrong of course.
>   Some others can be configured to
> increment the timer using an interrupt. ARM is, quite frankly, a complete
> mess - It has a mass of *_timer_masked() functions which the core timer
> functions are 'wafer thin' wrapper around, udelay() silently resets
> the timebase trashing get_timer() loops etc.
I sure agree with this last part. The only arm timer I found that 
clearly thought it could use interrupts was in arch/arm/cpu/ixp, and 
that was conditional, not mandatory.
> So let's wind back and distill the approach I am suggesting:
>
>   1) A common prescaler function in /lib/ - It's purpose is to maintain
>  a 1ms resolution timer (if the platform cannot otherwise do so)[1]
>  The prescaler utilises a platform provided get_ticks()[2]
>   2) A get_ticks() function provided by the platform - This function must
>  return an unsigned counter which wraps from all 1's to all 0's - It
>  DOES NOT have to be initialised to zero at system start. get_ticks()
>  hides the low-level tick counter implementation - The sc520 example
>  above is a classic example, so is your PPC tbu/tbl example.
>   3) [Optional]An ISR which calls the prescaler[3]
>
> Now there is an optimisation if your tick counter has a 1ms resolution
> and is not small (i.e. 64-bits) - The prescaler is defined weak, so in
> the platform code, re-implement the prescaler to simply copy the tick
> counter to the timer variable.
>
> And what are the specific implementation types (in decending order of
> preference)? I think:
>   1) A 64-bit micro-second tick counter[5]
>- No interrupts needed
>- Can be used by udelay() and get_timer() trivially
>   2) A 64-bit sub-micro-second tick counter
>- Interrupts most likely undeeded unless the tick frequency is
>  insanely high
>- Can be used by udelay() and get_timer() trivially
>   3) A 64-bit milli-second tick counter
>- No interrupts needed
>- No prescaler needed
>- Can be used by get_timer() trivially
>- udelay() needs another tick source (if available) or be reduced
>  to millisecond resolution
>   4) A 32-bit milli-second tick counter
>- No prescaler needed[6]
>- Max 'glitch free' duration is ~50 days
>- ISR needed to kick prescaler if events longer than 50 days need
>  to be timed
>- Can be used by get_timer() trivially
>- udelay() needs another tick source (if available) or be reduced
>  to millisecond 

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell
On 5/25/2011 12:46 PM, Wolfgang Denk wrote:
> Dear Graeme Russ,
>
> In message  you wrote:
>> I hope to get an implementation agreed upon that does not require
>> interrupts at all, provided a tick counter with sufficiently long roll
>> over time is available (longer than the maximum expected period
>> between 'related' get_timer() calls, for example calls to get_timer()
>> in a timeout testing while loop). This 'bare minimum' implementation
>> can be optionally augmented with an ISR which kicks the timer
>> calculation in the background (typically by just calling get_timer())
>>
>> It really is quite simple in the end.
> The length of this thread shows that it is not as simple as you want
> to make us believe.
>
>
>
> If all you have in mind are timeouts, then we don't need get_timer()
> at all.  We can implement all types of timeout where we wait for some
> eent to happen using udelay() alone.
>
> Need a 10 second timeout? Here we go:
>
>   int cnt = 0;
>   int limit = 10 * 1000;
>   while (!condition) {
>   usleep(1000); /* wait 1 millisec */
>   if (++cnt>  limit)
>   break;
>   }
>   if (cnt>  limit)
>   error("timeout");
>
> get_timer() comes into play only if we want to calculate a time
> difference, for example if we want to run some code where we don't
> know how long it runs, and later come back and check if a certain
> amount of time has passed.
>
> When we don't know how long this code runs, we also cannot know (and
> espeically not in advance) wether or not this time will be longer or
> not than the rollover time of the tick counter.
>
>
> Your plan to require that get_timer() gets called "often enough" to
> prevent or detect tick counter overflows is putting things on their
> head.  It should be the opposite:  The implementation of get_timer()
> should be such that it becomes independent from such low level
> details.
>
HI all,
   We also cannot know if this code disables interrupts. If it does, 
the existing PPC design is broken. If interrupts are disabled during the 
entire code being executed, the elapsed run time will be 0. You can say 
that disabling interrupts in the code is not allowed. Fine, then that 
becomes a constraint on the code. Calling get_timer explicitly often 
enough is also a constraint on the code. One constraint is acceptable to 
you, the other is not. Ok, that's fine too. If the interrupt routine 
calls get_timer, then get_timer is called "often enough" and everything 
works the same as it presently does on PPC, It is not different than 
requiring the interrupt routine to be executed "often enough". The two 
methods are functionally identical. The only problems arise on systems 
that don't support interrupts and don't have any timers with enough bits 
available to meet the 4294967 seconds maximum interval requirement. 
Those systems will be "broken" no matter what we do, as we have all agreed.

 Right now, almost all ARM cases are broken, because they have short 
timers and don't use interrupts. In some cases, there are actual bugs 
involved. We can make these cases less broken than they now are with a 
common get_timer approach as outlined previously. However, we cannot fix 
them to the standard Wolfgang is stating here, to be not "broken". So, 
back to what I was asking before, is the improvement worth the effort if 
the result is still broken?

Best Regards,
Bill Campbell
> I have stated this before:  I consider any design that requires
> get_timer() to be called "often enough" broken.
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-25 Thread J. William Campbell

On 5/24/2011 10:17 PM, Wolfgang Denk wrote:

Dear "J. William Campbell",

In message<4ddc31eb.6040...@comcast.net>  you wrote:
...

A tick is defined as the smallest increment of system time as measured by a
computer system (seehttp://en.wikipedia.org/wiki/System_time):

System time is measured by a system clock, which is typically
implemented as a simple count of the number of ticks that have
transpired since some arbitrary starting date, called the
epoch.

Unfortunately, this definition is obsolete, and has been for quite some

Do you have any proof for such a claim?

Hi Wolfgang,
 Well, yes, in fact the same reference you used. Note that the 
statement "A tick is defined as the smallest increment of system time as 
measured by a computer system" is NOT found in 
http://en.wikipedia.org/wiki/System_time. That page is defining system 
time, and what we are discussing is the definition of a "tick". In fact, 
on the same wiki page you cite, there IS the statement "Windows NT 
<http://en.wikipedia.org/wiki/Windows_NT> counts the number of 
100-nanosecond ticks since 1 January 1601 00:00:00 UT as reckoned in the 
proleptic Gregorian calendar 
<http://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar>, but returns 
the current time to the nearest millisecond". Here 100 nanosecond ticks 
clearly does not refer to any "hardware" 100 ns clock that exists on the 
pc. The 100 ns is a computed (dare I say virtual) "tick" value. The 
point here is that the definition of tick is yours, not wikipedia.org's. 
(Although we all are aware if it is wikipedia, it MUST be so).  Further, 
http://en.wikipedia.org/wiki/Jiffy_%28time%29#Use_in_computing contains 
the statement "In computing <http://en.wikipedia.org/wiki/Computing>, a 
jiffy is the duration of one tick of the system timer 
<http://en.wikipedia.org/wiki/System_time> interrupt 
<http://en.wikipedia.org/wiki/Interrupt>.". If a tick is the smallest 
increment of system time as measured by the computer system, the "of the 
system timer interrupt" part of the statement would be unnecessary. The 
fact it IS present indicates there are other kinds of ticks present in 
the universe.
AFAIK, all timers "tick", and the definition of the tick rate is 
1/timer resolution. The concept of "timer ticks" and "clock ticks" has 
been around forever, and exists independent of System Time. For example, 
there may be a watchdog timer on a system that does not measure time at 
all, yet the watchdog still ticks. When you read a value from a timer 
that was not the "system timer", what do you call the value you read? I 
would call it ticks, and I bet just about everybody else would too.
The only reason I feel this point matters at all is that when one 
refers to a routine called "get_ticks", it is not obvious to me which 
timer ticks are being referenced. You are saying that, by definition, it 
refers to the "system clock".  My feeling is that it is not obvious why 
that is so on a system that has many clocks. The name of the function or 
an argument to the function should, IMNSHO,  specify which timer ticks 
are being returned.


Best Regards,
Bill Campbell

years.  When computers had a single timer, the above definition worked,
but it no longer does, as many (most?) computers have several hardware
timers. A "tick" today is the time increment of any particular timer of
a computer system. So, when one writes a function called get_ticks on a
PPC, does one mean read the decrementer, or does one read the RTC or
does one read the TB register(s) A similar situation exists on the
Blackfin BF531/2/3, that has a preformance counter, a real-time clock,
and three programmable timers. Which tick do you want? For each u-boot

Please re-read the definition.  At least as far as U-Boot and Linux
are concerned, there is only a single clock source used to implement
the _system_time_.  And I doubt that other OS do different.


implementation, we can pick one timer as the "master" timer, but it may
not be the one with the most rapid tick rate. It may be the one with the
most USEFUL tick rate for get_timer. If you take the above definition at
face value, only the fastest counter value has ticks, and all other
counters time increments are not ticks. If they are not ticks, what are
they?

Clocks, timers?


Best regards,

Wolfgang Denk



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 5:17 PM, Graeme Russ wrote:
> On Wed, May 25, 2011 at 5:19 AM, Wolfgang Denk  wrote:
>> Dear Graeme Russ,
>>
>> In message<4ddbe22d.6050...@gmail.com>  you wrote:
> Why must get_timer() be used to perform "meaningful time measurement?"
 Excellent question!  It was never intended to be used as such.
>>> Because get_timer() as it currently stands can as it is assumed to return
>>> milliseconds
>> Yes, but without any guarantee for accuracy or resolution.
>> This is good enough for timeouts, but nothing for time measurements.
> Out of curiosity, are there any platforms that do not use their most
> accurate source(*) as the timebase for get_timer()? If a platform is using
> it's most accurate, commonly available, source for get_timer() the the
> whole accuracy argument is moot - You can't get any better anyway so
> why sweat the details.
Hi All,
Well, it is not quite that simple. The "accuracy" of the 1 ms 
interrupt rate is controlled in all cases I know about by the resolution 
of the programmable divider used to produce it. It appears that the x86 
uses a 1.19318 MHz crystal oscillator to produce the nominal 1 ms timer 
tick. (There is a typo in line 30 of arch/x86/lib/pcat_timer.c that says 
1.9318. I couldn't make any of the numbers work until I figured this 
out). The tick is produced by dividing the 1.19318 rate999.313 by 1194, 
which produces an interrupt rate of 999.3 Hz, or about 0.068% error. 
However, the performance counter on an x86 is as exact as the crystal 
frequency of the CPU is. FWIW, you can read the performance counter with 
rdtsc on a 386/486 and the CYCLES and CYCLES2 registers on later 
Intel/AMD chips. So yes, there is at least one example of a cpu that 
does not use it's most accurate (or highest resolution) time source.
> (*)I'm actually referring to what is commonly available for that platform,
> and not where a board has a high precision/accuracy source in addition to
> the common source.
>
> As a followup question, how many platforms use two completely independent
> sources for udelay() and get_timer() - x86 does, but I plan to change this
> so the interrupt kicks the new prescaler which can be done at>>  1ms period
> and udelay() and get_timer() will use the same tick source and therefore
> have equivalent accuracy.
Are you sure of this? From what I see in arch/x86/lib/pcat_timer.c, the 
timer 0 is programmed to produce the 1 kHz rate timer tick and is also 
read repeatedly in __udelay to produce the delay value. They even 
preserve the 1194 inaccuracy, for some strange reason. I see that the 
sc520 does appear to use different timers for the interrupt source, and 
it would appear that it may be "exact", but I don't know what the input 
to the prescaler is so I can't be sure. Is the input to the prescaler 
really 8.3 MHz exactly? Also, is the same crystal used for the input to 
the prescaler counter and the "software timer millisecond count". If 
not, then we may have different accuracies in this case as well.

Also of note, it appears that the pcat_timer.c, udelay is not available 
intil interrupts are enabled. That is technically non-compliant, 
although it obviously seems not to matter.

Best Regards,
Bill Campbell
>>> OK, let's wind back - My original suggestion made no claim towards changing
>>> what the API is used for, or how it looks to those who use it (for all
>>> practical intents and purposes). I suggested:
>>>   - Removing set_timer() and reset_timer()
>>>   - Implement get_timer() as a platform independent function
>> Trying to remember what I have read in this thread I believe we have
>> an agreement on these.
>>
>>> Exposing ticks and tick_frequency to everyone via a 'tick' HAL
>> I skip this.  I don't even read it.
> Hmmm, I think it is worthwhile at least comparing the two - What is the
> lesser of two evils
>
>   1. Exposing 'ticks' through a HAL for the prescaler
>   2. Duplicating a function with identical code 50+ times across the source
>  tree
>
> I personally think #2 is way worse - The massive redundant duplication and
> blind copying of code is what has get us into this (and many other) messes
>
>>> ===
>>> Not exposing ticks and tick_frequency to everyone
>>>
>>> In /lib/timer.c
>>>
>>> void prescaler(u32 ticks, u32 tick_frequency)
>>> {
>>>u32 current_ms;
>>>
>>>/* Bill's algorithm */
>>>
>>>/* result stored in gd->timer_in_ms; */
>>> }
>>>
>>> In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c
>>>
>>> static u32 get_ticks(void)
>> Currently we have unsigned long long get_ticks(void)  which is better
>> as it matches existing hardware.
> Matches PPC - Does it match every other platform? I know it does not match
> the sc520 which has a 16-bit millisecond and a 16-bit microsecond counter
> (which only counts to 999 before resetting to 0)
>
> Don't assume every platform can implement a 64-bit tick counter. But yes,
> we should cater for those platforms that can
>

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 12:19 PM, Wolfgang Denk wrote:
> Dear Graeme Russ,
>
> In message<4ddbe22d.6050...@gmail.com>  you wrote:
 Why must get_timer() be used to perform "meaningful time measurement?"
>>> Excellent question!  It was never intended to be used as such.
>> Because get_timer() as it currently stands can as it is assumed to return
>> milliseconds
> Yes, but without any guarantee for accuracy or resolution.
> This is good enough for timeouts, but nothing for time measurements.
>
>> OK, let's wind back - My original suggestion made no claim towards changing
>> what the API is used for, or how it looks to those who use it (for all
>> practical intents and purposes). I suggested:
>>   - Removing set_timer() and reset_timer()
>>   - Implement get_timer() as a platform independent function
> Trying to remember what I have read in this thread I believe we have
> an agreement on these.
>
>> Exposing ticks and tick_frequency to everyone via a 'tick' HAL
> I skip this.  I don't even read it.
>
>> ===
>> Not exposing ticks and tick_frequency to everyone
>>
>> In /lib/timer.c
>>
>> void prescaler(u32 ticks, u32 tick_frequency)
>> {
>>  u32 current_ms;
>>
>>  /* Bill's algorithm */
>>
>>  /* result stored in gd->timer_in_ms; */
>> }
>>
>> In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c
>>
>> static u32 get_ticks(void)
> Currently we have unsigned long long get_ticks(void)  which is better
> as it matches existing hardware.
>
> Note that we also have void wait_ticks(u32) as needed for udelay().
Hi All,
I didn't comment before on the definition of ticks, but I fear 
that was unwise. The stated definition  was:

A tick is defined as the smallest increment of system time as measured by a
computer system (seehttp://en.wikipedia.org/wiki/System_time):

System time is measured by a system clock, which is typically
implemented as a simple count of the number of ticks that have
transpired since some arbitrary starting date, called the
epoch.


Unfortunately, this definition is obsolete, and has been for quite some 
years.  When computers had a single timer, the above definition worked, 
but it no longer does, as many (most?) computers have several hardware 
timers. A "tick" today is the time increment of any particular timer of 
a computer system. So, when one writes a function called get_ticks on a 
PPC, does one mean read the decrementer, or does one read the RTC or 
does one read the TB register(s) A similar situation exists on the 
Blackfin BF531/2/3, that has a preformance counter, a real-time clock, 
and three programmable timers. Which tick do you want? For each u-boot 
implementation, we can pick one timer as the "master" timer, but it may 
not be the one with the most rapid tick rate. It may be the one with the 
most USEFUL tick rate for get_timer. If you take the above definition at 
face value, only the fastest counter value has ticks, and all other 
counters time increments are not ticks. If they are not ticks, what are 
they?

This is one of the reasons I favor the performance monitoring system be 
separate from the get_timer timing methodology, as it will often use a 
different counter and time base anyway. That is also why I prefer to 
have a conversion routine that converts timer values to seconds and 
nano-seconds without reference to "tick" rates, so the user never has to 
deal with these ambiguities. Yes, "under the hood", somebody does, but 
that need not be specified in the external interface. )Nobody has yet 
commented on my proposed performance measuring functions, and I know 
this group well enough not to assume that silence implies consent! )

The prescaler function defined by Graeme needs to read some timer value. 
It turns out that a u32 value is the most appropriate value for the 
get_timer operation.  The value desired is usually not of hardware timer 
tick resolution. It should be the specific bits in the timer, such that 
the LSB resolution is between 1 ms and 0.5 ms. We use greater resolution 
that this only when the counter is short enough that we need lower 
significance bits to make the counter 32 bits. For that reason, the 
function should probably be call something like u32 
read_timer_value(void). I really don't much care what it is called as 
long as we understand what it does.

Best Regards,
Bill Campbell


>> static u32 get_tick_frequency(void)
>> {
>>  u32 tick_frequency;
>>
>>  /* Determine tick frequency */
>>
>>  return tick_frequency;
>> }
> Note that we also have u32 usec2ticks(u32 usec) and u32 ticks2usec(u32 ticks).
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 9:51 AM, Graeme Russ wrote:
> On 25/05/11 00:19, Wolfgang Denk wrote:
> =

Hi all,
 I have a few of questions.
First, it seems that the get_timer interface is expected to work 
properly only after relocation and only when bss is available. I say 
this because the PPC version uses an (initialized) variable, timestamp, 
to hold the time. If that is the case, there is no need to hold the 
timer static data in gd, as I have been doing up to  now. Am I correct ?
  Second, udelay is expected to be available before  bss is 
available, very early in the startup process. There a comments to that 
effect in several places in the code. Therefore, udelay cannot use 
global or static variables. Is this true?
 And third, udelay is only expected/required to work correctly for 
"short", like say 10 seconds. I say this based on comments contained in 
powerpr/lib/time.c. Please ack or nak this as well.

I have a couple of "small" changes to the previously submitted code, 
based partly on the above
> Exposing ticks and tick_frequency to everyone via a 'tick' HAL
>
> In /lib/timer.c

ulong timestamp = 0;
/*
  This routine is called to initialize the timer after BSS is available
*/
void  __weak prescaler_init(void)
{
 u32 tick_frequency = get_tick_frequency();
  /* initialize prescaler variables */
}

> void __weak prescaler(void)
> {
>   u32 ticks = get_ticks();
 /* Bill's algorithm */
/* result stored in timestamp; */
> }
>
> u32 get_timer(u32 base)
> {
#if defined(CONFIG_SYS_NEED_PRESCALER)
>   prescaler();
#endif
>   return timestamp - base;
> }
>
> In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c
>
> u32 get_ticks(void)
> {
>   u32 ticks;
>
>   /* Get ticks from hardware counter */
>
>   return ticks;
> }
>
> u32 get_tick_frequency(void)
> {
>   u32 tick_frequency;
>
>   /* Determine tick frequency - likely very trivial */
>
>   return tick_frequency;
> }
or instead the user may override prescaler_init and prescaler if, for 
some reason, a highly optimized version is desired. Note also that if 
the configuration variable CONFIG_SYS_NEED_PRESCALER is not defined, the 
additional prescaler routines will not be called anywhere so the 
routines should not be loaded. Yes, it it a #define to manage, but it 
should allow the existing u-boots to be the same size as before, with no 
unused code. This size matters to some people a lot!
> ===
> Not exposing ticks and tick_frequency to everyone
>
> In /lib/timer.c
>
> void prescaler(u32 ticks, u32 tick_frequency)
> {
>   u32 current_ms;
>
>   /* Bill's algorithm */
>
>   /* result stored in gd->timer_in_ms; */
> }
>
> In /arch/cpu/soc/timer.c or /arch/cpu/timer.c or /board//timer.c
>
> static u32 get_ticks(void)
> {
>   u32 ticks;
>
>   /* Get ticks from hardware counter */
>
>   return ticks;
> }
>
> static u32 get_tick_frequency(void)
> {
>   u32 tick_frequency;
>
>   /* Determine tick frequency */
>
>   return tick_frequency;
> }
>
> u32 get_timer(u32 base)
> {
>   u32 ticks = get_ticks();
>   u32 tick_frequency = get_tick_frequency();
>
>   prescaler(ticks, tick_frequency);
>
>   return gd->timer_in_ms - base;
> }
>
> ===
> I personally prefer the first - There is only one implementation of
> get_timer() in the entire code and the platform implementer never has to
> concern themselves with what the tick counter is used for. If the API gets
> extended to include get_timer_in_seconds() there is ZERO impact on
> platforms. Using the second method, any new feature would have to be
> implemented on all platforms - and we all know how well that works ;)
>
> And what about those few platforms that are actually capable of generating
> a 1ms timebase (either via interrupts or natively in a hardware counter)
> without the prescaler? Well, with prescaler() declared weak, all you need
> to do in /arch/cpu/soc/timer.c or /arch/cpu/timer.c or
> /board//timer.c is:
>
> For platforms with a 1ms hardware counter:
> void prescaler(void /* or u32 ticks, u32 tick_frequency*/)
> {
>   gd->timer_in_ms = get_milliseconds();
> }
>
> For platforms with a 1ms interrupt source:
> void timer_isr(void *unused)
> {
>   gd->timer_in_ms++;
> }
>
> void prescaler(void /* or u32 ticks, u32 tick_frequency*/)
> {
> }
>
>
> And finally, if the platform supports interrupts but either the hardware
> counter has better accuracy than the interrupt generator or the interrupt
> generator cannot generate 1ms interrupts, configure the interrupt generator
> to fire at any rate better than the tick counter rollover listed in
> previous post and:
>
> void timer_isr(void *unused)
> {
>   /*
>* We are here to stop the tick counter rolling over. All we
>* need to do is kick the prescaler - get_timer() does that :)
>*/
>   get_timer(0);
> }
>
> In summary, platform s

Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/

2011-05-24 Thread J. William Campbell
On 5/23/2011 10:13 PM, Graeme Russ wrote:
> On Tue, May 24, 2011 at 7:02 AM, Graeme Russ  wrote:
>
> [snip]
>> Well, we have no control over the argument in cfi driver (unless you plan
>> to put #ifdef NIOS all over the place)
>>
>> Maybe we could round up the parameter inside get_timer() itself?
> Wow, what was I on! - Oh, thats right, no coffee ;)
>
> The parameter to get_timer() is not a timeout, it is a reference epoch
Hi Graeme,
No, you were not on drugs.  If base != 0 then { if (timer - base 
< 20) return 0;  return timer - base -10} return timer.  That will make 
the subsequent comparison do what you wanted. Differences of less than 
20 are unreliable because of quantization, so we return
0. If the difference was > 20, return a conservative number, the 
difference -10. If the input base was 0, return the timer, the user is 
just trying to read it.

Best Regards
Bill Campbell
> So I think adding reset_timer() to the NIOS get_timer() when the parameter
> is zero is the only option
>
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-24 Thread J. William Campbell
On 5/24/2011 7:12 AM, Wolfgang Denk wrote:
> Dear Albert ARIBAUD,
>
> In message<4ddb4c1c.7030...@aribaud.net>  you wrote:
>> Not sure I still follow what the two options are -- a heads up is welcome.
>> However, I do like the simplicity in having a single time unit (ticks)
>> for the timer API -- asuming it covers all needs -- and providing other
>> time units only as helper functions.
> I don't think using ticks is a good idea.  You would need to change
> all definitiuons of timeouts and delays and such.
>
> Why not using a scaled unit like microsecods or the currently used
> milliseconds?
>
> I wonder why we suddenly have to change everything that has been
> working fine for more than a decade (ignoring the large number of
> incorrect, incomplete or broken implementations).  But so far we
> really never needed anything else but  udelay()  for the typical short
> device related timeouts, and  get_time()  for longer, often protocol
> defined,  timeouts.
>
> Is there anything wrong with these two solutions, based on standard
> units (us and ms) ?
Hi All,
 After really looking into this, I think I agree with Wolfgang 
that using ms for a get_timer timebase is the best way to go. This 
thinking is heavily influenced (in my case anyway) by the fact that in 
the interrupt driven cases (and these are the ONLY fully compliant cases 
ATM I think), the "cost" of using ms is 0, because that is the "native" 
unit in which the timer ticks. This makes everything real simple. We 
can, right today, produce an API that supports u32 get_time_ms(void) for 
all CPUs in use. This would allow u32 get_timer(u32 base) to continue to 
exist as-is. These implementations would still be technically "broken" 
in the non-interrupt case, but they would work at least as well as they 
presently do. In fact, they would operate better because they would all 
use a single routine, not a bunch of different routines (some of which I 
am pretty sure have errors). Wolfgang would need to accept the fact that 
we are not yet "fixing" all the non-interrupt cases. This needs to be 
done, but is a different problem (I hope). In the non-interrupt case 
there is some cost in converting to ms from the native units, but we are 
in a timout loop, so a few extra instructions do not matter much. It is 
only a few 32 bit multiplies, which these days are real fast. If we can 
figure out how to use interrupts eventually, even this will go away.

 Then, we can ADD an optional performance measuring API like 
Scott suggests. This API would be something similar to the following:
struct ticks
{
 u32tickmsb;
 u32   ticklsb;
} ;
struct time_value
{
   u32 seconds;
   u 32 nano_seconds;
};


and the functions would be get_ticks(struct ticks * p) , 
get_tick_delta(struct ticks * minuend, struct ticks * subtrahend),  
and   cvt_tick_delta(struct time_value * result, struct ticks *input). I 
didn't use u64 on purpose, because I don't want any un-necessary 
functions pulled from the library. get_ticks reads a "hi precision" 
counter. How high the precision is depends on the hardware. 
get_tick_delta subtracts two tick values, leaving the difference in the 
first operand. Yes, this is may be simple to do in open code but it is 
better to hide the details. (What if msb is seconds and lsb is 
nanoseconds, then it is not so simple). cvt_tick_delta converts from 
ticks to seconds and nano_seconds. We also need a u32 
get_tick_resolution, which would return the tick resolution in ns. The 
user never needs to do any arithmetic on ticks, so that makes his life 
much easier. However, the user may want to know the resolution of these 
measurements. All these functions are quite fast except possibly 
cvt_tick_delta, which is only needed for printing anyway.
 If the hardware has a performance monitoring counter, 
implementing these functions is quite simple. The PPC and the x86 
certainly do (well, any x86 pentium and above anyway). This is a whole 
lot of chips off our plate. In cases where no such counter exists, we 
will use whatever counters there are available already. Some of the ARM 
counters are running at 32 kHz, so their resolution won't be great, but 
it is what it is. If we find out that some of these other CPUs have a 
performance counter, we will use it. This API will be completely 
optional in u-boot and can be removed by changing a #define.
Thoughts welcome.

Best Regards,
Bill Campbell

> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/

2011-05-23 Thread J. William Campbell
On 5/23/2011 2:02 PM, Graeme Russ wrote:
> On 24/05/11 06:49, J. William Campbell wrote:
>> On 5/23/2011 1:10 PM, Graeme Russ wrote:
>>> On 24/05/11 04:29, Scott McNutt wrote:
>>>> Hi Bill,
>>>>
>>>> J. William Campbell wrote:
>>>>> On 5/23/2011 6:12 AM, Scott McNutt wrote:
>>>>>> Dear Graeme,
>>>>>>
>>>>>> Graeme Russ wrote:
>>>>>>> On 23/05/11 22:19, Scott McNutt wrote:
>>>>>>>> Hi Graeme,
>>>>>>>>
>>>>>>>> Graeme Russ wrote:
>>>>>>>>> There is no need to use get_timer() and reset_timer() and there are
>>>>>>>>> build
>>>>>>>> I must have missed something WRT reset_timer() -- my apologies
>>>>>>>> if I'm covering old ground.
>>>>>>>>
>>>>>>>> When the timestamp is incremented using an interrupt that occurs with
>>>>>>>> a period greater than 1 ms, we can get early timeouts. reset_timer()
>>>>>>>> solved the problem. What's the recommended approach for dealing with
>>>>>>>> this without reset_timer() ?
>>>>> Hi Scott,
>>>>> Are you saying that the interrupt frequency is greater than
>>>>> 1000 times per second, or as I read it, the frequency is less than 1000
>>>>> per second (period greater than 1 ms). If anything, that should make the
>>>>> timer run slow, not fast.
>>>>>I wonder if it is a resolution issue. What are the typical delays in ms
>>>>> you are using?
>>>> Some older nios2 implementations have _fixed_ 10 msec timers.
>>>> Basically, the timestamp is incremented asynchronous to get_timer(0).
>>>> So a  10 msec timeout can occur, for example, almost immediately if
>>>> the timer isn't reset just prior to calling get_timer(0). There are
>>>> more details in the comments for the following commits:
>>>>
>>>> nios2: Reload timer count in reset_timer():
>>>> d8bc0a2889700ba063598de6d4e7d135360b537e
>>>>
>>>> cfi_flash: reset timer in flash status check:
>>>> 22d6c8faac4e9fa43232b0cf4da427ec14d72ad3
>>>>
>>>> I'm totally in favor of cleaning this stuff up. It caused some
>>>> headaches (and wasted time) about 13 months ago. My primary concern
>>>> is to avoid breaking things that currently work for us nios2
>>>> weenies ... at least for any length of time.
>>>>
>>>> Things are a bit tight for me until next week or so. I'll probably
>>>> come up for air around June 1st ... and I'll be glad to help out.
>>>>
>>> Is there any reason why we cannot silently perform a reset_timer() any time
>>> set_timer() is called with a parameter of 0?
>> Hi All,
>>   I assume you mean get_timer(0)?  In principle, you cannot do this
> Yes - it's early, no coffee yet ;)
>> because it could be inside another get_timer(0) loop that has already some
>> time elapsed before you hit the inner get_timer(0). I think what needs to
> Correct, but that is what is already happening for ALL arches in cfi due to
> the reset_timer() before get_timer(0) - I am suggesting sandboxing the
> problem to NIOS until we sort out the timer API properly
>
>> happen on the old NIOS with 10 ms resolution on the interrupt times is that
>> all timer intervals must have 10 ms added and then rounded up to the
>> nearest multiple of 10. Thus, if you wanted to wait for 1 millisecond, you
>> must use an argument of 20 ms to be sure you wait at all! If you use an
>> argument of 10, it won't help because you could get an interrupt right away
>> and exit. If these routines are nios2 specific, you could add a local
>> reset_timer, but I assume they are generic. . Note that if these routines
>> are not nios2 specific, is there any harm in waiting "too long"?
> Well, we have no control over the argument in cfi driver (unless you plan
> to put #ifdef NIOS all over the place)
>
> Maybe we could round up the parameter inside get_timer() itself?
Hi All,
That would probably be the best way to go for now. It might slow 
things down a bit though, if these delays are all desired to be "short", 
like 1 ms. We would expand the 1 ms delay to 15 ms (average) while the 
current (illegal) solution would expand a 1 ms delay to 10 ms always. It 
is worth trying I think. It is also true that any other delays in the 
program will suffer from the 10 ms resolution problem, so your idea is I 
think a good one.

Best Regards,
Bill Campbell
> Regards,
>
> Graeme
>
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] [Timer]Remove calls to [get, reset]_timer outside arch/

2011-05-23 Thread J. William Campbell
On 5/23/2011 1:10 PM, Graeme Russ wrote:
> On 24/05/11 04:29, Scott McNutt wrote:
>> Hi Bill,
>>
>> J. William Campbell wrote:
>>> On 5/23/2011 6:12 AM, Scott McNutt wrote:
>>>> Dear Graeme,
>>>>
>>>> Graeme Russ wrote:
>>>>> On 23/05/11 22:19, Scott McNutt wrote:
>>>>>> Hi Graeme,
>>>>>>
>>>>>> Graeme Russ wrote:
>>>>>>> There is no need to use get_timer() and reset_timer() and there are
>>>>>>> build
>>>>>> I must have missed something WRT reset_timer() -- my apologies
>>>>>> if I'm covering old ground.
>>>>>>
>>>>>> When the timestamp is incremented using an interrupt that occurs with
>>>>>> a period greater than 1 ms, we can get early timeouts. reset_timer()
>>>>>> solved the problem. What's the recommended approach for dealing with
>>>>>> this without reset_timer() ?
>>> Hi Scott,
>>>Are you saying that the interrupt frequency is greater than
>>> 1000 times per second, or as I read it, the frequency is less than 1000
>>> per second (period greater than 1 ms). If anything, that should make the
>>> timer run slow, not fast.
>>>   I wonder if it is a resolution issue. What are the typical delays in ms
>>> you are using?
>> Some older nios2 implementations have _fixed_ 10 msec timers.
>> Basically, the timestamp is incremented asynchronous to get_timer(0).
>> So a  10 msec timeout can occur, for example, almost immediately if
>> the timer isn't reset just prior to calling get_timer(0). There are
>> more details in the comments for the following commits:
>>
>> nios2: Reload timer count in reset_timer():
>>d8bc0a2889700ba063598de6d4e7d135360b537e
>>
>> cfi_flash: reset timer in flash status check:
>>22d6c8faac4e9fa43232b0cf4da427ec14d72ad3
>>
>> I'm totally in favor of cleaning this stuff up. It caused some
>> headaches (and wasted time) about 13 months ago. My primary concern
>> is to avoid breaking things that currently work for us nios2
>> weenies ... at least for any length of time.
>>
>> Things are a bit tight for me until next week or so. I'll probably
>> come up for air around June 1st ... and I'll be glad to help out.
>>
> Is there any reason why we cannot silently perform a reset_timer() any time
> set_timer() is called with a parameter of 0?
Hi All,
  I assume you mean get_timer(0)?  In principle, you cannot do this 
because it could be inside another get_timer(0) loop that has already 
some time elapsed before you hit the inner get_timer(0). I think what 
needs to happen on the old NIOS with 10 ms resolution on the interrupt 
times is that all timer intervals must have 10 ms added and then rounded 
up to the nearest multiple of 10. Thus, if you wanted to wait for 1 
millisecond, you must use an argument of 20 ms to be sure you wait at 
all! If you use an argument of 10, it won't help because you could get 
an interrupt right away and exit. If these routines are nios2 specific, 
you could add a local reset_timer, but I assume they are generic. . Note 
that if these routines are not nios2 specific, is there any harm in 
waiting "too long"?

Best Regards,
Bill Campbell
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell

On 5/23/2011 12:33 PM, Wolfgang Denk wrote:

Dear "J. William Campbell",

In message<4ddaa705.1040...@comcast.net>  you wrote:

   My apologies for being a little (perhaps more than a little)
dense. As they say, "after further review", I think the key aspect of
the PPC timer system is that it uses the decrementer register to
generate an interrupt at a 1 KHz rate. What I have been attempting here
is to produce a timer system that does not use interrupts at all. This
is a fundamental design question. Naturally, systems that can generate

No, it is not.  It is an implementation detail which is irrelevant to
almost all users of U-Boot.

Or do you actucally care if your UART driver uses polling or
interrupts?

Hi All,
  I might care a lot if I expect typeahead to work, or if I am 
sending a command script via a terminal emulator and I don't want to 
loose characters while u-boot is off executing a command. One might 
(correctly) say that that is too much to expect of a boot loader, and 
define polling as "good enough", which I advocate, or not. YMMV.

an interrupt at a 1 KHz rate (or at any (reasonable) higher rate for
that matter) using the decrementer register can produce a 1 ms
resolution software counter that updates "by magic". If my understanding
of this PPC  code is incorrect, somebody please stop me before I make a
further fool of myself!  Is it then a design requirement that the timer
system use interrupts? Is that what is meant by using the PPC system as

No, it is not a design requirement.  It is just one possible
implementation.  Any other method that achieves the same or similar
results is as good.  As noted before, on PowerPC we could have
probably avoided this and just base all timer services on the timebase
register.

[The reason for this dual implementation is historical.  When I wrote
this code, I did not know if we would ever need any fancy timer-
controlled callbacks or similar.  And I needed to implement interrupt
handling for a few other purposes (for example for use in standalone
applications; this was an explicit requirement at that time).  And the
timer was something that made a good and simple example.]


a model? If so, is it possible/reasonable on all the u-boots that are
out there to generate and process timer interrupts at some (hopefully
but not necessarily) programmable rate?

I consider this an implementation detail.  On all architectures it
should be possible to use interrupts, so if the hardware supports a
timer that can generate interrupts it should be possible to use this.
But it is not a requirement that all implementations must work like
  Ok, this is very nice to understand. I will attempt to summarize 
what I think this email and the previous one means. First, the required 
properties of the get_timer routine.


  1. It must have 1 millisecond resolution and accuracy (more or less).
 For instance, the old NIOS timer that incremented the timestamp by
 10 every 10 milliseconds in response to an interrupt is not compliant.
  2. The get_timer routine must have full period accuracy without any
 assumptions regarding what is going on in u-boot. This period is
 4294967 seconds or so.

I then suggest that the minimum system requirements to support the 
u-boot timer are as follows:


   * Either there exists a free-running timer whose period is > 4294967
 and whose resolution is 1 millisecond or better. This probably
 includes all 64 bit timestamp counters.
   * Or there exists a method of generating interrupts at a known rate
 of 1 millisecond or faster. This is a superset of  the current PPC
 method.
   * Or there exists a method of generating interrupts at a known fixed
 rate slower than once a millisecond AND there exists a readable
 free running counter whose period is longer or the same as the
 interrupt rate AND whose resolution is at least 1 ms. This would
 include N bit counters that generate an interrupt when they
 overflow, or some long timestamp counter with another, possibly 
 unrelated interrupt generating methodology that is faster than the

 counter overflow interval.

There are many systems that are able to do all three cases, or two of 
three, so they have a choice on how to implement get_timer(). I claim 
that these are sufficient conditions. I also claim they are necessary. 
If a hardware system does not meet these criteria, I claim it can't meet 
the get_timer requirements. Do such systems exist today that use u-boot? 
I think probably so, but maybe they are corner cases. Note that you 
cannot extend a 32 bit counter to a 64 bit counter reliably without 
using interrupts to do so when get_timer is not being called.


Systems using the first approach above have essentially all their logic 
in the get_timer routine, but MUST use at least some 64 bit arithmetic 
(actually 33 bit arithmetic if you don't count shifts) to do

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell
On 5/23/2011 6:19 AM, Wolfgang Denk wrote:
> Dear Graeme Russ,

> This is what PPC is doing. And I understand that Reinhard did the same
> in software for AT91.
Hi All,
  My apologies for being a little (perhaps more than a little) 
dense. As they say, "after further review", I think the key aspect of 
the PPC timer system is that it uses the decrementer register to 
generate an interrupt at a 1 KHz rate. What I have been attempting here 
is to produce a timer system that does not use interrupts at all. This 
is a fundamental design question. Naturally, systems that can generate 
an interrupt at a 1 KHz rate (or at any (reasonable) higher rate for 
that matter) using the decrementer register can produce a 1 ms 
resolution software counter that updates "by magic". If my understanding 
of this PPC  code is incorrect, somebody please stop me before I make a 
further fool of myself!  Is it then a design requirement that the timer 
system use interrupts? Is that what is meant by using the PPC system as 
a model? If so, is it possible/reasonable on all the u-boots that are 
out there to generate and process timer interrupts at some (hopefully 
but not necessarily) programmable rate?

Best Regards,
Bill Campbell
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell

On 5/23/2011 6:19 AM, Wolfgang Denk wrote:

Dear Graeme Russ,

In message<4dda5334.4060...@gmail.com>  you wrote:

  - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks()
and get_tick_rate() to correctly maintain the ms counter used by
get_timer() - This function can be weak (so next point)

Ditto.  What would that do?  If it gets milliseconds as the name
suggest, that's already the function needed for get_timer()?

OK, there appears to be a consensus that not all hardware actually supports
a free-running timer with 1ms resolution. To overcome this, the idea is to

Indeed.  I guess most of them do not.


create a common library function which maintains the free running counter.
The library function acts as a pre-scaler using a 'raw tick counter' and a
'raw tick rate' supplied by the low level architecture. We define this weak

What are "raw" ticks?  And what are "cooked" ticks, then?

Hi all,
 FWIW,  "cooked" ticks would be 1 ms ticks, although we never 
really use them as such.

so that if the architecture can provide a free running 1ms counter, there
is no (code size) penalty

Why do we need a free running 1ms counter at all?  Any free running
counter of at least millisecoind resolution should be good enough.
Correct. Any free running counter whose resolution is better than one 
millisecond and which is "long enough" that it will not overflow between 
calls to get_timer is sufficient.

This approach eliminates all the existing per-arch code which (attempts) to
manage the time base behind get time. So we simplify each arch down to it's
bare essentials - Provide a counter which increments at a natural fixed
rate and what the rate is - Let common library code deal with the rest.

Did you have a look at the PowerPC implementation?  I'd like to see
this used as reference.
I have looked at it as a reference. However, there is one disadvantage 
in using the PPC code as a reference. It has a 64 bit timestamp. Many 
systems do not have a 64 bit timestamp, but rather a 32 bit timestamp.  
It is possible to extend the 32 bit timestamp to a 64 bit timestamp if 
get_timer is called often enough that there will only be a single 
rollover of the bottom 32 bits  between uses. However, if that condition 
is met, there is no need to extend the timer to 64 bits. Instead, just 
convert the elapsed time since the last call (which you know) to ms and 
be done with it.  As Wolfgang said above, any counter that has better 
than 1 ms resolution  is adequate to the task. The additional 
requirement that we have stated is that the counter be long enough that 
it does not overflow between calls to get_timer. If the counter is 64 
bits long, it pretty much for sure meets this requirement  (although 
bits below 0.5 ms resolution really don't help any). If the timer is 32 
bits long, it will meet any requirements using get_timer to time out 
hardware intervals. My original implementation used a 32 bit divide and 
does exactly that. This is the shortest and simplest approach, and we 
can get that working in all cases quite easily I think. We can avoid the 
"no divide" optimization until everybody is satisfied with what we have.

  - Calling of get_raw_ticks() regularly in the main loop (how ofter will
depend on the raw tick rate, but I image it will never be necessary
to call more often than once every few minutes)

NAK. This concept is fundamentally broken.  I will not accept it.

Some existing timers are fundamentally broken - The glitch at the
0x to 0x rollover or rollover early - The method discussed
in this thread eliminates all such glitches. Provided pre-scaler in /lib/
(triggered by get_timer() usually) is called often enough (71 minutes for a
32-bit 1MHz counter) then there is no need. Even then, it is only important

We already have this nightmare of code for triggering the watchdog on
systems that use it.

Assuming there are places in the main loop that get executed often
enough is a broken concept, and I will not accept any such code.
That is fine with me. The reason this was being done was to attempt to 
emulate, as much as possible, the power PC, where the 64 bit timestamp 
counter allows calls to get_timer separated by many minutes and several 
console commands to work properly. These get timer commands will NOT 
work properly on systems that have a 32 bit counter that overflows every 
200 seconds or so. The call in the idle loop was an attempt to make the 
32 bit systems work more like the 64 bit systems. One may then either


  1.   Define calls to get_timer to measure an elapsed interval
 separated by any returns to the command processor as broken.
  2. Require the use of interrupts to extend the 32 bit timestamp.
 (This may not be possible on all systems as the timer used for
 performance monitoring does not interrupt, etc.)
  3. Allow the call in the idle loop under the assumption that we are
 talking about timing in the minutes range, not a few

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-23 Thread J. William Campbell

On 5/22/2011 11:29 PM, Albert ARIBAUD wrote:

Hi all,

Sorry, could not follow the discussion although I find it very
interesting, so I will handle the task of coming in late and asking the
silly questions.
I am glad you are looking at our discussion. I am sure we are going to 
need all the help/oversight/questions that we can get, as this is a 
change that will affect all architectures.

Le 23/05/2011 07:25, Graeme Russ a écrit :


On Mon, May 23, 2011 at 3:02 PM, J. William Campbell
   wrote:

On 5/22/2011 6:42 PM, Graeme Russ wrote:

OK, so in summary, we can (in theory) have:
   - A timer API in /lib/ with a single u32 get_timer(u32 base) function
   - A HAL with two functions
 - u32 get_raw_ticks()
 - u32 get_raw_tick_rate() which returns the tick rate in kHz (which
   max's out at just after 4GHz)
   - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks()
 and get_tick_rate() to correctly maintain the ms counter used by
 get_timer() - This function can be weak (so next point)
   - If the hardware supports a native 32-bit ms counter, get_raw_ms()
 can be overridden to simply return the hardware counter. In this case,
 get_raw_ticks() would return 1

Are you sure you did not mean 'get_raw_ticks_rate' here? Besides, I'd
like the name to specify the units used: 'get_raw_ticks_rate_in_khz' (or
conversively 'get_raw_ticks_per_ms', depending on which is simpler to
implement and use).
I think you are correct, it was the rate function desired here. I think 
the best way to go is use a "get_raw_tick_rate_in_mhz" function, because 
it is probably the easiest one to implement, and in many cases something 
like it already exists.

   - Calling of get_raw_ticks() regularly in the main loop (how ofter will
 depend on the raw tick rate, but I image it will never be necessary
 to call more often than once every few minutes)

That's to keep track of get_raw_ticks() rollovers, right? And then the
get_timer function (which, again, I would prefer to have '... in ms'
expressed in its name) would call get_raw_ticks() in confidence that at
most one rollover may have occurred since the last time the helper
function was called, so a simple difference of the current vs last tick
value will always be correct.
Exactly so. Note that this same function probably needs to be called in 
udelay for the same reason. More precisely, the get_timer function will 
call get_raw_ms, which will call get_raw_ticks. I think it may be better 
to move get_timer "down" a level in the hierarchy,
so we don't need a get_raw_ms. get_timer would then be part of the HAL. 
One would use a get_timer(0) in order to do what get_raw_ms alone would 
have done. If the user had a good reason, he would then override 
get_timer with his own version. What do you think Graeme? It reduces the 
nesting depth by one level. As for the name change to get_timer_in_ms, I 
would support it. Naturally, such a change would be up to Mr. Denk. 
Since by definition that is what the function does, it seems to be a 
good change from my point of view.

   - If the hardware implements a native 32-bit 1ms counter, no call in
 the main loop is required
   - An optional HAL function u32 get_raw_us() which can be used for
 performance profiling (must wrap correctly)

Hi All,
   Graeme, I think you have stated exactly what is the "best" approach to
this problem.  I will provide a version of "get_raw_ms" that is  initialized
using get_raw_tick_rate that will work for all "reasonable" values of
raw_tick_rate. This will be the "generic" solution. Both the initialization
function and the get_raw_ms function can be overridden if there is reason to
do so, like "exact" clock rates. I will do the same with get_raw_us. This
will be posted sometime on Monday for people to review, and to make sure I
didn't get too far off base. Thank you to both Graeme and Reinhard for
looking at/working on this.. Hopefully, this solution will put this timing
issue to rest for all future ports as well as the presently existing ones.

In Greame's description, I did not see a get_raw_ms, only a get_raw_us.
Was this last one a typo or is that a third HAL function?
get_raw_ms was referenced as a library function a few lines above.  
Right now, I think the functionality we require from the HAL is


  1. get_raw_tick_rate_in_mhz
  2. get_raw_ms
  3. get_raw_ticks
  4. (optional)get_raw_us

There is also  APIs for these functions called get_timer.

I think we need to add a call to another function called 
"initialize_timer_system" or similar that will initialize the data 
structures in gd by calling get_raw_tick_rate_in_mhz. Additionally, I 
think we need to provide a udelay function, simply because it can 
interact with calling get_raw_ms often enough. We are somewhat caught 
between two fires here, in t

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/22/2011 8:26 PM, Reinhard Meyer wrote:
> Dear J. William Campbell,
>> On 5/22/2011 1:15 AM, Reinhard Meyer wrote:
>>> Dear J. William Campbell,
>>>
>>> please demonstrate for me (and others), by a practical example,
>>> how _any_ arithmetic (even less with just shifts and multiplies)
>>> can convert a free running 3.576 MHz (wild example) free running
>>> 32 bit counter (maybe software extended to 64 bits) into a ms
>>> value that will properly wrap from 2**32-1 to 0 ?
>> Hi All
>> I accept the challenge! I will present two ways to do this, one using 
>> a 32 bit by 16 bit divide, and one using only multiplies.
>> This first method is "exact", in that there is no difference in 
>> performance from a "hardware" counter ticking at the 1 ms rate. This 
>> is accomplished by operating the 1 ms counter based on the delta time 
>> in the hardware time base. It is necessary to call this routine often 
>> enough that the hardware counter does not wrap more than once between 
>> calls. This is not really a problem, as this time is 1201 seconds or 
>> so. If the routine is not called for a long time, or at the first 
>> call, it will return a timer_in_ms value that will work for all 
>> subsequent calls that are within a hardware rollover interval. Since 
>> the timer in ms is a 32 bit number anyway. The same rollover issue 
>> will exist if you "software extend" the timer to 64 bits. You must 
>> assume 1 rollover. If it is more than 1, the timer is wrong.
>>
>>
>> The variables in the gd are
>> u32 prev_timer;
>> u32 timer_in_ms;
>> u16 timer_remainder;
>>
>> /* gd->timer remainder must be initialized to 0 (actually, an number 
>> less than 3576, but 0 is nice). Other two variables don't matter but 
>> can be initialized if desired */
>>
>> u32 get_raw_ms()
>> {
>> u32 delta;
>> u32 t_save;
>>
>> read(t_save); /* atomic read of the hardware 32 bit timer running at 
>> 3.576 MHz */
>> delta_t = (t_save - gd->prev_timer) ;
>>
>> gd->prev_timer = t_save;
>> /*
>> Hopefully, the following two lines only results in one hardware 
>> divide when optimized. If your CPU has no hardware divide, or if it 
>> slow, see second method .
>> */
>> gd->timer_in_ms += delta_t / 3576; /* convert elapsed time to ms */
>> gd->timer_remainder += delta_t % 3576; /* add in remaining part not 
>> included above */
>> if (gd->timer_remainder >= 3576) /* a carry has been detected */
>> {
>> ++gd->timer_in_ms;
>> gd->timer_remainder -= 3576; /* fix remainder for the carry above */
>> }
>>
>> return(gd->timer_in_ms)
>> }
>
> Thank you! Basically this is similar to a Bresenham Algorithm.
Hi All,
   Yes, I think you are correct. I didn't know it by that name, but 
i think you are correct. It is a bit different use of the idea, but it 
is very similar.
>
>>
>> This approach works well when the number of ticks per ms is an exact 
>> number representable as a small integer, as it is in this case. It is 
>> exact with a clock rate of 600 MHz, but is not exact for a clock rate 
>> of 666 MHz. 67 is not an exact estimate of ticks per ms, It is 
>> off by 0.5 % That should be acceptable for use as a timeout 
>> delay. The accumulated error in a 10 second delay should be less than 
>> 0.5 ms.
>
> I would think the non exact cases result in such a small error that 
> can be
> tolerated. We are using the ms tick for timeouts, not for providing a 
> clock
> or exact delays. We should just round up when calculating the divider.
Yes, we should round off the divider value, so 666.66 MHz rounds 
to 67 ticks/Ms while 333.33 MHz rounds to 33 ticks/Ms.
>
> Hence the hick-ups that result when this is not called frequent enough to
> prevent a multiple rollover of the raw value between calls do not matter
> either (they should be just documented).
Good, I am glad we agree on this also.
>
>>
>> There is a way that the divide above can be approximated by 
>> multiplying by an appropriate fraction, taking the resulting delta t 
>> in ms, multiplying it by 3576, and subtracting the product from the 
>> original delta to get the remainder. This is the way to go if your 
>> CPU divides slowly or not at all. This approach is presented below.
>>
> [...]
>
> Optimizations would be up to the implementer of such a hardware and work
> only if the divider is a compile time constant. Often the divider will be
> run time determined (AT91 for example).
Correct. I will provide a "generic" version that computes the constants 
at run time.  If the clock rate is a constant, these routines can be 
overridden at compile/link time. This generic version should be 
available on Monday for further review.

Best Regards,
Bill Campbell
>
> Reinhard
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/22/2011 6:42 PM, Graeme Russ wrote:
> OK, so in summary, we can (in theory) have:
>   - A timer API in /lib/ with a single u32 get_timer(u32 base) function
>   - A HAL with two functions
> - u32 get_raw_ticks()
> - u32 get_raw_tick_rate() which returns the tick rate in kHz (which
>   max's out at just after 4GHz)
>   - A helper function in /lib/ u32 get_raw_ms() which uses get_raw_ticks()
> and get_tick_rate() to correctly maintain the ms counter used by
> get_timer() - This function can be weak (so next point)
>   - If the hardware supports a native 32-bit ms counter, get_raw_ms()
> can be overridden to simply return the hardware counter. In this case,
> get_raw_ticks() would return 1
>   - Calling of get_raw_ticks() regularly in the main loop (how ofter will
> depend on the raw tick rate, but I image it will never be necessary
> to call more often than once every few minutes)
>   - If the hardware implements a native 32-bit 1ms counter, no call in
> the main loop is required
>   - An optional HAL function u32 get_raw_us() which can be used for
> performance profiling (must wrap correctly)
Hi All,
   Graeme, I think you have stated exactly what is the "best" 
approach to this problem.  I will provide a version of "get_raw_ms" that 
is  initialized using get_raw_tick_rate that will work for all 
"reasonable" values of raw_tick_rate. This will be the "generic" 
solution. Both the initialization function and the get_raw_ms function 
can be overridden if there is reason to do so, like "exact" clock rates. 
I will do the same with get_raw_us. This will be posted sometime on 
Monday for people to review, and to make sure I didn't get too far off 
base. Thank you to both Graeme and Reinhard for looking at/working on 
this.. Hopefully, this solution will put this timing issue to rest for 
all future ports as well as the presently existing ones.
On a different note, the "graylisting" application is causing 
some (about half) of my replies to the list from showing up. The 
messages are going to the specified individuals correctly, but not 
always to the list. This is apparently because my ISP has so many 
different network address available that the probability of using the 
same one (or same subnet) is not very high. Is there anything that can 
be done about this?

Best Regards,
Bill Campbell
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/22/2011 5:02 PM, Graeme Russ wrote:
> Dear Reinhard,
>
> On Sun, May 22, 2011 at 6:15 PM, Reinhard Meyer
>   wrote:
>> Dear J. William Campbell,
>>
>> please demonstrate for me (and others), by a practical example,
>> how _any_ arithmetic (even less with just shifts and multiplies)
>> can convert a free running 3.576 MHz (wild example) free running
>> 32 bit counter (maybe software extended to 64 bits) into a ms
>> value that will properly wrap from 2**32-1 to 0 ?
>>
>> I fail to see how that will be possible...
> These may help:
>
> http://blogs.msdn.com/b/devdev/archive/2005/12/12/502980.aspx
> http://www.hackersdelight.org/divcMore.pdf
>
> of Google 'division of unsigned integer by a constant'
>
> Basically you will be dividing by a constant value of 3576 which can be
> highly optimised
Hi All,
Yes, you can do it this way. I prefer not to use properties of 
the constant because it doesn't allow multiple clock rates easily. The 
code I posted just a moment ago is easy to change for different rates 
because it does not rely much on the exact bit structure of 3576.
There are simple formulas for all the "magic numbers" in what I posted, 
so it is easy to change the input frequency.
Best Regards,
Bill Campbell
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell

On 5/22/2011 1:15 AM, Reinhard Meyer wrote:

Dear J. William Campbell,

please demonstrate for me (and others), by a practical example,
how _any_ arithmetic (even less with just shifts and multiplies)
can convert a free running 3.576 MHz (wild example) free running
32 bit counter (maybe software extended to 64 bits) into a ms
value that will properly wrap from 2**32-1 to 0 ?

Hi All
  I accept the challenge! I will present two ways to do this, one 
using a 32 bit by 16 bit divide, and one using only multiplies.
This first method is "exact", in that there is no difference in 
performance from a "hardware" counter ticking at the 1 ms rate. This is 
accomplished by operating the 1 ms counter based on the delta time in 
the hardware time base. It is necessary to call this routine often 
enough that the hardware counter does not wrap more than once between 
calls. This is not really a problem, as this time is 1201 seconds or so. 
If the routine is not called for a long time, or at the first call, it 
will return a timer_in_ms value that will work for all subsequent calls 
that are within a hardware rollover interval. Since the timer in ms is a 
32 bit number anyway. The same rollover issue will exist if you 
"software extend" the timer to 64 bits. You must assume 1 rollover. If 
it is more than 1, the timer is wrong.



The variables in the gd are
u32 prev_timer;
u32 timer_in_ms;
u16 timer_remainder;

/* gd->timer remainder must be initialized to 0 (actually, an number 
less than 3576, but 0 is nice). Other two variables don't matter but can 
be initialized if desired  */


u32 get_raw_ms()
{
u32 delta;
   u32  t_save;

  read(t_save);   /* atomic read of the hardware 32 bit timer 
running at 3.576 MHz */

  delta_t = (t_save  - gd->prev_timer) ;

  gd->prev_timer =  t_save;
 /*
   Hopefully, the following two lines only results in one hardware 
divide when optimized. If your CPU has no hardware divide, or if it 
slow, see second method .

*/
 gd->timer_in_ms += delta_t  / 3576; /* convert elapsed time to ms */
 gd->timer_remainder += delta_t  % 3576; /* add in remaining part 
not included above */

 if (gd->timer_remainder >= 3576) /* a carry has been detected */
{
   ++gd->timer_in_ms;
   gd->timer_remainder -= 3576; /* fix remainder for the carry above */
}

 return(gd->timer_in_ms)
}

This approach works well when the number of ticks per ms is an exact 
number representable as a small integer, as it is in this case. It is 
exact with a clock rate of 600 MHz, but is not exact for a clock rate of 
666 MHz. 67 is not an exact estimate of ticks per ms, It is off by 
0.5 % That should be acceptable for use as a timeout delay. The 
accumulated error in a 10 second delay should be less than 0.5 ms.


There is a way that the divide above can be approximated by multiplying 
by an appropriate fraction, taking the resulting delta t in ms, 
multiplying it by 3576, and subtracting the product from the original 
delta to get the remainder.  This is the way to go if your CPU divides 
slowly or not at all. This approach is presented below.


the vaues in gd are as follows:

u32 prev_timer;
u32 timer_in_ms;

/*
One tick of the 3.576 MHz timer corresponds to 1/3.576/1000 ms,
or 0.000279642 ms. Scale the fraction by 65536 (16 bit shift),
you get 37532.9217
*/
u32 get_raw_ms(void)
{
  u32  t_save;
  u32  temp;

  /* read the hardware 32 bit timer running at 3.576 MHz */
  read_timer_atomic(t_save);
  t_save -= gd->prev_timer; /* get "delta time" since last call */
  gd->prev_timer += t_save; /* assume we will use all of the counts */

  /*
   * This first while loop is entered for any delta time > about 18.3 
ms. The

   * while loop will execute twice 2.734% of the time, otherwise once.
   */
  while (t_save > 65535)
  {
temp = t_save >> 16;   /* extract msb */
temp  = ((temp * 37532) + ((temp * 60404) >> 16)) >> 11;
/* temp  = (temp * 37532) >> 11; */
gd->timer_in_ms += temp;
t_save  -= temp * 3576;
  }
  /*
   * This second while loop is entered for 94.837% of all possible 
delta times,
   * 0 through 0X. The second while loop will execute twice 
0.037% of

   * the time, otherwise once.
   */
  while (t_save >= 3576)
  {
temp  = (t_save * 37532) >> (16 + 11);
if (temp == 0)
  temp = 1; /* we know that 1 works for sure */
gd->timer_in_ms += temp;
t_save  -= temp * 3576;
  }
  gd->prev_timer -= t_save; /* restore any counts we didn't use this 
time */

  return gd->timer_in_ms;
}

I have tested this code and it seems to work fine for me. I have 
attached a more readable copy as "example .c" for those who wish to play 
around with it. In my original post, I had a version of this code that 
could be used with d

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-22 Thread J. William Campbell
On 5/21/2011 11:23 PM, Graeme Russ wrote:
> On 22/05/11 14:26, Reinhard Meyer wrote:
>> Dear Graeme Russ,
>>> Hi All,
>>>
>>> I've just had a good look through the timer API code with the view of
>>> rationalising it and fixing up some of the inconsistencies. What I found
>>> was a little scarier than I expected! Anyway, here is a write-up of what I
>>> found - Please comment
>> We have been at this discussion a multiple of times :) but never reached a
>> consent.
>>
>> However, at current master, I have reduced at91 architecture to only use
>> get_timer(base), set_timer() never existed and reset_timer() has been 
>> removed.
> Excellent
>
>> As it showed up recently, common cfi code still calls reset_timer() - which
>> certainly
>> can be fixed with little effort...
> Yes, this is one of the easy fixes as all call sites already use the start
> = get_timer(0), elapsed = get_timer(start) convention anyway - The
> reset_timer() calls are 100% redundant (provided get_timer() behaves
> correctly at the 32-bit rollover for all arches)
>
>>> The U-Boot timer API is not a 'callback' API and cannot 'trigger' a
>>> function call after a pre-determined time.
>> that would be too complex to implement and of little use in a single task
>> system. u-boot can do fine with polling.
> I am in no way suggesting this - I just want to clarify the API for anyone
> who needs to use it
>
>>> NOTE: http://lists.denx.de/pipermail/u-boot/2010-June/073024.html appears
>>> to imply the following implementation of get_timer() is wrong:
>>>
>>>  ulong get_timer(ulong base)
>>>  {
>>>  return get_timer_masked() - base;
>>>  }
>> Is is not wrong as long as get_timer_masked() returns the full 32 bit space
>> of numbers and 0x is followed by 0x. Most implementations
>> probably do NOT have this property.
>>> U-Boot Timer API Details
>>> 
>>> There are currently three functions in the U-Boot timer API:
>>>  ulong get_timer(ulong start_time)
>> As you point out in the following, this is the only function required.
>> However it REQUIRES that the internal timer value must exploit the full
>> 32 bit range of 0x to 0x before it wraps back to 0x.
> So this needs to be clearly spelt out in formal documentation
>
>>> Rationalising the API
>>> =
>>> Realistically, the value of the free running timer at the start of a timing
>>> operation is irrelevant (even if the counter wraps during the timed period).
>>> Moreover, using reset_timer() and set_timer() makes nested usage of the
>>> timer API impossible. So in theory, the entire API could be reduced to
>>> simply get_timer()
>> Full ACK here !!!
> I don't think there will be much resistance to this
>
>>> 3. Remove reset_timer_masked()
>>> --
>>> This is only implemented in arm but has propagated outside arch/arm and
>>> into board/ and drivers/ (bad!)
>>>
>>> regex "[\t ]*reset_timer_masked\s*\([^)]*\);" reveals 135 callers!
>>>
>>> A lot are in timer_init() and reset_timer(), but the list includes:
>>>- arch/arm/cpu/arm920t/at91rm9200/spi.c:AT91F_SpiWrite()
>>>- arch/arm/cpu/arm926ejs/omap/timer.c:__udelay()
>>>- arch/arm/cpu/arm926ejs/versatile/timer.c:__udelay()
>>>- arch/arm/armv7/s5p-common/timer.c:__udelay()
>>>- arch/arm/lh7a40x/timer.c:__udelay()
>>>- A whole bunch of board specific flash drivers
>>>- board/mx1ads/syncflash.c:flash_erase()
>>>- board/trab/cmd_trab.c:do_burn_in()
>>>- board/trab/cmd_trab.c:led_blink()
>>>- board/trab/cmd_trab.c:do_temp_log()
>>>- drivers/mtd/spi/eeprom_m95xxx.c:spi_write()
>>>- drivers/net/netarm_eth.c:na_mii_poll_busy()
>>>- drivers/net/netarm_eth.c:reset_eth()
>>>- drivers/spi/atmel_dataflash_spi.c/AT91F_SpiWrite()
>> Fixed in current master.
> Excellent. I have not pulled master for a little while, guess I should
>
>>>- If hardware supports microsecond resolution counters, get_timer() could
>>>  simply use get_usec_timer() / 1000
>> That is wrong. Dividing 32 bits by any number will result in a result that
>> does not
>> exploit the full 32 bit range, i.e. wrap from (0x/1000) to 
>> 0x,
>> which makes time differences go wrong when they span across such a wrap!
>>
> Yes, this has already been pointer out - 42 bits are needed as a bare
> minimum. However, we can get away with 32-bits provided get_timer() is
> called at least every 71 minutes
>
> P.S. Can we use the main loop to kick the timer?
>
>>>- get_usec_timer_64() could offer a longer period (584942 years!)
>> Correct. And a "must be" when one uses such division.
> Unless we can rely on get_timer() to be called at least every 71 minutes in
> which case we can handle the msb's without error in software
>
>> But you have to realize that most hardware does not provide a simple means to
>> implement a timer that runs in either exact microseconds or exact
>> milliseconds.
> This is where t

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-21 Thread J. William Campbell
On 5/21/2011 9:26 PM, Reinhard Meyer wrote:
> Dear Graeme Russ,
>> Hi All,
>>
>> I've just had a good look through the timer API code with the view of
>> rationalising it and fixing up some of the inconsistencies. What I found
>> was a little scarier than I expected! Anyway, here is a write-up of what I
>> found - Please comment
> We have been at this discussion a multiple of times :) but never reached a 
> consent.
>
> However, at current master, I have reduced at91 architecture to only use
> get_timer(base), set_timer() never existed and reset_timer() has been removed.
>
> As it showed up recently, common cfi code still calls reset_timer() - which 
> certainly
> can be fixed with little effort...
>
>> At the lowest level, the U-Boot timer API consists of a unsigned 32-bit
>> free running timestamp which increments every millisecond (wraps around
>> every 4294967 seconds, or about every 49.7 days). The U-Boot timer API
>> allows:
>>- Time deltas to be measured between arbitrary code execution points
>>  ulong start_time;
>>  ulong elapsed_time;
>>
>>  start_time = get_timer(0);
>>  ...
>>  elapsed_time = get_timer(start_time);
>>
>>- Repetition of code for a specified duration
>>  ulong start_time;
>>
>>  start_time = get_timer(0);
>>
>>  while (get_timer(start_time)<   REPEAT_TIME) {
>>  ...
>>  }
>>
>>- Device timeout detection
>>  ulong start_time;
>>
>>  send_command_to_device();
>>  start = get_timer (0);
>>  while (device_is_busy()) {
>>  if (get_timer(start)>   TIMEOUT)
>>  return ERR_TIMOUT;
>>  udelay(1);
>>  }
>>  return ERR_OK;
> correct.
>
>> The U-Boot timer API is not a 'callback' API and cannot 'trigger' a
>> function call after a pre-determined time.
> that would be too complex to implement and of little use in a single task
> system. u-boot can do fine with polling.
>
>> NOTE: http://lists.denx.de/pipermail/u-boot/2010-June/073024.html appears
>> to imply the following implementation of get_timer() is wrong:
>>
>>  ulong get_timer(ulong base)
>>  {
>>  return get_timer_masked() - base;
>>  }
> Is is not wrong as long as get_timer_masked() returns the full 32 bit space
> of numbers and 0x is followed by 0x. Most implementations
> probably do NOT have this property.
>
>> U-Boot Timer API Details
>> 
>> There are currently three functions in the U-Boot timer API:
>>  ulong get_timer(ulong start_time)
> As you point out in the following, this is the only function required.
> However it REQUIRES that the internal timer value must exploit the full
> 32 bit range of 0x to 0x before it wraps back to 0x.
>
>>  void set_timer(ulong preset)
>>  void reset_timer(void)
>>
>> get_timer() returns the number of milliseconds since 'start_time'. If
>> 'start_time' is zero, therefore, it returns the current value of the
>> free running counter which can be used as a reference for subsequent
>> timing measurements.
>>
>> set_timer() sets the free running counter to the value of 'preset'
>> reset_timer() sets the free running counter to the value of zero[1]. In
>> theory, the following examples are all identical
>>
>>  
>>  ulong start_time;
>>  ulong elapsed_time;
>>
>>  start_time = get_timer(0);
>>  ...
>>  elapsed_time = get_timer(start_time);
>>  
>>  ulong elapsed_time;
>>
>>  reset_timer();
>>  ...
>>  elapsed_time = get_timer(0);
>>  
>>  ulong elapsed_time;
>>
>>  set_timer(0);
>>  ...
>>  elapsed_time = get_timer(0);
>>  
>>
>> [1] arch/arm/cpu/arm926ejs/at91/ and arch/arm/cpu/arm926ejs/davinci/ are
>> exceptions, they set the free running counter to get_ticks() instead
> Not anymore on at91.
>> Architecture Specific Peculiarities
>> ===
>> ARM
>>- Generally define get_timer_masked() and reset_timer_masked()
>>- [get,reset]_timer_masked() are exposed outside arch\arm which is a bad
>>  idea as no other arches define these functions - build breakages are
>>  possible although the external files are most likely ARM specific (for
>>  now!)
>>- Most CPUs define their own versions of the API get/set functions which
>>  are wrappers to the _masked equivalents. These all tend to be the same.
>>  The API functions could be moved into arch/arm/lib and made weak for
>>  the rare occasions where they need to be different
>>- Implementations generally look sane[2] except for the following:
>>  - arm_intcm - No timer code (unused CPU arch?)
>>  - arm1136/mx35 - set_timer() is a NOP
>>  - arm926ejs/at91 - reset_timer() sets counter to g

Re: [U-Boot] [RFC] Review of U-Boot timer API

2011-05-21 Thread J. William Campbell
On 5/21/2011 5:06 PM, Graeme Russ wrote:
> On 22/05/11 01:33, J. William Campbell wrote:
>> On 5/21/2011 5:38 AM, Graeme Russ wrote:
>>> Hi All,
>>> 4. Implement microsecond API - get_usec_timer()
>>> ---
>>>- Useful for profiling
>>>- A 32-bit microsecond counter wraps in 71 minutes - Probably OK for most
>>>  U-Boot usage scenarios
>>>- By default could return get_timer() * 1000 if hardware does not support
>>>  microsecond accuracy - Beware of results>   32 bits!
>> Hi All,
>>I think the multiply overflowing an unsigned long is ok here as long
>> as the timeout value you desire is less than 71 seconds. This assumes that
>> the CPU returns the correct lower 32 bits when overflow occurs, but I think
>> this is the "normal" behavior(?)
> I think you mean 71 minutes - 2^32 = 4294967296 (usec) = 4294967.296 (msec)
> = 4294.967296 s = 71.582788267 minutes
>
> So provided any millisecond timer using a 32-bit microsecond timer as a
> time base is called every 71 minutes, a 32-bit microsecond  timer should
> suffice to keep the msb's accurately maintained by software
>
Hi Graeme,
 Yes, you are certainly correct, it is minutes, not seconds.
>>>- If hardware supports microsecond resolution counters, get_timer() could
>>>  simply use get_usec_timer() / 1000
>> I think this actually is NOT equivalent to the "current" API  in
>> that the full 32 bits of the timer is not available and as a result the
>> "wrapping" properties of a 32 bit subtract for delta times will not work
>> properly. If a "larger"  counter is available in hardware, then it is
>> certainly possible to do a 64 by 32 bit divide in get_timer, but probably
>> you don't want to do that  either. As previously discussed, it is possible
>> to extract a 32 bit monotonic counter of given resolution (microsecond or
>> millisecond resolution) from a higher resolution counter using a shift to
>> "approximately" the desired resolution followed by a couple of multiply/add
>> functions of 32 bit resolution.  To do this with a microsecond resolution,
>> a 42 bit or larger timer is required. The "extra" bits can be provided in
> Of course, how silly of me - To downscale the microsecond timer down to
> milliseconds you need to have at least 1000 times more resolution
> (9.965784285 bits) - It was late ;)
>
>> software as long as the get_timer/get_usec_timer routines are called more
>> often than every 71/2 sec, so that a correct delta in microseconds can be
>> obtained. Note that when the timer is not actively in use (not  called
>> often enough), the millisecond timer msb would stop updating, but that
>> wouldn't matter.
> Minutes - see above
Correct.
>> If the hardware supports sub-microsecond accuracy in a "longer"
>> register, say 64 bits, you can just convert the 64 bit hardware timer to 32
>> bit microseconds or milliseconds by a shift  and 32 bit multiplies
> Yes
>
>>Good luck with this effort. I think getting the timer API and also
>> the method of implementation of the interface to the hardware to be the
>> same across all u-boot architectures is a very good idea, and it is
>> possible. However, it is a quite a bit of work and I am glad you are brave
>> enough to try!
> It's all there already - it just needs a little bit of housekeeping :)
Correct, if we do not worry too much about the "low level" details of 
get_timer. It looks to me like there is a lot of cruft there, depending 
on which architecture one looks at. Many implementations create a 
parallel universe of get_ticks or some similar timing system that is 
then used to support get_timer. However, the get_ticks routines are also 
used in timeouts elsewhere in the code. Searching on "get_timer" doesn't 
find these non-standard usages. It would be nice to fix these also, but 
every bit does help!

Best Regards,
Bill Campbell
> Regards,
>
> Graeme
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] ARM timing code refactoring

2011-01-24 Thread J. William Campbell
On 1/24/2011 5:02 AM, Wolfgang Denk wrote:
> Dear Albert ARIBAUD,
>
> In message<4d3d2942.4060...@free.fr>  you wrote:
>> That is assuming a 64-bit timebase, isn't it? for CPUs / SoCs that don't>
>> have such a timebase but only a 32-bit timer, the bogo_ms/jiffy would>
>> not go through the full 32-bit range, which would cause issues with the>
>> timing loops on rollover -- and while a timeout of more than 65 sec may>
>> not be too likely, a timeout starting near the wraparound value of>
>> bogo_ms still could happen.
Hi All,
If you use my approach of shifting right by n bits  (in order to 
get a counter that has "about" 1 ms resolution),  zeros are shifted into 
the top. The counter would never reach more than (32-n) bits in 
magnitude. This is easily fixed by creating a "virtual" n msbs for the 
counter, as must be done if a 64 bit counter is to be simulated. 
Actually, I only need "n" bits, but in any case it requires detecting 
that the counter has "backed up" mod 32 bits and if so, increase the 
virtual counter by 1.

If you really, really, really want a timer that "ticks" at a 1 ms rate, 
instead of bogo_ms/jiffies, I propose the following:

u32 get_time()
{
 u32 delta_t_bogo_ms;
u32  t_save;
#if defined(TIMER_IS_64_BITS)
 u64 tick;
 read(&tick);

  t_save  = tick >> gd->timer_shift;
  delta_t_bogo_ms = t_bogo_ms - gd->prev_timer;
#else
   read(t_save);
   delta_t_bogo_ms = (t_save  - gd->prev_timer) >> gd->timer_shift;
#endif
   gd->prev_timer = t_save;  /* save previous counter */
   if (delta_t_bogo_ms < gd->bogo_ms_per_65_sec)
   {
  gd->fract_timer+=  delta_t_bogo_ms * 
gd->cvt_bogo_ms_to_ms; /* accumulate fraction of ms */
  gd->timer_in_ms += gd->fract_timer >> 16;
 gd->fract_timer &= 0X;
   }
   else
 gd->fract_timer = 0; /* start accumulating from 0 fraction */
   return(gd->timer_in_ms)
}

This routine will create a timer in ms, that will be 
accurate as long as this routine is called either once about ever 65 
seconds or once per time base rollover, whichever is less. Nested timer 
use is ok. If the timer is not called frequently enough, it will return 
the same value twice, but after that it will start timing normally. This 
shouldn't matter, as the second returned value will  be the start of a 
timing loop. Timeout values can be the full 32 bits, as long as you keep 
calling the routine frequently enough. No initialization is required.
   Note that you can (and probably should) use the bottom 32 
bits of the hardware timebase as a 32 bit timebase unless the clock 
would overflow in 65 seconds (running faster than about 66 MHz), or if 
you want to  relax the 65 seconds.
If you want to save a word in the gd data,  use 0X1 instead of 
bogo_ms_per_65_sec (or a more precise value if you know it). Note that 
gd->timer_shift and gd->cvt_bogo_ms_to_ms can also be replaced by 
constants if the clock speed is fixed.

This is more expensive than using the bogo_ms timer, but does have the 
advantage that everything is in ms. FWIW, I think converting from ms to 
some other unit for loop control is fine, as long as we have a standard 
routine to do that that is "cheap". However, others may not agree. For 
sure, passing around 64 bit tick values to do this process is, IMHO, 
vast overkill and not a good general solution, as many processors really 
don't like to do 64 bit operations.

Best Regards.
Bill Campbell


> Sorry, but I don't get it.  What exactly is the problem with a 32 bit
> counter, and why would it not go through the full 32-bit range?
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] ARM timing code refactoring

2011-01-23 Thread J. William Campbell
On 1/23/2011 2:57 PM, Wolfgang Denk wrote:
> Dear Reinhard Meyer,
>
> In message<4d3c9bfc.1010...@emk-elektronik.de>  you wrote:
 get_timer() returns a monotonous upward counting time stamp with a
 resolution of milliseconds. After reaching ULONG_MAX the timer wraps
 around to 0.
>> Exactly that wrap makes the situation so complicated, since the simple code
>> u32 get_timer(void)
>> {
>>return (ticks * 1000ULL) / tickspersec;
>> }
>> won't do that wrap.
> Do you have a better suggestion?
>
 The get_timer() implementation may be interrupt based and is only
 available after relocation.
>> Currently it is used before relocation in some places, I think I have
>> seen it in NAND drivers... That would have to be changed then.
> Indeed.  It is unreliable or even broken now.
>
>> This is already implemented functionally very closely (apart from factoring 
>> and the
>> get_timer(void) change) to this in AT91, the only (academic) hitch is that 
>> it will
>> burp a few billion years after each reset :)
>> What bothers me is the need for 64 bit mul/div in each loop iteration, for 
>> CPUs without
>> hardware for that this might slow down data transfer loops of the style
>>
>> u32 start_time = get_timer();
>> do {
>>  if ("data_ready")
>>  /* transfer a byte */
>>  if (get_timer() - start_time>  timeout)
>>  /* fail and exit loop */
>> } while (--"bytestodo">  0);
>>
>> since get_timer() will be somewhat like:
>>
>>  return (tick * 1000ULL) / tickspersec;
>>
>> As I stated before, tickspersec is a variable in, for example, AT91. So the
>> expression cannot be optimized by the compiler.
> I don't think this is the only way to implement this. How does Linux
> derive time info from jiffies?

Hi All,
   In order to avoid doing 64 bit math, we can define a "jiffie" 
or a "bogo_ms" that is the 64 bit timebase shifted right such that the 
lsb of the bottom 32 bits has a resolution of between 0.5 ms and 1 ms. 
It is then possible to convert the difference between two jiffie/bogo_ms 
values to a number of ms using a 32 bit multiply and a right shift of 16 
bits, with essentially negligible error.  get_bogo_ms() would return a 
32 bit number in bogo_ms, thus the timing loop would be written.

u32 start_time = get_bogo_ms();
do {
 if ("data_ready")
 /* transfer a byte */
 if (bogo_ms_to_ms(get_timer() - start_time)>  TIMEOUT_IN_MS)
 /* fail and exit loop */
} while (--"bytestodo">  0);

u32 get_bogo_ms()
{
 u64 tick;
 read(tick);

  return (tick>>  gd->timer_shift);
}
u32 bogo_ms_to_ms(u32 x)
{
/* this code assumes the resulting ms will be between 0 and 65535, 
or 65 seconds */
return ((x * gd->cvt_bogo_ms_to_ms)>>  16); /* cvt_bogo_ms_to_ms 
is a 16 bit binary fraction */
}

All the above code assumes timeouts are 65 seconds or less, which I 
think is probably fair. Conversion of ms values up to 65 seconds to 
bogo_ms is also easy, and a 32 bit multiplied result is all that is 
required.
What is not so easy is converting a 32 bit timer value to ms.  It can be 
done if the CPU can do a 32 by 32 multiply to produce a 64 bit result, 
use the msb, and possibly correct the result by an add if  bit 32,of the 
timer is set.  You need a 33 bit counter in bogo_ms to get a monotonic, 
accurate 32 bit counter in ms. The powerpc can use a mulhw operation to 
do this, and any CPU that will produce a 64 bit product can do this. 
However, many CPUs do not produce 64 bit products easily. Using division 
to do these operations are even less appealing, as many CPUs do not 
provide hardware division at all. Since it is not necessary to do this 
conversion to easily use timeouts with 1 ms resolution and accuracy,  I 
think the idea of not using a timer in ms but rather bogo_ms/jiffies is 
possibly better?

Best Regards,
Bill Campbell

> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] TIMER cleanup RFC, was: [PATCH 4/4] arm920t/at91/timer: replace bss variables by gd

2010-11-30 Thread J. William Campbell
On 11/30/2010 7:48 AM, Reinhard Meyer wrote:
> Dear J. William Campbell,
>> On 11/30/2010 1:14 AM, Reinhard Meyer wrote:
>>> Dear Wolfgang Denk,
>>>
>>> what we really need is only a 32 bit monotonous free running tick that
>>> increments
>>> at a rate of at least 1 MHz. As someone pointed out a while ago, even
>>> at 1GHz that would
>>> last for four seconds before it rolls over. But a 1HGz counter could
>>> be 64 bit internally
>>> and always be returned as 32 bits when it is shifted right to bring it
>>> into the "few" MHz
>>> range.
>>>
>>> Any architecture should be able to provide such a 32 bit value. On
>>> powerpc that would
>>> simply be tbu|tbl shifted right a few bits.
>>>
>>> An architecture and SoC specific timer should export 3 functions:
>>>
>>> int timer_init(void);
>>> u32 get_tick(void); /* return the current value of the internal free
>>> running counter */
>>> u32 get_tbclk(void); /* return the rate at which that counter
>>> increments (per second) */
>>>
>>> A generic timer function common to *most* architectures and SoCs would
>>> use those two
>>> functions to provice udelay() and reset_timer() and get_timer().
>>> Any other timer functions should not be required in u-boot anymore.
>>>
>>> However get_timer() and reset_timer() are a bit of a functional problem:
>>>
>>> currently reset_timer() does either actually RESET the free running
>>> timer (BAD!) or
>>> remember its current value in another (gd-)static variable which later
>>> is subtracted
>>> when get_timer() is called. That precludes the use of several timers
>>> concurrently.
>>>
>>> Also, since the 1000Hz base for that timer is usually derived from
>>> get_tick() by
>>> dividing it by some value, the resulting 1000Hz base is not exhausting
>>> the 32 bits
>>> before it wraps to zero.
> (###)
Hi All,
   A correct method to provide a full 32 bits of resolution at a 1 
ms rate is simple. It requires maintaining a software timer in ms. This 
timer is updated by converting the difference in the hardware time base 
to ms and then adding the correct number of ms to the software timer.
As previously discussed, this only requires calling get_timer once a 
second or so. The conversion can be made simply in most cases, 
especially if the clock rate is a "nice" number. Even in the worst case, 
it is not too hard to "adjust" the saved hardware timer value to contain 
any remainder ticks. Yes, if this is done incorrectly, the results are bad.
>>> Therefore I propose two new functions that are to replace
>>> reset_timer() and get_timer():
>>>
>>> u32 init_timeout(u32 timeout_in_ms); /* return the 32 bit tick value
>>> when the timeout will be */
>>> bool is_timeout(u32 reference); /* return true if reference is in the
>>> past */
>>>
>>> A timeout loop would therefore be like:
>>>
>>> u32 t_ref = timeout_init(3000);/* init a 3 second timeout */
>>>
>>> do ... loop ... while (!is_timeout(t_ref));
>>>
>>> coarse sketches of those functions:
>>>
>>> u32 init_timeout(u32 ms)
>>> {
>>>  return get_ticks() + ((u64)get_tbclk() * (u64)ms) / (u64)1000;
>>> }
>>>
>>> bool is_timeout(u32 reference)
>>> {
>>>  return ((int)get_ticks() - (int)reference)>   0;
>>> }
>>>
>>> Unfortunately this requires to "fix" all uses of get_timer() and
>>> friends, but I see no other
>>> long term solution to the current incoherencies.
>>>
>>> Comments welcome (and I would provide patches)...
>> Hi All,
>>The idea of changing the get_timer interface to the
>> init_timeout/is_timeout pair has the advantage that it is only necessary
>> to change the delay time in ms to an internal timebase once, and after
>> that, only a 32-bit subtraction is required.
> Exactly.
>> I do not however like the
>> idea of using 64 bit math to do so, as on many systems this is quite
>> expensive. However, this is a feature that can be optimized for
>> particular CPUs.
> 64 Bit math is only necessary when get_tbclk() times the maximun anticipated
> timeout (in ms) is larger than 32 bits. This happens easyly when tbclk is a
> few MHz.
What I think you mean to say here is that the delay time fits easily 
into 32 bits for all reasonable clock rates. For instance, a 1 GHz clock 
rate yields a

Re: [U-Boot] TIMER cleanup RFC, was: [PATCH 4/4] arm920t/at91/timer: replace bss variables by gd

2010-11-30 Thread J. William Campbell
On 11/30/2010 1:14 AM, Reinhard Meyer wrote:
> Dear Wolfgang Denk,
>
> what we really need is only a 32 bit monotonous free running tick that 
> increments
> at a rate of at least 1 MHz. As someone pointed out a while ago, even at 1GHz 
> that would
> last for four seconds before it rolls over. But a 1HGz counter could be 64 
> bit internally
> and always be returned as 32 bits when it is shifted right to bring it into 
> the "few" MHz
> range.
>
> Any architecture should be able to provide such a 32 bit value. On powerpc 
> that would
> simply be tbu|tbl shifted right a few bits.
>
> An architecture and SoC specific timer should export 3 functions:
>
> int timer_init(void);
> u32 get_tick(void); /* return the current value of the internal free running 
> counter */
> u32 get_tbclk(void); /* return the rate at which that counter increments (per 
> second) */
>
> A generic timer function common to *most* architectures and SoCs would use 
> those two
> functions to provice udelay() and reset_timer() and get_timer().
> Any other timer functions should not be required in u-boot anymore.
>
> However get_timer() and reset_timer() are a bit of a functional problem:
>
> currently reset_timer() does either actually RESET the free running timer 
> (BAD!) or
> remember its current value in another (gd-)static variable which later is 
> subtracted
> when get_timer() is called. That precludes the use of several timers 
> concurrently.
>
> Also, since the 1000Hz base for that timer is usually derived from get_tick() 
> by
> dividing it by some value, the resulting 1000Hz base is not exhausting the 32 
> bits
> before it wraps to zero.
>
> Therefore I propose two new functions that are to replace reset_timer() and 
> get_timer():
>
> u32 init_timeout(u32 timeout_in_ms); /* return the 32 bit tick value when the 
> timeout will be */
> bool is_timeout(u32 reference); /* return true if reference is in the past */
>
> A timeout loop would therefore be like:
>
> u32 t_ref = timeout_init(3000);   /* init a 3 second timeout */
>
> do ... loop ... while (!is_timeout(t_ref));
>
> coarse sketches of those functions:
>
> u32 init_timeout(u32 ms)
> {
>   return get_ticks() + ((u64)get_tbclk() * (u64)ms) / (u64)1000;
> }
>
> bool is_timeout(u32 reference)
> {
>   return ((int)get_ticks() - (int)reference)>  0;
> }
>
> Unfortunately this requires to "fix" all uses of get_timer() and friends, but 
> I see no other
> long term solution to the current incoherencies.
>
> Comments welcome (and I would provide patches)...
Hi All,
   The idea of changing the get_timer interface to the 
init_timeout/is_timeout pair has the advantage that it is only necessary 
to change the delay time in ms to an internal timebase once, and after 
that, only a 32-bit subtraction is required. I do not however like the 
idea of using 64 bit math to do so, as on many systems this is quite 
expensive. However, this is a feature that can be optimized for 
particular CPUs. I also REALLY don't like the idea of having a get_ticks 
function, because for sure people will use this instead of the desired 
interface to the timer because it is "better". Then we get back into a 
mess. Since in most cases get_ticks is one or two instructions, please, 
let us hide them in init_timeout/is_timeout.
An alternate approach, which has the merit of being more like 
the originally intended interface, simply disallows reset_timer since it 
is  totally unnecessary. The only dis-advantage of the original approach 
using just get_timer is that the conversion to ms must be considered at 
each call to get_timer, and will require at a minimum one 32 bit integer 
to remember the hardware timer value the last time get_timer was called 
(unless the hardware time can be trivially converted to a 32 bit value 
in ms, which is quite uncommon). This is not a high price to pay, and 
matches the current usage. This is probably for Mr. Denk to decide. If 
we were just starting now, the init_timeout/is_timeout is simpler, but 
since we are not, perhaps keeping the current approach has value.
I would really like to help by providing some patches, but I am 
just way too busy at present.

Best Regards,
Bill Campbell
> Reinhard
>
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Timer implementations

2010-11-01 Thread J. William Campbell
On 10/27/2010 11:02 PM, Reinhard Meyer wrote:
> Dear J. William Campbell,
>> Hi All,
>>
>> I am pretty sure the migration to 64 bits was caused by 1) people not 
>> understanding that the timer operating on time DIFFERENCES would work 
>> fine even if the underlying timer wrapped around (most probable 
>> problem) and possibly 2) broken timer functions causing bogus 
>> timeouts, improperly "fixed" by switching to 64 bits.
>>
>> I think u-boot could get along just fine with only 2 time related 
>> functions, uint_32 get_timer(uint_32 base) and udelay(uint 32 delay). 
>> udelay will only work on "small" values of delay, on the order of 
>> milliseconds. It is to be used when a "short" but "precise" delay in 
>> microsecond resolution is required. Users of get_timer must 
>> understand that it is only valid if it is called "often enough", i.e. 
>> at least once per period of the underlying timer. This is required 
>> because u-boot does not want to rely on interrupts as a timer update 
>> method. Therefore, all uses of get_timer must 1) call it once 
>> initially to get a start value, and 2) call get_timer at least once 
>> per period of the underlying hardware counter. This underlying period 
>> is guaranteed to be at least 4.29 seconds (32 bit counter at 4 GHz). 
>> Note that this does NOT mean that the total wait must be less than 
>> 4.29 seconds, only that the rate at which the elapsed time is TESTED 
>> must be adequate.
>>
>> In order to implement this functionality, at least one hardware timer 
>> of some kind is required. An additional software "timer" in 1 ms 
>> resolution may be useful in maintaining the software time. If the 
>> hardware timer update rate is programmable, u-boot MAY set the update 
>> rate on initialization On initialization, u-boot MAY reset the 
>> hardware timer and MAY reset any associated software timer. The 
>> hardware timer MAY be started on initialization. On each call to 
>> get_timer(), u-boot MUST start the hardware timer if it was not 
>> started already. On calls to get_timer, u-boot MUST NOT reset the 
>> hardware timer if it was already started. The software timer MAY be 
>> reset if u-boot can unambiguously determine that more than 4.29 
>> seconds has elapsed since the last call to get_timer.
>>
>> The simplest case for implementing this scheme is if two programmable 
>> timers exist that can be set to 1ms and 1us. The timers are 
>> initialized at start-up, get_timer just returns the 32 bit 1 ms timer 
>> and udelay just waits for the number of ticks required on the second 
>> timer to elapse. The most common harder case is where there is only 
>> one timer available, it is running at 1 us per tick or faster, and we 
>> cannot control the rate. udelay is still easy, because we can convert 
>> the (small) delay in us to a delay in ticks by a 32 bit multiply that 
>> will not overflow 32 bits even if we have quite a few fractional bits 
>> in the tics per microsecond value. The elapsed ticks required is the 
>> (delay in us * us/per tick) >> # fractional bits in us/per tick. If 
>> that is not close enough for you, you can do it as (delay in us * 
>> (integer part of us/tick)) + ((delay in us * (fractional 
>> part)us/tick) >> # fraction bits). For "nice" numbers, like any 
>> integral number of MHz, there is no fractional
>
>> part. Only numbers like 66 MHz, or 1.666 GHz require messing with the 
>> fractional part.
>> For get_timer, it is a bit harder. The program must keep two 32 bit 
>> global variables, the timer reading "last time" and the software 
>> timer in 1 ms resolution. Whenever get_timer is called, it must 
>> increase the software timer by the number of ms that have elapsed 
>> since the previous update and record the corresponding timer reading 
>> as the new "last time". Note that if the number of ms elapsed is not 
>> an integer (a common case), the value recorded as the "last time" 
>> must be decreased by the number of ticks not included in the 1 ms 
>> resolution software timer. There are many different ways to 
>> accomplish update, depending on what hardware math capabilities are 
>> available, and whether one thinks efficiency is important here. 
>> Conceptually, you convert the elapsed time in ticks into an 
>> equivalent number of ms, add that number to the software timer, store 
>> the current value of the hardware timer in last time, and subtract 
>> any "remainder" ticks from that value. If the

Re: [U-Boot] Timer implementations

2010-10-26 Thread J. William Campbell
On 10/26/2010 6:33 AM, Reinhard Meyer wrote:
> Dear Wolfgang Denk,
>>> Then the define CONFIG_SYS_HZ should not be in every.h since that
>>> suggests that a board developer has some freedom there...
>> Agreed - there are historical reasons this has ever been changable at
>> all.
>>
>>> and MOST IMPORTANT that some implementations of udelay() might
>>> call reset_timer() and therefore break an outer timeout loop !!!
>> Such implementations are inherently broken and need to be fixed.
> Found such in arm926ejs/omap... But then, that timer is multiple-broken:
> relocation broken (uses static data), returns 32 a bit value in get_ticks(),
> returns CONFIG_SYS_HZ in get_tbclk() instead of the rate get_ticks()
> increments...
>
> PXA:
> void udelay_masked (unsigned long usec)
> {
>   unsigned long long tmp;
>   ulong tmo;
>
>   tmo = us_to_tick(usec);
>   tmp = get_ticks() + tmo;/* get current timestamp */
>
>   while (get_ticks()<  tmp)
>   /* loop till event _OR FOREVER is tmp happens to be>  32 bit_ */
>/*NOP*/;
>
> }
>
> unsigned long long get_ticks(void)
> {
>   return readl(OSCR);
> }
> - not any better :( -- its the same code that AT91 had before I fixed it.
>
>>> It is also open if reset_timer() does actually reset the hardware timer
>>> (e.g. tbu/tbl at PPC) - which would be messing up any time difference
>>> calculation using get_ticks() - or does emulate that by remembering
>>> the hardware value and subtracting it later in every subsequent
>>> get_timer() call?
>> This is an implementation detail.
> IF we require that get_ticks() and get_timer() shall not interfere with
> each other and IF both are based on the same hardware timer only the
> second method is available (same if the hardware timer is not easyly
> resettable).
>
>>> 2. get_ticks() and friends operate at a higher rate (tbu/tbl for PPC).
>>> Since they are defined as having 64 bits they MUST not wrap at 32 bits,
>>> i.e. if the hardware provides only 32 bits, the upper 32 bits must be
>>> emulated by software.
>> Right.
>>
>>> Otherwise we have to document that get_ticks() cannot be used to get
>>> 64 bit time differences.
>> No. Such an implementation is broken and needs fixing.
> Original AT91 timer.c was like that, and I think other ARMs where this was
> copied around should be looked at... I don't know when get_timer() became
> 64 bits, but it seems that some implementations just did change the return
> type: uint64 get_timer(void) {return (uint64)timer_val_32;}
Hi All,

  I am pretty sure the migration to 64 bits was caused by 1) people 
not understanding that the timer operating on time DIFFERENCES would 
work fine even if the underlying timer wrapped around (most probable 
problem) and possibly 2) broken timer functions causing bogus timeouts, 
improperly "fixed" by switching to 64 bits.

I think u-boot could get along just fine with only 2 time related 
functions,  uint_32 get_timer(uint_32 base) and udelay(uint 32 delay).  
udelay will only work on "small" values of delay, on the order of 
milliseconds. It is to be used when a "short" but "precise" delay in 
microsecond resolution is required.  Users of get_timer must understand 
that it is only valid if it is called "often enough", i.e. at least once 
per period of the underlying timer. This is required because u-boot  
does not want to rely on interrupts as a timer update method. Therefore, 
all uses of get_timer must 1) call it once initially to get a start 
value, and 2) call get_timer at least once per period  of the underlying 
hardware counter. This underlying period is guaranteed to be at least 
4.29 seconds (32 bit counter at 4 GHz). Note that this does NOT mean 
that the total wait must be less than 4.29 seconds, only that the rate 
at which the elapsed time is TESTED must be adequate.

In order to implement this functionality, at least one hardware timer of 
some kind is required. An additional software "timer" in 1 ms resolution 
may be useful in maintaining the software time. If the hardware timer 
update rate is programmable, u-boot MAY set the update rate on 
initialization On initialization, u-boot MAY reset the hardware timer 
and MAY reset any associated software timer. The hardware timer MAY be 
started on initialization. On each call to get_timer(), u-boot MUST 
start the hardware timer if it was not started already. On calls to 
get_timer, u-boot MUST NOT reset the hardware timer if it was already 
started. The software timer MAY be reset if u-boot can unambiguously 
determine that  more than 4.29 seconds has elapsed since the last call 
to get_timer.

The simplest case for implementing this scheme is if two programmable 
timers exist that can be set to 1ms and 1us.  The timers are initialized 
at start-up, get_timer just returns the 32 bit 1 ms timer and udelay 
just waits for the number of ticks required on the second timer to 
elapse. The most common harder case is where there is only one timer 

Re: [U-Boot] [PATCH v2] mmc: omap: timeout counter fix

2010-10-26 Thread J. William Campbell
On 10/25/2010 11:01 PM, Reinhard Meyer wrote:
> Dear Wolfgang Denk,
>> Dear Reinhard Meyer,
>>
>> In message<4cc66a67.4000...@emk-elektronik.de>   you wrote:
 It fails in case the timer wraps around.

 Assume 32 bit counters, start time = 0xFFF0, delay = 0x20. It
 will compute end = 0x10, the while codition is immediately false, and
 you don't have any delay at all, which most probably generates a
 false error condition.
>>> I used and assumed a 64 bit counter, that will not wrap around while
>>> our civilization still exists...
>>
>> The code is still wrong, and as a simple correct implementation exists
>> there is no excuse for using such incorrect code.
>>
>> Please fix that!
> Agreed here. People are invited to dig through u-boot and find all
> those places.
>
>>> If get_ticks() is only 32 bits worth, both methods will misbehave
>>> at a 32 bit wrap over.
>> No.
>>
start = time();
while ((time() - start)>> True, provided the underlying timer is really 64 bits, otherwise
>>> this fails, too...
>   >  You are wrong. Try for example this:
>   >
>   >  - snip ---
>   >  #include
>   >
>   >  int main(void)
>   >  {
>   >   unsigned int time = 0xFFF0;
>   >   unsigned int delay = 0x20;
>   >   unsigned int start;
>
> You are wrong here, because you take it out of context.
> My "demo" is using the (declared as) 64 bit function get_ticks().
> I mentioned above that this function MUST be truly returning 64
> bits worth of (incrementing) value to make any version work.
> If get_ticks() just returns a 32 bit counter value neither method will work
> reliably. Just check all implementations that this function is implemented
> correctly.
Hi All,
   I have wondered for quite some time about the rush to make 
get_ticks() return a 64 bit value. For any "reasonable" purpose, like 
waiting a few seconds for something to complete, a 32 bit timebase is 
plenty adequate. If the number of ticks per second is 10, i.e. a 
1 GHz clock rate, the clock wraps in a 32 bit word about every five 
seconds.
The trick is that time always moves forward, so a current get_ticks() - 
a previous get_ticks() is ALWAYS a positive number. It is necessary to 
check the clock more often than (0X1 - your_timeout) times per 
second, but unless your timeout is very near the maximum time that fits 
into 32 bits, this won't be a problem. Most CPUs have a counter that 
count at a "reasonable" rate. Some CPUs also have a "cycle counter" that 
runs at the CPU clock rate. These counters are useful to determine 
exactly how many machine cycles a certain process took, and therefore 
they have high resolution. Timers for simple delays neither need nor 
want such resolution. If the only counter available on you CPU runs at 
several GHz, and is 64 bits long,  just shift it right a few bits to 
reduce the resolution to a "reasonable"  resolution and return a 32 bit 
value. There is no need for a bunch of long long variables and extra 
code running around to process simple timeouts. It may be that we need a 
different routine for precision timing measurements with high 
resolution, but it needn't, and probably shouldn't IMHO be get_ticks().

Best Regards,
Bill Campbell
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 2/2] ARM: fix relocation support for onenand device.

2010-10-23 Thread J. William Campbell
On 10/23/2010 1:56 PM, Wolfgang Denk wrote:
> Dear Enric Balletbo i Serra,
>
> In message<1287479602-21721-3-git-send-email-eballe...@iseebcn.com>  you 
> wrote:
>> We also have to relocate the onenand command table manually, otherwise
>> onenand command don't work.
>>
>> Signed-off-by: Enric Balletbo i Serra
>> ---
>>   arch/arm/lib/board.c |3 +++
>>   common/cmd_onenand.c |6 ++
>>   2 files changed, 9 insertions(+), 0 deletions(-)

Is this patch still necessary? I thought the relocation change made it OBE.

Best Regards,
Bill Campbell

> Applied, thanks.
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] mpc83xx: Add -fpic relocation support

2010-10-13 Thread J. William Campbell
  On 10/12/2010 11:30 PM, Albert ARIBAUD wrote:
> Le 12/10/2010 23:00, Joakim Tjernlund a écrit :
>
>> Yes, but the difference isn't really the arch. It is the -mrelocatable
>> flag that is the big difference.
> Not only: obviously, implementing GOT relocation is not done the same on
> both archs, and it simply is not beneficial on ARM wrt PPC in terms of
> instructions. I did a pretty extensive run of tests with and without
> -fPIC and -fPIE on ARM, and GOT relocation clearly makes code bigger,
> whereas it does not PPC.
>
> This simply implies that -fPIC is a better choice for PPC (and hence
> -mrelocatable) while -fpie is a better one for ARM.
Hi All,
  In particular, the PPC takes two 32 bit instructions to load the 
known address of a variable into a register. If the GOT is used, a 
single 32 bit instruction can load the address of a variable from the 
GOT table (pointed to by a "fixed" register) into a register. In both 
cases, there are two memory cycles, but in the GOT case, only one 
instruction is required. This is why the GOT based code is smaller. 
However, the GOT cannot be used to address constants and some other 
items that are not "variables". I do think that -fPIC and -fpie are not 
mutually incompatible. On the PPC, the GOT references would be relocated 
in the loop that updates the GOT and the references to constants would 
be relocated by the ELF relocation code. That is how shared libraries 
are relocated.

Best Regards,
Bill Campbell

> Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] futile c relocation attempt

2010-10-06 Thread J. William Campbell
  On 10/6/2010 2:43 AM, Graeme Russ wrote:
> On 06/10/10 01:48, Reinhard Meyer wrote:
>> ---
>>   arch/arm/cpu/arm926ejs/start.S |8 -
>>   arch/arm/lib/board.c   |   57 
>> +++-
>>   include/configs/top9000_9xe.h  |1 +
>>   3 files changed, 63 insertions(+), 3 deletions(-)
>>
> I had a quick look at this and nothing is jumping out at me. Of course I am
> not familiar with ARM asm...
>
> I don't see any reason why this ultimately will not work eventually. You
> may be having some issues with the transition from asm->C->asm through the
> relocation - This was an especially painful thing for me involving an
> intermediate trampoline which I have only recently figured out how to remove.
>
> Maybe some memory barriers are needed to stop the C optimiser mangling things?
>
> I am sure what you have is very close to the real solution :)
>
> I do think the main relocation fixup loop can be moved into a common
> library in which case we can add additional case statements. The nice thing
> is that x86 as all Type 8 which is specifically allocated to x86 so my "if
Hi All,
   I think that type 8 IS NOT allocated to the 386. For instance, 
R_PPC_ADDR14_BRTAKEN also has a value of 8. So does R_ARM_ABS8. I think 
that there will be a lot of #ifdefs just to keep the references 
symbolic. Many of the different platform  relocation types will come out 
to be the same code in the switch statement, but different symbolic 
names. There will also be some entries that are processor specific and 
have no equivalent on other processors. I think it would be a good idea 
to use the symbolic values of the Relocation types (as opposed to 
integer constants), as it will make the code clearer. There are sort of 
two ways to organize the code inside the switch statement. Since the 
code inside the switch statement is very short, it might be best for 
each architecture (ELF format) to be  bunched together, even at the 
expense of repeating the same executable statements that some other 
formats may use, as follows:
#ifdef PPC
case R_PPC_ADDR32:  /* S + A */

   break;

case R_PPC_RELATIVE:  /* B + A */

   break;
#endif
#ifdef I386
case R_386_32:  /* S + A */ /* I think this is the other 
location type Graeme used, but I could be wrong */

   break;

case R_386_RELATIVE:  /* B + A */

   break;

#endif
#ifdef ARM
case R_ARM_ABS32:  /* S + A */ /* I don't remember the ARM 
relocation types used  */

   break;

case R_ARM_REL32:  /* B + A  - P */

   break;

#endif

Or we could group the various Relocation types by what they actually do:
#ifdef PPC
case R_PPC_ADDR32:  /* S + A */
#endif
#ifdef I386
case R_386_32:  /* S + A */ /* I think this is the other 
location type Graeme used, but I could be wrong */
  #endif
  #ifdef ARM
case R_ARM_ABS32:  /* S + A */ /* I don't remember the ARM 
relocation types used  */
#endif

   break;

#ifdef PPC
case R_PPC_RELATIVE:  /* B + A */
#endif
  #ifdef I386
 case R_386_RELATIVE:  /* B + A */
#endif

   break;

#ifdef ARM
case R_ARM_REL32:  /* B + A  - P */

   break;
#endif

Note that the ARM_REL32 is defined differently than the PPC/I386 
relative, FWIW. I also don't know what to use for the names of the 
binary formats. It would be nice if we could use something already in 
the header files? Thoughts on all this solicited!

Best Regards,
Bill Campbell
>> TEXT_BASE" checks can be kept. For size freaks, we could litter the code
> with #ifdefs to remove un-needed cases ;)
>
> Interestingly, ARM is adding gd->reloc_off while x86 is subtracting
> gd->reloc_off. If this is correct, I need to change the calculation of
> gd_reloc_off to be consistent
Adding is the way the specifications define it. add B+A.

Best Regards,
Bill Campbell
> Regards,
>
> Graeme
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] [PATCH] arm: arm926ejs: use ELF relocations

2010-10-04 Thread J. William Campbell
  On 10/4/2010 10:30 PM, Wolfgang Denk wrote:
> Dear Albert ARIBAUD,
>
> In message<4caa50aa.3000...@free.fr>  you wrote:
>> Remember: this patch only applies to boards which boot from NOR FLASH!
>> You can test it on other types of boards (NAND-based, etc) for
>> regression testing, but nothing more.
> Assuming the NAND loder does not load U-Boot to it's final location
> at the upper end of RAM, but - say - somewhere in lower memory, the
> standard relocation preocess will be running, so I think there should
> be no real difference between (such) NAND booting systems and NOR
> booting ones - or am I missing something?
FWIW I think you are right. If u-boot is linked for the address where 
the NAND loader put it, everything should work fine. It can
size memory, move a copy of u-boot to the top of memory, and branch to 
the entry point that continues initialization.

Bill Campbell
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [RFC] [PATCH] arm: arm926ejs: use ELF relocations

2010-10-04 Thread J. William Campbell
  On 10/4/2010 5:16 PM, Albert ARIBAUD wrote:
> Le 05/10/2010 01:21, Graeme Russ a écrit :
>> On Tue, Oct 5, 2010 at 9:57 AM, Albert ARIBAUD   
>> wrote:
>>> Le 05/10/2010 00:22, Graeme Russ a écrit :
 On Tue, Oct 5, 2010 at 9:01 AM, Albert Aribaud
wrote:
>>> The output from MAKEALL is curiously calculated... If I look at objdumps of
>>> the GOT and ELF binaries, I find that:
>>>
>>> - the GOT .text section is 118960 bytes and the ELF .text section only
>>> 108112. This is due to the fact that GOT relocation requires additional
>>> instruction for GOT indirection whereas ELF relocations work by patching the
>>> code.
>> It would be interesting to compare against the basline non-relocatable
>> version
> I #defined CONFIG_RELOC_FIXUP_WORKS and removed -pie from the ARM
> config.mk. This puts the edminiv2 code in the non-reloc build case, and
> produces identical .text and .data, and almost identical .rodata, as the
> ELF case.
>
>>> - the .rodata section is 22416 for GOT, 22698 for ELF, whereas the .data
>>> section is 2908 for GOT, 2627 for ELF. Some initialized data apparently
>>> moved from non-const ton const for some reason, but basically, initialized
>>> data remains constant.
>>>
>>> - the .bss section remains constant too, 16640 for GOT vs. 16636 for ELF.
>>> I'm not going to track what causes the 4 byte difference. :)
>>>
>>> Many sections are output in the ELF file which do not appear in the GOT
>>> file, such as .interp, .dynamic, .dynstr etc. They probably pollute
>>> MAKEALL's figures.
>> I now discard a few sections:
>>
>>  /DISCARD/ : { *(.dynstr*) }
>>  /DISCARD/ : { *(.dynamic*) }
>>  /DISCARD/ : { *(.plt*) }
>>  /DISCARD/ : { *(.interp*) }
>>  /DISCARD/ : { *(.gnu*) }
>>
>> Not that it makes a huge difference - most of these are trivially small
> Thanks. I'll add this to the .lds as a measure of clarity.
>
>>> That's roughly consistent with the numbers I get: about 19 KB of .rel.dyn
>>> plus .dynsym, which we will be able to cut by half if we preprocess it.
>> Which is not copied to RAM, so not as nasty as the .got related increase
> True also. Note that we could probably shrink the table to 1/4 of its
> current size by taking advantage from the fact that the few
> non-program-base-relative relocations it has can easily be converted to
> program-base-relative, and that two consecutive relocations are always
> less than 64 KB away from each other. Of course that moves away from
> using the ELF structures as-is, and requires additional build steps, but
> people with small FLASH devices may want it.
Hi All,
  This may be pushing it in the more general case. ARM has only a 
few relocation types. Other CPU types have more types, and therefore 
still may need a type field. You can certainly get 1/2 in all cases, and 
more if you are willing to get a bit more complex in the preprocessing. 
That said, I think this is best left to later when all CPUs are in the 
relocatable state.
>> I'm also looking at moving the low-level intialisation and relocation code
>> into a seperate section (outside .text) so I even less to relocate to RAM
> As Wolfgang pointed out, there might be issues in that all the code that
> runs in FLASH should be truly PI, which might not be a piece of cake.
> ARM C code, for instance, tends to generate literals which need to be
> relocated if you don't run the code where it was linked for.
True, but the code WILL be running at the address it was linked for. It 
just won't be copied and relocated to the "new" address, as it would 
never be run again anyway. This goal is along the lines of the "two 
stage u-boot" that has been/is being considered, where all execute only 
once code can be concentrated into a segment that is not moved into ram.

Bill Campbell
>> Then I could even compress the relocatable section, but that is just being
>> silly ;)
> :)
>
>> Regards,
>>
>> Graeme
> Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-04 Thread J. William Campbell

 On 10/4/2010 10:06 AM, Wolfgang Denk wrote:

Dear "J. William Campbell",

In message<4ca9f294.8080...@comcast.net>  you wrote:

Yes, I think Wolfgang is correct. This is not going to be easy to do in
general.  To run anywhere, the code must be true Position Independent
code. If you intend to use any C code in the initialization, this will
result in needing -fPIC for at least that code. I am not sure you can
mix -fPIC and non -fPIC code in the same link, but I expect not. I am a
bit surprised that it is possible to get even the initialization code to
be Position Independent, but it appears that on at least some PPC it is
possible/has been done.

Not really. On PowerPC, only the first 20 or 30 lines of assembler
code in start.S are position independent; then we compute the link
(resp. execution) address and branch to it. From then, we run from the
very address range we were linked for (starting at TEXT_BASE).

Hi Wolfgang,
 You are of course correct. I was referring more to Jocke's 
(joakim.tjernl...@transmode.se) statements regarding:


Yes, that is there today. I am talking about linking to any TEXT_BASE(say 0)
but burn and run into another address. I impl. this quite some time
ago for PPC(search for LINK_OFF)

I understand from his comment that he had achieved total PIC for the 
initialization, that would run at any location regardless
of TEXT_BASE. I think this code was not accepted into mainline, so it is not a 
problem at present. However, any relocation code
added would have to be modified by Jocke if he wished to preserve that 
capability. I am amazed that he was able to get the
rest of u-boot to work under the constraints you pointed out. It must have been 
quite tedious.

  I also wish to support Graeme's desire that the added relocation code at 
the end of the day be written in C. The routine to do the
relocation does not require .bss and is not real long. The obvious advantage of 
this approach is that all architectures can use it. The
ELF relocation codes will have to be changed to the architecture equivalents, 
and in some casesarchitecture specific relocation code
processing added, but the theory will always be the same. This approach will 
make using relocation much easier/trivial for new
architecture  ports, thereby reducing resistance to doing it!

Best Regards,
Bill Campbell



Albert is doing now, and Graeme did before,  is the first option,
creating a loader that understands ELF. This has the advantage that it
will work on all architectures. However, once this understanding is in
place, it would be easy to write a small post-processing program that
would reduce the size of the relocation entries, much like -mrelocatable
does. This may or may not be necessary, but it is certainly possible.

Eventually we might even add -mrelocatable support for the other
architectures to the tool chain.

Best regards,

Wolfgang Denk



___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-04 Thread J. William Campbell
  On 10/4/2010 3:13 AM, Wolfgang Denk wrote:
> Dear Albert ARIBAUD,
>
> In message<4ca999ee.5030...@free.fr>  you wrote:
>> Note however that linking for base address 0 is not mandatory for
>> achieving true position independence. What is required is that the code
>> which runs from power-up until relocation be able to run anywhere, i.e.,
>> this code should not require any relocation fixup. That can be achieved
>> on ARM by using only relative branches and accessing data only relative
>> to pc (e.g. literals) or truly absolute (e.g. HW registers etc).
> That means you need to build all of U-Boot that way, because
> significant parts of the code already run before relocation
> (including all clocks and timers setup, console setup, printf and all
> routines these pull in).
>
Yes, I think Wolfgang is correct. This is not going to be easy to do in 
general.  To run anywhere, the code must be true Position Independent 
code. If you intend to use any C code in the initialization, this will 
result in needing -fPIC for at least that code. I am not sure you can 
mix -fPIC and non -fPIC code in the same link, but I expect not. I am a 
bit surprised that it is possible to get even the initialization code to 
be Position Independent, but it appears that on at least some PPC it is 
possible/has been done.
 On a related topic, I did find some information on the 
-mrelocatable history. Take a look at
http://www.mail-archive.com/g...@gcc.gnu.org/msg02528.html.
If you read both thread entries, it explains -mrelocatable as more or 
less the post-processor that re-formats the ELF relocation information 
into a smaller format and puts it in the text as another segment. What 
Albert is doing now, and Graeme did before,  is the first option, 
creating a loader that understands ELF. This has the advantage that it 
will work on all architectures. However, once this understanding is in 
place, it would be easy to write a small post-processing program that 
would reduce the size of the relocation entries, much like -mrelocatable 
does. This may or may not be necessary, but it is certainly possible.

Best Regards,
Bill Campbell
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-03 Thread J. William Campbell
  On 10/3/2010 11:29 AM, Wolfgang Denk wrote:
> Dear Reinhard Meyer,
>
> In message<4ca79896.2010...@emk-elektronik.de>  you wrote:
>> I agree here. _If_ relocation, it should work without hand-adding
>> fixup stuff to all functions using initialized data with pointers.
>> Even Wolfgang forgot to fixup his 2nd level command table in
>> cmd_nvedit.c ;)
> I didn't forget it - at least not in the sensse that I think this is
> something that needs to be done.
>
> This works fine on PPC with relocation, and we should make it work
> the same on other arches.
>
>> And, for space concerns in flash, relocation should always be an
>> option on a board by board basis...
> NAK.
>
>> And as an idea, if position independent code is used, only pointers
>> in initialized data need adjustment. Cannot the linker emit a table
>> of addresses that need fixing?
> It does. That's the GOT.
I think this is actually a misunderstanding. The purpose of the GOT, at 
least from GCC's point of view, is to hold the absolute addresses of 
private data referenced by shared library code. That is what it was 
invented to do. This is similar to, but not identical with, relocating 
all data references. Initialized data in the library must have a copy 
created (and relocated as necessary if it contains pointers) by the 
runtime linker when the library is initialized in the address space of 
the process using the library. The code in the shared library is -fPIC, 
but it still needs the runtime linker to allocate a copy of the GOT for 
the current user AND to allocate and relocate any data that is required 
for the library that is private to the user. It is that second step 
where we have trouble.

Best Regards,
Bill Campbell
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-03 Thread J. William Campbell
  On 10/3/2010 11:14 AM, Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message<4ca75bfb.5030...@comcast.net>  you wrote:
>>>>> And I think there are more places of this type in u-boot...
>>>> Yes, maybe. But relocation as I did for arm, also works
>>>> on m68k, sparc, mips, avr32 and they must do also this
>>>> fixups, so for common functions (except the new env handling,
>>>> which I think got never tested on this architectures?) should
>>>> work ...
>>> This pointer problem is solved with the fixup relocs on ppc and
>>> should work without manual relocation. I think this is a ppc
>>> only extension but I might be wrong.
>> You are correct that this is a ppc only extension. As such, it is
>> not a good candidate for "general" use.
> On contrary.
>
> If it works for PPC, then there should be ways to do the same on other
> architectures.
>
Well, maybe so, but GCC won't do it now, and there has been no move by 
other architectures to adopt this capability. I suspect that it is 
extremely unlikley that this capability will ever be ported to other 
architectures since it has been available for so long on PPC without any 
movement to other systems.

>> Using the full relocation scheme eliminates the need for all these
>> "fixups" in u-boot C code. I think this is a very desirable result.
>> It is also not clear to me that hard coding in the relocation as several
>> C routines will produce a  u-boot that is "smaller" than the one
>> produced by using normal ELF relocation. However, using full relocation
>> creates an environment that is true "C" and does not rely on people
>> remembering that they may have to fix up some parts of their code. It is
>> hard to see much downside in using the full relocation capability
>> provided by Graeme's code.
> Agreed. But if we take this path, we need to find an implementation
> that looks clean and readable.
Agreed. This should be possible to do now that there is a better 
understanding of the ELF format by the u-boot community. Perhaps the 
place to start would be trying to port what Graeme has done to ARM or 
perhaps better yet, PPC. Since lots of people on this list are PPC 
folks, we should have a lot of leverage there.

Best Regards,
Bill Campbell
>
> Best regards,
>
> Wolfgang Denk
>

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-03 Thread J. William Campbell
  On 10/3/2010 1:58 AM, Albert ARIBAUD wrote:
> Le 03/10/2010 10:44, Graeme Russ a écrit :
>
>>> Bill just said that -pic (or, for ARM, -fPIC or -fPIE) was unnecessary
>>> for relocation. You seem to imply it actually is... In my experience,
>>> -fPIC and-fPIE do increase code by adding GOT relocation to symbols 
>>> that
>>> need fixing, so they would indeed be redundant to any other relocation
>>> mechanism -- I just did some test with basic code and this seems to
>>> confirm, no -fPIx is needed to get relocation the way you do on ARM.
>>
>> Just to clarify -fpic is a compiler option, -pic is a linker option. x86
>> has no compile time relocation options (therefore no referencing .got 
>> etc).
>> Using the link time pic option produces the relocation data table
>> (.rel.dyn) which must be pre-processed before execution can begin at the
>> relocated address
>
> Thanks for clarifying, Graeme.
>
> This is consistent with the ARM compile-time options -fPIC/-fPIE vs 
> link-time option -pie. So there may be at least an interest in 
> investigating ELF-style relocation on ARM and comparing it to 
> GOT-based relocation in terms of FLASH and RAM sizes and code speed.
>
Hi All,
   It is for sure that -fPIC/-fPIE programs will contain more 
executable instructions than programs compiled without these options.
The program will also contain more data space for the got. If -fPIC 
actually produced a fully position-independent executable, the extra 
overhead would perhaps be tolerable. However, since it does not do this, 
(problems with initialized data etc.) there is really no advantage in 
using these compile-time options. The executable code and required data 
space for the program without these switches will "always" be smaller 
and faster than with them. In order to fix the remaining issues even 
when using -fPIC, a relocation loop must exist in the u-boot code, 
either one global one or a bunch of user written specific ones. Also, 
the -pie switch will be needed anyway at link time to build the 
relocation table for the remaining relocation requirements.
   Programs compiled without -fPIC will have a larger .rel.dyn table 
than those compiled with -fPIC. However, the table entries in the 
relocation table occupy about the same storage as the code generated by 
the compiler to relocate a reference to the symbol at run time.  So this 
is probably a almost a wash. Also, the dynamic relocation data need not 
be copied into the run-time object, as it is no longer needed. So the 
likely outcome is that the "flash" image is about the same size/slightly 
larger than the one compiled by -fPIC, and that the ram footprint after 
relocation is slightly smaller.
   If one is REALLY pressed for space, the size of the dynamic 
relocation area can be reduced by a post-processor program that would 
re-format the relocation entries. This re-formatting is possible because 
1) ELF is a very general format and we only need a small subset of it, 
and 2) u-boot code will never occupy say 16 MB of space, so each 
relocation can probably be compressed into a 32 bit word. I doubt anyone 
is that desperate, but it IS possible.
   It will be interesting to see what the results of this comparison 
are. For me, the no user awareness of relocation is worth a lot, and the 
fact that the difference/overhead of relocation will all be in exactly 
one place is very appealing.

Best Regards,
Bill Campbell
>> Cheers,
>>
>> Graeme
>
> Amicalement,

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] ARM relocation, question to Heiko

2010-10-02 Thread J. William Campbell

 On 10/2/2010 3:17 AM, Joakim Tjernlund wrote:

Hello Reinhard,

Reinhard Meyer wrote:

Dear Albert ARIBAUD,

I try to understand how the relocation process could handle pointers (to
functions or other data) in const or data sections.
Your code cannot know what is data and what is a pointer that needs
adjustment?

Best Regards,
Reinhard

Hi Reinhart,

Short answer - the relocation process does not handle pointers inside
data structures.

And yes, this means the content arrays of pointers such as init_sequence
is not relocated. Been there, done that, can give you one of the

The init_sequence should not called anymore after relocation, as it is
the init_sequence ... or?


tee-shirts I got :)

ATM I have not found a way to fix this, except making the code which
uses the pointers aware that the are location-sensitive and fix them
when using them.

That means that things like this cannot work (with relocation),
unless adding the relocation offset before using the pointer:

Yep, you have to fix these pointers after relocation ...


const struct {
const u8 shift;
const u8 idcode;
struct spi_flash *(*probe) (struct spi_slave *spi, u8 *idcode);
} flashes[] = {
#ifdef CONFIG_SPI_FLASH_SPANSION
{ 0, 0x01, spi_flash_probe_spansion, },
#endif

[...]

#ifdef CONFIG_SPI_FRAM_RAMTRON
{ 6, 0xc2, spi_fram_probe_ramtron, },
# ifdef CONFIG_SPI_FRAM_RAMTRON_NON_JEDEC
{ 0, 0xff, spi_fram_probe_ramtron, },
# endif
# undef IDBUF_LEN
# define IDBUF_LEN 9 /* we need to read 6+3 bytes */
#endif
};

And I think there are more places of this type in u-boot...

Yes, maybe. But relocation as I did for arm, also works
on m68k, sparc, mips, avr32 and they must do also this
fixups, so for common functions (except the new env handling,
which I think got never tested on this architectures?) should
work ...

This pointer problem is solved with the fixup relocs on ppc and
should work without manual relocation. I think this is a ppc
only extension but I might be wrong.


Hi All,
  You are correct that this is a ppc only extension. As such, it is 
not a good candidate for "general" use.



I believe that the other alternative is to do it as x86 does
which I think is the general way which should work on any arch.
Graem Russ would know better.

Almost exactly a year ago, this was all pretty much presented by Graeme 
in the threads

Relocation size penalty calculation (October 14, 2009)
i386 Relocation (November 24, 2009)

Using the full relocation scheme eliminates the need for all these 
"fixups" in u-boot C code. I think this is a very desirable result.
It is also not clear to me that hard coding in the relocation as several 
C routines will produce a  u-boot that is "smaller" than the one 
produced by using normal ELF relocation. However, using full relocation 
creates an environment that is true "C" and does not rely on people 
remembering that they may have to fix up some parts of their code. It is 
hard to see much downside in using the full relocation capability 
provided by Graeme's code.
FWIW, the relocation code and data does not have to be moved into ram if 
space is at a premium.


Best Regards,
Bill Campbell


  Jocke

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot




___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] flash.h: pull in common.h for types

2009-11-17 Thread J. William Campbell
Mike Frysinger wrote:
> On Tuesday 17 November 2009 16:56:58 Wolfgang Denk wrote:
>   
>> Scott Wood wrote:
>> 
 My question: is there a definitive position  somewhere  (for  example
 for  the  Linux kernel; I'm sure we don't have one for U-Boot [yet]),
 whether system headers should be self-sufficient?
 
>>> I'd say they should be self-sufficient, in that the inclusion of the
>>> header itself should not fail if I haven't included some arbitrary other
>>> header.  I don't see what the argument would be for not doing this.
>>>   
>> Well, Theo de Raadt says for example "... people would be able to
>> include less files; indeed, almost be careless about what they
>> include. But this would not increase portability in any way. And
>> 'make build' would probably, if it was taken the nth degree, take
>> twice as long. Therefore there is no benefit for the crazy rule you
>> suggest..." - see
>> http://www.mail-archive.com/t...@openbsd.org/msg00425.html
>> 
>
> i disagree with this using, ironically, the same base logic, but a different 
> conclusion:
> http://sourceware.org/ml/libc-alpha/2006-08/msg00064.html
>
> also, i think a self contained system like u-boot which has full control at 
> the api level can be better at this than a user interface which really sits 
> on 
> top of an abi and has to deal with a lot of crap from user code.
>
> while i'm not asking for you or anyone else to audit header paths here as i 
> think that level of enforcement will bog things down, small patches from 
> people who choose to fix things should be merged.
>
>   
FWIW, I think one needs to be very careful with this reasoning. It is 
clear that experienced and capable programmers disagree on the correct 
approach to this problem. It is also true that the logical structure of 
the include chain is important. A "crap" interface is going to be hard 
to maintain no matter how you do it. The problem is, "crap" has a way of 
sneaking into well-designed interfaces in the form of small patches to 
"fix" things. (I am not saying this is one of them). I have observed 
that in any large program, as time goes on, more and more things get 
included in more and more places. Hiding this fact by doing the 
inclusion in other header files often obscures the drift towards a very 
polluted name space, where editing just about any include file requires 
the entire system to be rebuilt. Some would say, so what, it is easy to 
do. However, these situations often result in hard to find bugs. The 
question of "which modules can access this variable" becomes "all of 
them" because of (badly designed?)  interfaces being included in other 
interfaces without the user being explicitly aware of this. Often I see 
in patch critiques the statement "This belongs in a header file", which 
may be true, but can lead to a bunch of un-related things being stuck 
together in a header file just to meet the requirement. Those bad 
choices then get included in other header files and we are off to the 
races. Since the users are not explicitly aware they are pulling in a 
bunch of stuff that has nothing to do with the main purpose of the 
module being written, there is often no incentive to clean up bad header 
designs until things get so bad it is almost impossible to untangle 
things. So when adding an include to an include file, think hard whether 
you are bringing along a lot of things that are "not required" and that 
one would not expect to be exposed in all files using the modified 
include file. If you are bringing along a lot of "baggage", perhaps the 
interface you are including could use some re-work.

Best Regards,
Bill Campbell
>>> I don't know whether Linux has a specific policy on this, but I haven't
>>> noticed many problems in this regard, and when I did find one in the
>>> kernel a few years back I didn't get any argument when I submitted a
>>> patch to fix it.
>>>   
>
> ive semi-frequently post fixes to linux headers so that you can include just 
> that header and have it work.  i have yet to hear anyone complain; rather 
> every one has been merged (ignoring issues unrelated to the original purpose).
>
>   
>>> Which man pages are you looking at?
>>>   
>> Well, for example:
>>
>> open(2):
>>  SYNOPSIS
>> #include 
>> #include 
>> #include 
>>
>> mknod(2):
>>  SYNOPSIS
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> stat(2):
>>  SYNOPSIS
>> #include 
>> #include 
>> #include 
>>
>> Why do we need these lists of #includes? WHy doe - for example -
>>  not auto-include anything it might need?
>>
>> To me this seems to be an indication that there is no intention to
>> make headers self-sufficent, but I am absolutely not sure.
>> 
>
> i'm pretty sure your man page example is an unrelated issue.  the include 
> list 
> does not imply that any one of those headers ca

Re: [U-Boot] Relocation size penalty calculation

2009-10-17 Thread J. William Campbell
Graeme Russ wrote:
> On Thu, Oct 15, 2009 at 3:45 AM, J. William Campbell
>  wrote:
>   
>> Joakim Tjernlund wrote:
>> 
>   


> Apologies if this is getting way off-topic for a simple boot loader, but
> this is information I have gathered from far and wide over the net. I am
> surprised that there isn't a web site out there on 'How to create a
> relocatable boot loader'...
>
> OK, its all starting to come together now - It helps when you look at the
> right files ;)
>
> Firstly, u-boot.map
>
> 0x380589a0__rel_dyn_start = .
>
> .rel.dyn0x380589a0 0x42b0
>  *(.rel.dyn)
>  .rel.got   0x0x0 cpu/i386/start.o
>  .rel.plt   0x0x0 cpu/i386/start.o
>  .rel.text  0x380589a0 0x2e28 cpu/i386/start.o
>  .rel.start16   0x3805b7c8   0x10 cpu/i386/start.o
>  .rel.data  0x3805b7d8  0xc18 cpu/i386/start.o
>  .rel.rodata0x3805c3f0  0x360 cpu/i386/start.o
>  .rel.u_boot_cmd
> 0x3805c750  0x500 cpu/i386/start.o
> 0x3805cc50__rel_dyn_end = .
>
>
> And the output of readelf...
>
> Section Headers:
>   [Nr] Name  TypeAddr OffSize   ES Flg Lk Inf 
> Al
>   [ 0]   NULL 00 00 00  0   0 
>  0
>   [ 1] .text PROGBITS3804 001000 0118a4 00  AX  0   0 
>  4
>   [ 2] .rel.text REL  066c68 005d00 08 40   1 
>  4
>   [ 3] .rodata   PROGBITS380518a4 0128a4 005da5 00   A  0   0 
>  4
>   [ 4] .rel.rodata   REL  06c968 000360 08 40   3 
>  4
>   [ 5] .interp   PROGBITS38057649 018649 13 00   A  0   0 
>  1
>   [ 6] .dynstr   STRTAB  3805765c 01865c 0001ee 00   A  0   0 
>  1
>   [ 7] .hash HASH3805784c 01884c cc 04   A 11   0 
>  4
>   [ 8] .data PROGBITS38057918 018918 000a3c 00  WA  0   0 
>  4
>   [ 9] .rel.data REL  06ccc8 000c18 08 40   8 
>  4
>   [10] .got.plt  PROGBITS38058354 019354 0c 04  WA  0   0 
>  4
>   [11] .dynsym   DYNSYM  38058360 019360 000200 10   A  6   1 
>  4
>   [12] .dynamic  DYNAMIC 38058560 019560 80 08  WA  6   0 
>  4
>   [13] .u_boot_cmd   PROGBITS380585e0 0195e0 0003c0 00  WA  0   0 
>  4
>   [14] .rel.u_boot_cmd   REL  06d8e0 000500 08 40  13 
>  4
>   [15] .bss  NOBITS  3805cc50 01ec50 001a34 00  WA  0   0 
>  4
>   [16] .bios PROGBITS 01e000 00053e 00  AX  0   0 
>  1
>   [17] .rel.bios REL  06dde0 c0 08 40  16 
>  4
>   [18] .rel.dyn  REL 380589a0 0199a0 0042b0 08   A 11   0 
>  4
>   [19] .start16  PROGBITSf800 01e800 000110 00  AX  0   0 
>  1
>   [20] .rel.start16  REL  06dea0 38 08 40  19 
>  4
>   [21] .resetvec PROGBITSfff0 01eff0 10 00  AX  0   0 
>  1
>   [22] .rel.resetvec REL  06ded8 08 08 40  21 
>  4
>
> ...
>
> Relocation section '.rel.text' at offset 0x66c68 contains 2976 entries:
>  Offset InfoTypeSym.Value  Sym. Name
> 38040010  0101 R_386_32  3804   .text
> 3804001e  0101 R_386_32  3804   .text
> 38040028  0101 R_386_32  3804   .text
> 3804003f  0101 R_386_32  3804   .text
> 38040051  0101 R_386_32  3804   .text
> 38040075  0101 R_386_32  3804   .text
> 38040085  0101 R_386_32  3804   .text
> 3804009d  0003e602 R_386_PC32380403fa   load_uboot
> 380400a6  0101 R_386_32  3804   .text
> 38040015  00029f02 R_386_PC323804bdd8   early_board_init
> 38040023  0003f702 R_386_PC323804bdda   show_boot_progress_asm
>
> ...
>
> Relocation section '.rel.rodata' at offset 0x6c968 contains 108 entries:
>  Offset InfoTypeSym.Value  Sym. Name
> 38051908  0201 R_386_32  380518a4   .rodata
> 38051938  0201 R_386_32  380518a4   .rodata
> 38051968  0201 R_386_32  380518a4   .rodata
> 38051998  0201 R_386_32  380518a4   .rodata
> 380519c8  0201 R_386_32  380518a4   .rodata
> 380519f8  0201 R_386_32  380518a4   .rodata
>
> ...
>
> Relocation section '.rel.dyn' at offset 0x199a0 contains 2134 entries:
>  Offset   

Re: [U-Boot] Relocation size penalty calculation

2009-10-14 Thread J. William Campbell
Joakim Tjernlund wrote:
> "J. William Campbell"  wrote on 14/10/2009 
> 17:35:44:
>   
>> Joakim Tjernlund wrote:
>> 
>>> "J. William Campbell"  wrote on 14/10/2009 
>>> 01:48:52:
>>>
>>>   
>>>> Joakim Tjernlund wrote:
>>>>
>>>> 
>>>>> Graeme Russ  wrote on 13/10/2009 22:06:56:
>>>>>
>>>>>
>>>>>
>>>>>   
>>>>>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
>>>>>>  wrote:
>>>>>>
>>>>>>
>>>>>> 
>>>>>>> Graeme Russ  wrote on 13/10/2009 13:21:05:
>>>>>>>
>>>>>>>
>>>>>>>   
>>>>>>>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
>>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> 
>>>>>>>>> Graeme Russ  wrote on 11/10/2009 12:47:19:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   
>>>>>>>> [Massive Snip :)]
>>>>>>>> 
>
> [Yet another SNIP :)]
>
>   
>>>>>>> Evil idea, skip -fpic et. all and add the full reloc procedure
>>>>>>> to relocate by rewriting directly in TEXT segment. Then you save space
>>>>>>> but you need more relocation code. Something like dl_do_reloc from
>>>>>>> uClibc. Wonder how much extra code that would be? Not too much I think.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   
>>>>>> With the following flags
>>>>>>
>>>>>> PLATFORM_RELFLAGS += -fvisibility=hidden
>>>>>> PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm
>>>>>> PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions
>>>>>>
>>>>>> I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think
>>>>>> this might mean I need the symbol table in the binary in order to resolve
>>>>>> them
>>>>>>
>>>>>>
>>>>>> 
>>> BTW, how many relocs do you get compared with -fPIC? I suspect you more
>>> now but hopefully not that many more.
>>>
>>>
>>>   
>>>>> Possibly, but I think you only need to add an offset to all those
>>>>> relocs.
>>>>>
>>>>>
>>>>>   
>>>> Almost right. The relocations specify a symbol value that needs to be
>>>> added to the data in memory to relocate the reference. The symbol values
>>>> involved should be the start of the text section for program references,
>>>> the start of the uninitialized data section for bss references, and the
>>>> start of the data section for initialized data and constants. So there
>>>> are about four symbols whose value you need to keep. Take a look at
>>>> http://refspecs.freestandards.org/elf/elf.pdf (which you have probably
>>>> already looked at) and it tells you what to do with R_386_PC32 ad
>>>> R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded
>>>> will remove all the symbols you don't actually need, but I don't know
>>>> that for sure. Note also that you can change the section flags of a
>>>> section marked noload  to load.
>>>>
>>>> 
>>> Still think you can get away with just ADDING an offset. The image is 
>>> linked to a
>>> specific address and then you move the whole image to a new address. 
>>> Therefore
>>> you should be able to read the current address, add offset, write back the
>>>   
>> new address.
>> 
>>> Normally one do what you describe but here we know that the whole img has 
>>> moved so
>>> we don't have to do calculate the new address from scratch.
>>>
>>>   
>> If the addresses of the bss, text, and data segments change by the same
>> value, I think you are correct. However, if the text and data/bss
>> segments are moved by different offsets, naturally the relocations would
>> be different. One reason to retain this capability would be to allow the
>> u-

Re: [U-Boot] Relocation size penalty calculation

2009-10-14 Thread J. William Campbell
Joakim Tjernlund wrote:
> Graeme Russ  wrote on 14/10/2009 13:48:27:
>   
>> On Wed, Oct 14, 2009 at 6:25 PM, Joakim Tjernlund
>>  wrote:
>> 
>>> "J. William Campbell"  wrote on 14/10/2009 
>>> 01:48:52:
>>>   
>>>> Joakim Tjernlund wrote:
>>>> 
>>>>> Graeme Russ  wrote on 13/10/2009 22:06:56:
>>>>>
>>>>>
>>>>>   
>>>>>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
>>>>>>  wrote:
>>>>>>
>>>>>> 
>>>>>>> Graeme Russ  wrote on 13/10/2009 13:21:05:
>>>>>>>
>>>>>>>   
>>>>>>>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
>>>>>>>>  wrote:
>>>>>>>>
>>>>>>>> 
>>>>>>>>> Graeme Russ  wrote on 11/10/2009 12:47:19:
>>>>>>>>>
>>>>>>>>>   
>>>>>>>> [Massive Snip :)]
>>>>>>>>
>>>>>>>>
>>>>>>>> 
>>>>>>>>>> So, all that is left are .dynsym and .dynamic ...
>>>>>>>>>>   .dynsym
>>>>>>>>>> - Contains 70 entries (16 bytes each, 1120 bytes)
>>>>>>>>>> - 44 entries mimic those entries in .got which are not relocated
>>>>>>>>>> - 21 entries are the remaining symbols exported from the linker
>>>>>>>>>>   script
>>>>>>>>>> - 4 entries are labels defined in inline asm and used in C
>>>>>>>>>>
>>>>>>>>>> 
>>>>>>>>> Try adding proper asm declarations. Look at what gcc
>>>>>>>>> generates for a function/variable and mimic these.
>>>>>>>>>
>>>>>>>>>   
>>>>>>>> Thanks - Now .dynsym contains only exports from the linker script
>>>>>>>>
>>>>>>>> 
>>>>>>> :)
>>>>>>>
>>>>>>>   
>>>>>>>>>> - 1 entry is a NULL entry
>>>>>>>>>>
>>>>>>>>>>   .dynamic
>>>>>>>>>> - 88 bytes
>>>>>>>>>> - Array of Elf32_Dyn
>>>>>>>>>> - typedef struct {
>>>>>>>>>>   Elf32_Sword d_tag;
>>>>>>>>>>   union {
>>>>>>>>>>   Elf32_Word  d_val;
>>>>>>>>>>   Elf32_Addr  d_ptr;
>>>>>>>>>>   } d_un;
>>>>>>>>>>   } Elf32_Dyn;
>>>>>>>>>> - 0x11 entries
>>>>>>>>>>   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
>>>>>>>>>>   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
>>>>>>>>>>   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
>>>>>>>>>>   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
>>>>>>>>>>   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
>>>>>>>>>>   [05] 0x000B, 0x0010 DT_SYMENT, ???
>>>>>>>>>>   [06] 0x0015, 0x DT_DEBUG, ???
>>>>>>>>>>   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
>>>>>>>>>>   [08] 0x0012, 0x14D8 DT_RELSZ, ???
>>>>>>>>>>
>>>>>>>>>> 
>>>>>>>>> How big DT_REL is
>>>>>>>>>
>>>>>>>>>   
>>>>>>>>>>   [09] 0x0013, 0x0008 DT_RELENT, ???
>>>>>>>>>>
>>>>>>>>>> 
>>>>>>>>> hmm, cannot remeber :)
>>>>>>>>>
>>>>>>>>>   
>>>>>>>> How big an entry in DT_REL is
>>>>>

Re: [U-Boot] Relocation size penalty calculation

2009-10-14 Thread J. William Campbell
Joakim Tjernlund wrote:
> "J. William Campbell"  wrote on 14/10/2009 
> 01:48:52:
>   
>> Joakim Tjernlund wrote:
>> 
>>> Graeme Russ  wrote on 13/10/2009 22:06:56:
>>>
>>>
>>>   
>>>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
>>>>  wrote:
>>>>
>>>> 
>>>>> Graeme Russ  wrote on 13/10/2009 13:21:05:
>>>>>
>>>>>   
>>>>>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
>>>>>>  wrote:
>>>>>>
>>>>>> 
>>>>>>> Graeme Russ  wrote on 11/10/2009 12:47:19:
>>>>>>>
>>>>>>>   
>>>>>> [Massive Snip :)]
>>>>>>
>>>>>>
>>>>>> 
>>>>>>>> So, all that is left are .dynsym and .dynamic ...
>>>>>>>>   .dynsym
>>>>>>>> - Contains 70 entries (16 bytes each, 1120 bytes)
>>>>>>>> - 44 entries mimic those entries in .got which are not relocated
>>>>>>>> - 21 entries are the remaining symbols exported from the linker
>>>>>>>>   script
>>>>>>>> - 4 entries are labels defined in inline asm and used in C
>>>>>>>>
>>>>>>>> 
>>>>>>> Try adding proper asm declarations. Look at what gcc
>>>>>>> generates for a function/variable and mimic these.
>>>>>>>
>>>>>>>   
>>>>>> Thanks - Now .dynsym contains only exports from the linker script
>>>>>>
>>>>>> 
>>>>> :)
>>>>>
>>>>>   
>>>>>>>> - 1 entry is a NULL entry
>>>>>>>>
>>>>>>>>   .dynamic
>>>>>>>> - 88 bytes
>>>>>>>> - Array of Elf32_Dyn
>>>>>>>> - typedef struct {
>>>>>>>>   Elf32_Sword d_tag;
>>>>>>>>   union {
>>>>>>>>   Elf32_Word  d_val;
>>>>>>>>   Elf32_Addr  d_ptr;
>>>>>>>>   } d_un;
>>>>>>>>   } Elf32_Dyn;
>>>>>>>> - 0x11 entries
>>>>>>>>   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
>>>>>>>>   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
>>>>>>>>   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
>>>>>>>>   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
>>>>>>>>   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
>>>>>>>>   [05] 0x000B, 0x0010 DT_SYMENT, ???
>>>>>>>>   [06] 0x0015, 0x DT_DEBUG, ???
>>>>>>>>   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
>>>>>>>>   [08] 0x0012, 0x14D8 DT_RELSZ, ???
>>>>>>>>
>>>>>>>> 
>>>>>>> How big DT_REL is
>>>>>>>
>>>>>>>   
>>>>>>>>   [09] 0x0013, 0x0008 DT_RELENT, ???
>>>>>>>>
>>>>>>>> 
>>>>>>> hmm, cannot remeber :)
>>>>>>>
>>>>>>>   
>>>>>> How big an entry in DT_REL is
>>>>>>
>>>>>> 
>>>>> Right, how could I forget :)
>>>>>
>>>>>   
>>>>>>>>   [0a] 0x0016, 0x DT_TEXTREL, ???
>>>>>>>>
>>>>>>>> 
>>>>>>> Oops, you got text relocations. This is generally a bad thing.
>>>>>>> TEXTREL is commonly caused by asm code that arent truly pic so it needs
>>>>>>> to modify the .text segment to adjust for relocation.
>>>>>>> You should get rid of this one. Look for DT_TEXTREL in .o files to find
>>>>>>> the culprit.
>>>>>>>
>>>>>>>
>>>>>>>   
>>>>

Re: [U-Boot] Relocation size penalty calculation

2009-10-13 Thread J. William Campbell
Joakim Tjernlund wrote:
> Graeme Russ  wrote on 13/10/2009 22:06:56:
>
>   
>> On Tue, Oct 13, 2009 at 10:53 PM, Joakim Tjernlund
>>  wrote:
>> 
>>> Graeme Russ  wrote on 13/10/2009 13:21:05:
>>>   
 On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
  wrote:
 
> Graeme Russ  wrote on 11/10/2009 12:47:19:
>   
 [Massive Snip :)]

 
>> So, all that is left are .dynsym and .dynamic ...
>>   .dynsym
>> - Contains 70 entries (16 bytes each, 1120 bytes)
>> - 44 entries mimic those entries in .got which are not relocated
>> - 21 entries are the remaining symbols exported from the linker
>>   script
>> - 4 entries are labels defined in inline asm and used in C
>> 
> Try adding proper asm declarations. Look at what gcc
> generates for a function/variable and mimic these.
>   
 Thanks - Now .dynsym contains only exports from the linker script
 
>>> :)
>>>   
>> - 1 entry is a NULL entry
>>
>>   .dynamic
>> - 88 bytes
>> - Array of Elf32_Dyn
>> - typedef struct {
>>   Elf32_Sword d_tag;
>>   union {
>>   Elf32_Word  d_val;
>>   Elf32_Addr  d_ptr;
>>   } d_un;
>>   } Elf32_Dyn;
>> - 0x11 entries
>>   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
>>   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
>>   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
>>   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
>>   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
>>   [05] 0x000B, 0x0010 DT_SYMENT, ???
>>   [06] 0x0015, 0x DT_DEBUG, ???
>>   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
>>   [08] 0x0012, 0x14D8 DT_RELSZ, ???
>> 
> How big DT_REL is
>   
>>   [09] 0x0013, 0x0008 DT_RELENT, ???
>> 
> hmm, cannot remeber :)
>   
 How big an entry in DT_REL is
 
>>> Right, how could I forget :)
>>>   
>>   [0a] 0x0016, 0x DT_TEXTREL, ???
>> 
> Oops, you got text relocations. This is generally a bad thing.
> TEXTREL is commonly caused by asm code that arent truly pic so it needs
> to modify the .text segment to adjust for relocation.
> You should get rid of this one. Look for DT_TEXTREL in .o files to find
> the culprit.
>
>   
 Alas I cannot - The relocations are a result of loading a register with a
 return address when calling show_boot_progress in the very early stages of
 initialisation prior to the stack becoming available. The x86 does not
 allow direct access to the IP so the only way to find the 'current
 execution address' is to 'call' to the next instruction and pop the return
 address off the stack
 
>>> hmm, same as ppc but that in it self should not cause a TEXREL, should it?
>>> Ahh, the 'call' is absolute, not relative? I guess there is some way around 
>>> it
>>> but it is not important ATM I guess.
>>>
>>> Evil idea, skip -fpic et. all and add the full reloc procedure
>>> to relocate by rewriting directly in TEXT segment. Then you save space
>>> but you need more relocation code. Something like dl_do_reloc from
>>> uClibc. Wonder how much extra code that would be? Not too much I think.
>>>
>>>   
>> With the following flags
>>
>> PLATFORM_RELFLAGS += -fvisibility=hidden
>> PLATFORM_CPPFLAGS += -fno-dwarf2-cfi-asm
>> PLATFORM_LDFLAGS += -pic --emit-relocs -Bsymbolic -Bsymbolic-functions
>>
>> I get no .got, but a lot of R_386_PC32 and R_386_32 relocations. I think
>> this might mean I need the symbol table in the binary in order to resolve
>> them
>> 
>
> Possibly, but I think you only need to add an offset to all those
> relocs.
>   
Almost right. The relocations specify a symbol value that needs to be 
added to the data in memory to relocate the reference. The symbol values 
involved should be the start of the text section for program references, 
the start of the uninitialized data section for bss references, and the 
start of the data section for initialized data and constants. So there 
are about four symbols whose value you need to keep. Take a look at 
http://refspecs.freestandards.org/elf/elf.pdf (which you have probably 
already looked at) and it tells you what to do with R_386_PC32 ad 
R_386_32 relocations. Hopefully the objcopy with the --strip-unneeded 
will remove all the symbols you don't actually need, but I don't know 
that for sure. Note also that you can change the section flags of a 
section marked noload  to load.

Best Regards,
Bill Campbell
>   Jokce
>
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> ht

Re: [U-Boot] Relocation size penalty calculation

2009-10-13 Thread J. William Campbell
Joakim Tjernlund wrote:
> Graeme Russ  wrote on 13/10/2009 13:21:05:
>   
>> On Sun, Oct 11, 2009 at 11:51 PM, Joakim Tjernlund
>>  wrote:
>> 
>>> Graeme Russ  wrote on 11/10/2009 12:47:19:
>>>   
>> [Massive Snip :)]
>>
>> 
 So, all that is left are .dynsym and .dynamic ...
   .dynsym
 - Contains 70 entries (16 bytes each, 1120 bytes)
 - 44 entries mimic those entries in .got which are not relocated
 - 21 entries are the remaining symbols exported from the linker
   script
 - 4 entries are labels defined in inline asm and used in C
 
>>> Try adding proper asm declarations. Look at what gcc
>>> generates for a function/variable and mimic these.
>>>   
>> Thanks - Now .dynsym contains only exports from the linker script
>> 
> :)
>   
 - 1 entry is a NULL entry

   .dynamic
 - 88 bytes
 - Array of Elf32_Dyn
 - typedef struct {
   Elf32_Sword d_tag;
   union {
   Elf32_Word  d_val;
   Elf32_Addr  d_ptr;
   } d_un;
   } Elf32_Dyn;
 - 0x11 entries
   [00] 0x0010, 0x DT_SYMBOLIC, (ignored)
   [01] 0x0004, 0x38059994 DT_HASH, points to .hash
   [02] 0x0005, 0x380595AB DT_STRTAB, points to .dynstr
   [03] 0x0006, 0x3805BDCC DT_SYMTAB, points to .dynsym
   [04] 0x000A, 0x03E6 DT_STRSZ, size of .dynstr
   [05] 0x000B, 0x0010 DT_SYMENT, ???
   [06] 0x0015, 0x DT_DEBUG, ???
   [07] 0x0011, 0x3805A8F4 DT_REL, points to .rel.text
   [08] 0x0012, 0x14D8 DT_RELSZ, ???
 
>>> How big DT_REL is
>>>   
   [09] 0x0013, 0x0008 DT_RELENT, ???
 
>>> hmm, cannot remeber :)
>>>   
>> How big an entry in DT_REL is
>> 
>
> Right, how could I forget :)
>   
   [0a] 0x0016, 0x DT_TEXTREL, ???
 
>>> Oops, you got text relocations. This is generally a bad thing.
>>> TEXTREL is commonly caused by asm code that arent truly pic so it needs
>>> to modify the .text segment to adjust for relocation.
>>> You should get rid of this one. Look for DT_TEXTREL in .o files to find
>>> the culprit.
>>>
>>>   
>> Alas I cannot - The relocations are a result of loading a register with a
>> return address when calling show_boot_progress in the very early stages of
>> initialisation prior to the stack becoming available. The x86 does not
>> allow direct access to the IP so the only way to find the 'current
>> execution address' is to 'call' to the next instruction and pop the return
>> address off the stack
>> 
>
> hmm, same as ppc but that in it self should not cause a TEXREL, should it?
> Ahh, the 'call' is absolute, not relative? I guess there is some way around it
> but it is not important ATM I guess.
>
> Evil idea, skip -fpic et. all and add the full reloc procedure
> to relocate by rewriting directly in TEXT segment. Then you save space
> but you need more relocation code. Something like dl_do_reloc from
> uClibc. Wonder how much extra code that would be? Not too much I think.
>   
I think this approach will turn out to be a big win. At present, the 
problem with just using the relocs is that objcopy is stripping them out 
when u-boot.bin is created, as I understand it. It seems this can be 
solved by changing the command switches appropriately, like using 
--strip-unneeded. In any case, there is some combination of switches 
that will preserve the relocation data. The executable code will get 
smaller, there will be no .got, and the relocation data will be larger 
(than with -fpic). In total size, it probably will be slightly smaller, 
but that is a guess. The most important benefit of this approach is that 
it will work for all architectures, thereby solving the problem once and 
forever! Even if the result is a bit larger, the RAM footprint will be 
reduced by the smaller object code size (since the relocation data need 
not be copied into ram).Having this approach as an option would be real 
nice, since it would always "just work".

Best Regards,
Bill Campbell
>   
>> This is not a problem because this is very low-level init that is not
>> called once relocated into RAM - These relocations can be safely ignored
>> 
>
>   
   [0b] 0x6FFA, 0x0236 ???, Entries in .rel.dyn
   [0c] 0x, 0x DT_NULL, End of Array
   [0d] 0x, 0x DT_NULL, End of Array
   [0e] 0x, 0x DT_NULL, End of Array
   [0f] 0x, 0x DT_NULL, End of Array
   [10] 0x, 0x DT_NULL, End of Array

 I think some more investigation into the need for .dynsym and .dynamic is
 still required...
 
>> .dynsym may still be required if only for accessing the __u_boot_cmd
>> structure. However, I may be able to 

Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Joakim Tjernlund wrote:
>> On Fri, Oct 9, 2009 at 9:27 AM, J. William Campbell
>>  wrote:
>> 
>>> Graeme Russ wrote:
>>>   
>>>> On Fri, Oct 9, 2009 at 2:58 AM, J. William Campbell
>>>>  wrote:
>>>>
>>>> 
>>>>> Graeme Russ wrote:
>>>>>
>>>>>   
>>>>>> Out of curiosity, I wanted to see just how much of a size penalty I am
>>>>>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
>>>>>> the results (fixed width font will help - its space, not tab,
>>>>>> formatted):
>>>>>>
>>>>>> Section non-reloc reloc
>>>>>> ---
>>>>>> .text000118c4  000137fc <- 0x1f38 bytes (~8kB) bigger
>>>>>> .rodata  5bad  59d0
>>>>>> .interp  n/a   0013
>>>>>> .dynstr  n/a   0648
>>>>>> .hashn/a   0428
>>>>>> .eh_frame3268  34fc
>>>>>> .data0a6c  01dc
>>>>>> .data.reln/a   0098
>>>>>> .data.rel.ro.local   n/a   0178
>>>>>> .data.rel.local  n/a   07e4
>>>>>> .got   01f0
>>>>>> .got.plt n/a   000c
>>>>>> .rel.got n/a   03e0
>>>>>> .rel.dyn n/a   1228
>>>>>> .dynsym  n/a   0850
>>>>>> .dynamic n/a   0080
>>>>>> .u_boot_cmd  03c0  03c0
>>>>>> .bss 1a34  1a34
>>>>>> .realmode0166  0166
>>>>>> .bios053e  053e
>>>>>> ===
>>>>>> Total0001d5dd  00022287 <- 0x4caa bytes (~19kB) bigger
>>>>>>
>>>>>> Its more than a 16% increase in size!!!
>>>>>>
>>>>>> .text accounts for a little under half of the total bloat, and of that,
>>>>>> the crude dynamic loader accounts for only 341 bytes
>>>>>>
>>>>>>
>>>>>> 
>>>>> Hi Graeme,
>>>>> I would be interested in a third option (column), the x86 build with
>>>>> just -mrelocateable but NOT -fpic. It will not be definitive because
>>>>> there
>>>>> will be extra code that references the GOT and missing code to do some of
>>>>> the relocation, but it would still be interesting.
>>>>>
>>>>>   
>>>> x86 does not have -mrelocatable. This is a PPC only option :(
>>>>
>>>> 
>>> Hi Graeme,
>>>  You are unfortunately correct. However, I wonder if we can get
>>> essentially the same result by executing the final ld step with the
>>> --emit-relocs switch included. This may also include some "extra" sections
>>> that we would want to strip out, but if it works, it could give all
>>> ELF-based systems a way to a relocatable u-boot.
>>>
>>>   
>> I don't think --emit-relocs is necessary with -pic. I haven't gone through
>> all the permutations to see if there is a smaller option, but gcc -fpic and
>> ld -pie creates enough information to perform relocation on the x86
>> platform
>> 
>
>   
It is true that --emit-relocs is not required when -pic and -pie are 
used instead. However, pic and pie are designed to allow shared code 
(libraries)  to appear at different logical addresses in several 
programs without altering the text. This is grand overkill for what we 
need, which is the ability to relocate the code. The -pic and -pie code 
will be larger than the code without pic and pie. How much larger is a 
good question. On the PPC, it is larger but not much larger, because 
there are lots of registers available and one is almost for sure got (no 
pun intended) the magic relocation constant(s) in it. On the 386 with 
many fewer registers, pic and pie will cause the code to be 
percentage-wise larger than on the PPC. Thus avoiding pic and pie is a 
Good Thing in most cases.
> Try -fvisibility=hidden
>   
I assume the -fvisibility=hidden is suggested in order to reduce 
(eliminate) the symbol table from the output, which we don't need 
because there are assumed to be no undefined symbols in our final ld. If 
that works, great! I was assuming we might need a custom "strip" program 
to delete any sections that we don't need, but this sounds easier if it 
gets them all.

Best Regards,
Bill Campbell
>  Jocke
>
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Graeme Russ wrote:
> On Fri, Oct 9, 2009 at 2:58 AM, J. William Campbell
>  wrote:
>   
>> Graeme Russ wrote:
>> 
>>> Out of curiosity, I wanted to see just how much of a size penalty I am
>>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
>>> the results (fixed width font will help - its space, not tab, formatted):
>>>
>>> Section non-reloc reloc
>>> ---
>>> .text000118c4  000137fc <- 0x1f38 bytes (~8kB) bigger
>>> .rodata  5bad  59d0
>>> .interp  n/a   0013
>>> .dynstr  n/a   0648
>>> .hashn/a   0428
>>> .eh_frame3268  34fc
>>> .data0a6c  01dc
>>> .data.reln/a   0098
>>> .data.rel.ro.local   n/a   0178
>>> .data.rel.local  n/a   07e4
>>> .got   01f0
>>> .got.plt n/a   000c
>>> .rel.got n/a   03e0
>>> .rel.dyn n/a   1228
>>> .dynsym  n/a   0850
>>> .dynamic n/a   0080
>>> .u_boot_cmd  03c0  03c0
>>> .bss 1a34  1a34
>>> .realmode0166  0166
>>> .bios053e  053e
>>> ===
>>> Total0001d5dd  00022287 <- 0x4caa bytes (~19kB) bigger
>>>
>>> Its more than a 16% increase in size!!!
>>>
>>> .text accounts for a little under half of the total bloat, and of that,
>>> the crude dynamic loader accounts for only 341 bytes
>>>
>>>   
>> Hi Graeme,
>>  I would be interested in a third option (column), the x86 build with
>> just -mrelocateable but NOT -fpic. It will not be definitive because there
>> will be extra code that references the GOT and missing code to do some of
>> the relocation, but it would still be interesting.
>> 
>
> x86 does not have -mrelocatable. This is a PPC only option :(
>   
Hi Graeme,
   You are unfortunately correct. However, I wonder if we can 
get essentially the same result by executing the final ld step with the 
--emit-relocs switch included. This may also include some "extra" 
sections that we would want to strip out, but if it works, it could give 
all ELF-based systems a way to a relocatable u-boot.

Best Regards,
Bill Campbell
**
>
>   
>> Best Regards,
>> Bill Campbell
>> 
>>> Have any metrics been done for PPC?
>>>
>>> Regards,
>>>
>>> Graeme
>>>   
>
> Once the reloc branch has been merged, how many arches are left which do
> not support relocation?
>
> Regards,
>
> Graeme
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Peter Tyser wrote:
> On Thu, 2009-10-08 at 08:53 -0700, J. William Campbell wrote:
>   
>> Peter Tyser wrote:
>> 
>>> On Thu, 2009-10-08 at 22:54 +1100, Graeme Russ wrote:
>>>   
>>>   
>>>> Out of curiosity, I wanted to see just how much of a size penalty I am
>>>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
>>>> the results (fixed width font will help - its space, not tab, formatted):
>>>>
>>>> Section non-reloc reloc
>>>> ---
>>>> .text000118c4  000137fc <- 0x1f38 bytes (~8kB) bigger
>>>> .rodata  5bad  59d0
>>>> .interp  n/a   0013
>>>> .dynstr  n/a   0648
>>>> .hashn/a   0428
>>>> .eh_frame3268  34fc
>>>> .data0a6c  01dc
>>>> .data.reln/a   0098
>>>> .data.rel.ro.local   n/a   0178
>>>> .data.rel.local  n/a   07e4
>>>> .got   01f0
>>>> .got.plt n/a   000c
>>>> .rel.got n/a   03e0
>>>> .rel.dyn n/a   1228
>>>> .dynsym  n/a   0850
>>>> .dynamic n/a   0080
>>>> .u_boot_cmd  03c0  03c0
>>>> .bss 1a34  1a34
>>>> .realmode0166  0166
>>>> .bios053e  053e
>>>> ===
>>>> Total0001d5dd  00022287 <- 0x4caa bytes (~19kB) bigger
>>>>
>>>> Its more than a 16% increase in size!!!
>>>>
>>>> .text accounts for a little under half of the total bloat, and of that,
>>>> the crude dynamic loader accounts for only 341 bytes
>>>>
>>>> Have any metrics been done for PPC?
>>>> 
>>>> 
>>> Things actually improve a little bit when we use -mrelocatable and get
>>> rid of all the manual "+= gd->reloc_off" fixups:
>>>
>>> 1) Top of mainline on XPedite5370:
>>>textdata bss dec hex filename
>>>  308612   24488   33172  366272   596c0 u-boot
>>>
>>> 2) Top of "reloc" branch on XPedite5370 (ie -mrelocatable):
>>>textdata bss dec hex filename
>>>  303704   28644   33156  365504   593c0 u-boot
>>>
>>>   
>>>   
>> Hi Peter,
>>  Just to be clear, the total text+data length of u-boot with the 
>> "manual" relocations (#1)  is LARGER than the text+data length of u-boot 
>> with the "manual" relocations removed and the necessary centralized 
>> relocation code added, along with any additional data sections required 
>> by -mrelocateable (#2), by 768 (dec) bytes?
>> 
>
> Hi Bill,
> Doah, looks like I chose a bad board as an example.  The XPedite5370
> already had -mrelocatable defined in its own
> board/xes/xpedite5370/config.mk in mainline, so the above comparison
> should be ignored as both builds used -mrelocatable.
>
> Here's some *real* results from the MPC8548CDS:
> 1) Top of mainline:
>text  data bss dec hex filename
>  219968 17052   22992  260012   3f7ac u-boot
>
> 2) Top of "reloc" branch (ie -mrelocatable)
>text  data bss dec hex filename
>  219192 20640   22980  262812   4029c u-boot
>
> So the reloc branch is 2.7K bigger for the MPC8548CDS.
>   
Hi Peter,
 OK, that's more like it! A 1.2 % size increase in ROM seems like a 
very small price to pay for a truly relocatable u-boot image that will 
run on any size memory without the programmer having to actively worry 
about what may need relocating as code is written. . Also, it should be 
noted that the size increase in 2)  is mostly in relocation segments 
that do not need to be copied into ram, so the ram footprint should be 
smaller for 2) than 1). The relocation code itself could also be placed 
is a segment that is not copied into ram, although that may be more 
trouble than it is worth.
   I am looking forward to Graeme's results with the 386. I expect 
that it will not be quite so favorable, perhaps a 4 or 5% size increase 
for -mrelocatable over an absolute build. However, -mrelocatable vs. 
-fpic may be comparable, with -mrelocatable actually winning. But then 
again, I could be totally wrong!

Best Regards,
Bill Campbell
> Best,
> Peter
>
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Graeme Russ wrote:
> Out of curiosity, I wanted to see just how much of a size penalty I am
> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
> the results (fixed width font will help - its space, not tab, formatted):
>
> Section non-reloc reloc
> ---
> .text000118c4  000137fc <- 0x1f38 bytes (~8kB) bigger
> .rodata  5bad  59d0
> .interp  n/a   0013
> .dynstr  n/a   0648
> .hashn/a   0428
> .eh_frame3268  34fc
> .data0a6c  01dc
> .data.reln/a   0098
> .data.rel.ro.local   n/a   0178
> .data.rel.local  n/a   07e4
> .got   01f0
> .got.plt n/a   000c
> .rel.got n/a   03e0
> .rel.dyn n/a   1228
> .dynsym  n/a   0850
> .dynamic n/a   0080
> .u_boot_cmd  03c0  03c0
> .bss 1a34  1a34
> .realmode0166  0166
> .bios053e  053e
> ===
> Total0001d5dd  00022287 <- 0x4caa bytes (~19kB) bigger
>
> Its more than a 16% increase in size!!!
>
> .text accounts for a little under half of the total bloat, and of that,
> the crude dynamic loader accounts for only 341 bytes
>   
Hi Graeme,
   I would be interested in a third option (column), the x86 build 
with just -mrelocateable but NOT -fpic. It will not be definitive 
because there will be extra code that references the GOT and missing 
code to do some of the relocation, but it would still be interesting.

Best Regards,
Bill Campbell
> Have any metrics been done for PPC?
>
> Regards,
>
> Graeme
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Relocation size penalty calculation

2009-10-08 Thread J. William Campbell
Peter Tyser wrote:
> On Thu, 2009-10-08 at 22:54 +1100, Graeme Russ wrote:
>   
>> Out of curiosity, I wanted to see just how much of a size penalty I am
>> incurring by using gcc -fpic / ld -pic on my x86 u-boot build. Here are
>> the results (fixed width font will help - its space, not tab, formatted):
>>
>> Section non-reloc reloc
>> ---
>> .text000118c4  000137fc <- 0x1f38 bytes (~8kB) bigger
>> .rodata  5bad  59d0
>> .interp  n/a   0013
>> .dynstr  n/a   0648
>> .hashn/a   0428
>> .eh_frame3268  34fc
>> .data0a6c  01dc
>> .data.reln/a   0098
>> .data.rel.ro.local   n/a   0178
>> .data.rel.local  n/a   07e4
>> .got   01f0
>> .got.plt n/a   000c
>> .rel.got n/a   03e0
>> .rel.dyn n/a   1228
>> .dynsym  n/a   0850
>> .dynamic n/a   0080
>> .u_boot_cmd  03c0  03c0
>> .bss 1a34  1a34
>> .realmode0166  0166
>> .bios053e  053e
>> ===
>> Total0001d5dd  00022287 <- 0x4caa bytes (~19kB) bigger
>>
>> Its more than a 16% increase in size!!!
>>
>> .text accounts for a little under half of the total bloat, and of that,
>> the crude dynamic loader accounts for only 341 bytes
>>
>> Have any metrics been done for PPC?
>> 
>
> Things actually improve a little bit when we use -mrelocatable and get
> rid of all the manual "+= gd->reloc_off" fixups:
>
> 1) Top of mainline on XPedite5370:
>text  data bss dec hex filename
>  308612 24488   33172  366272   596c0 u-boot
>
> 2) Top of "reloc" branch on XPedite5370 (ie -mrelocatable):
>text  data bss dec hex filename
>  303704 28644   33156  365504   593c0 u-boot
>
>   
Hi Peter,
 Just to be clear, the total text+data length of u-boot with the 
"manual" relocations (#1)  is LARGER than the text+data length of u-boot 
with the "manual" relocations removed and the necessary centralized 
relocation code added, along with any additional data sections required 
by -mrelocateable (#2), by 768 (dec) bytes? And both cases (1 and 2) 
work equivalently?

Best Regards,
Bill Campbell.
> For fun:
> 3) #2 but with s/-mrelocatable/-fpic/ (probably doesn't boot):
>text  data bss dec hex filename
>  303704 24472   33156  361332   58374 u-boot
>
>
> There may be some other changes that affect the size between mainline
> and "reloc", but their sizes are in the same general ballpark.
>
> Best,
> Peter
>
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0

2009-10-07 Thread J. William Campbell
Joakim Tjernlund wrote:
>> On Wed, Oct 7, 2009 at 5:55 PM, Wolfgang Denk  wrote:
>> 
>>> Dear Graeme Russ,
>>>
>>> In message  
>>> you wrote:
>>>   
 I think that even the -mrelocatable / .fixup method may not be needed at
 all. -pie / -pic used by themselves creates enough information for an OS
 dynamic loader to relocate an executable, so why not U-Boot? Given that
 the type and location of each section is easily determined, a striped
 down dynamic loader can provide a platform-independent relocation scheme.
 
>>> One reason for not using ELF images for the boot loader is size. The
>>> ELF header alone is often more than we would be willing to accept, not
>>> to mention the additional code.
>>>
>>>   
>> But the headers get stripped from the final binary. All we are left with in
>> order to locate the ELF section data are the symbols exported from the
>> linker script
>>
>> The extra code is only three very tight for-loops. I had them wrapped in
>> functions to improve readability, but they are good inline candidates (only
>> called once each) and I doubt they use much code space at all (I'll send
>> through actual numbers soon)
>> 
>
> But how much space in the extra sections you link in?
>
> if size is comparable with fixup ptrs we should probably
> consider using the same for ppc. Then we can use -fpic/-fpie and
> that is significantly smaller then -fPIC on PPC.
>
>   
>> Question is, does -mrelocatable result in smaller .got (et al) are is the
>> .fixup section adding extra size for the sake of ease of implementation?
>> 
>
> fixup section expands with lots of ptrs, then fixup is placed just after .got
>
>
>   
There is also a trade-off in using the -mrelocatable / .fixup method 
that should be considered in the general case. -fPIC or even -fpic code 
is almost always larger than the code generated for -mrelocateable, so 
it it the final size of the object blob in prom that matters, not just 
the length of the fixup segment. This effect is architecture dependent, 
where PIC is much cheaper on a PPC than on an Intel 386. So the "best" 
combination may depend on the chip.
I would also strongly encourage the -mrelocateable method be always 
available as an option. Historically, -fPIC has been buggy in initial 
releases of GCC for new architectures, probably because you can do 
useful work with just relocateable code and PIC can come "later". So it 
may be that u-boot would need an -mrelocateable approach for some period 
until -fPIC/-fpic worked. In any case, it would be a good fallback if 
one suspected a bug in PIC.

Best Regards,
Bill Campbell
___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0

2009-10-06 Thread J. William Campbell
Peter Tyser wrote:
> On Tue, 2009-10-06 at 13:34 -0700, J. William Campbell wrote:
>   
>> Peter Tyser wrote:
>> 
>>> On Tue, 2009-10-06 at 19:51 +0200, Wolfgang Denk wrote:
>>>   
>>>   
>>>> Dear Peter Tyser,
>>>>
>>>> In message <1254843932.24664.2083.ca...@localhost.localdomain> you wrote:
>>>> 
>>>> 
>>>>> I personally like the current implementation of putting the bss after
>>>>> the entire U-Boot image.  It keeps U-Boot's code, malloc pool, stack,
>>>>> bss, etc all in the same general area which is nice, and has the side
>>>>> benefit that the bootpg won't be overwritten.
>>>>>   
>>>>>   
>>>> OK, if you think so...
>>>>
>>>> 
>>>> 
>>>>> I know ORing in 0x10 is a bit ugly, but what's the real downside of
>>>>> doing it?
>>>>>   
>>>>>   
>>>> Nothing. I just hate to allocate the bss at 0x0, because this is
>>>> actually incorrect - it's the result of an address overflow /
>>>> truncation, and pretty much misleading to someone trying to read and
>>>> understand the code. For the linked image, it does not _look_ as if
>>>> the bss was located _after_ the U-Boot image, it looks detached and
>>>> allocated in low RAM.
>>>> 
>>>> 
>>> Do you have a preference Kumar?  You're probably going to be the first
>>> in line to have to deal with any resulting confusion:)
>>>
>>> I personally would rank the options:
>>> 1. OR in an offset to the bss address and leave some good comments in
>>> the linker script and commit message
>>>
>>> 2. Make the bss the last section like other PPC boards which would
>>> result in the bootpg sometimes being overwritten
>>>
>>> 3. Put the bss at an arbitrary address
>>>   
>>>   
>> FWIW, I think an arbitrary address disjoint from the u-boot addresses is 
>> best. While u-boot is in ROM, you can't use the bss anyway. The bss will 
>> actually be located at an address selected by the u-boot code itself 
>> after memory is sized. All references to the bss will be re-located by 
>> subtracting the arbitrary start address and adding the run-time chosen 
>> start address. So the linked start address is not important, except that 
>> is cannot be NULL or it may confuse the relocation code that doesn't 
>> want to re-locate NULL pointers. Some of the confusion in this 
>> discussion probably stems from the fact that the linker scripts make the 
>> bss look like "part of u-boot", when it is really not. It is just a 
>> chunk of "zero'ed" ram, located anywhere the u-boot code decides to put 
>> it. An arbitrary strange address would make this more apparent.
>> 
>
> Hi Bill,
> What's the advantage of having the bss not be located next to U-Boot?
> The big disadvantage of picking an arbitrary address for the bss is that
> there's now 1 more magical section of SDRAM that the user needs to know
> shouldn't be used.  I already field enough question from people that
> corrupt their exception vectors or stack/malloc pool/u-boot code, I
> don't want to add more bss questions:)
>   
Hi Peter,
  The point is that the address chosen for the ld step is NOT the 
address in ram where the bss will reside anyway. This address can 
overlap the exception vectors, stack, or even the u-boot code itself and 
it wouldn't matter (other than possible confusion). The actual physical 
address where the bss and u-boot itself resides is COMPUTED by u-boot 
after it sizes memory. u-boot only needs to know how big the section is 
in order to allow enough room. All references to the bss will then be 
re-located correctly. Where the bss actually ends up is a function of 
u-boot code. It may be on some processors that the computation of bss 
start is done assuming the bss is adjacent to u-boot in the original 
memory map, but if so, it is an un-necessary restriction. All that is 
required is a "safe" chunk of ram, which is also what is needed for 
stack and malloc area and should be chosen in a similar manner. So at 
run time, bss probably ends up adjacent to u-boot in ram because that's 
how it was coded, but at ld time it shouldn't matter.

Best Regards,
Bill Campbell
> Best,
> Peter
>
> PS. please keep the original email recipients on CC
>
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH 0/2] Make sure 85xx bss doesn't start at 0x0

2009-10-06 Thread J. William Campbell
Peter Tyser wrote:
> On Tue, 2009-10-06 at 19:51 +0200, Wolfgang Denk wrote:
>   
>> Dear Peter Tyser,
>>
>> In message <1254843932.24664.2083.ca...@localhost.localdomain> you wrote:
>> 
>>> I personally like the current implementation of putting the bss after
>>> the entire U-Boot image.  It keeps U-Boot's code, malloc pool, stack,
>>> bss, etc all in the same general area which is nice, and has the side
>>> benefit that the bootpg won't be overwritten.
>>>   
>> OK, if you think so...
>>
>> 
>>> I know ORing in 0x10 is a bit ugly, but what's the real downside of
>>> doing it?
>>>   
>> Nothing. I just hate to allocate the bss at 0x0, because this is
>> actually incorrect - it's the result of an address overflow /
>> truncation, and pretty much misleading to someone trying to read and
>> understand the code. For the linked image, it does not _look_ as if
>> the bss was located _after_ the U-Boot image, it looks detached and
>> allocated in low RAM.
>> 
>
> Do you have a preference Kumar?  You're probably going to be the first
> in line to have to deal with any resulting confusion:)
>
> I personally would rank the options:
> 1. OR in an offset to the bss address and leave some good comments in
> the linker script and commit message
>
> 2. Make the bss the last section like other PPC boards which would
> result in the bootpg sometimes being overwritten
>
> 3. Put the bss at an arbitrary address
>   
FWIW, I think an arbitrary address disjoint from the u-boot addresses is 
best. While u-boot is in ROM, you can't use the bss anyway. The bss will 
actually be located at an address selected by the u-boot code itself 
after memory is sized. All references to the bss will be re-located by 
subtracting the arbitrary start address and adding the run-time chosen 
start address. So the linked start address is not important, except that 
is cannot be NULL or it may confuse the relocation code that doesn't 
want to re-locate NULL pointers. Some of the confusion in this 
discussion probably stems from the fact that the linker scripts make the 
bss look like "part of u-boot", when it is really not. It is just a 
chunk of "zero'ed" ram, located anywhere the u-boot code decides to put 
it. An arbitrary strange address would make this more apparent.

Best Regards,
Bill Campbell
> Best,
> Peter
>
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>
>
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] [PATCH] CFI Driver: Reset watchdog timer after each flash operation

2009-10-02 Thread J. William Campbell
Mike Frysinger wrote:
> On Friday 02 October 2009 08:30:51 Wolfgang Denk wrote:
>   
>> Ingo van Lil wrote:
>> 
>>> The CFI driver does not reset the device's watchdog, so long-running
>>> flash operations will cause the watchdog timer to expire. A comment in
>>> flash_status_check() suggests that udelay() is expected to reset the
>>> watchdog, but I can't find any architecture where it does.
>>>   
>> Please have a closer look, then. On PowerPC, udelay()
>> ["lib_ppc/time.c"] calls wait_ticks(), which in turn
>> ["lib_ppc/ticks.S"] calls WATCHDOG_RESET
>>
>> If this is missing in other architectures, it should be fixed at the
>> root cause, i. e. in udelay() or in the respective support routines.
>> 
>
> Blackfin is missing it as well as i really had no idea it was supposed to be 
> there.  certainly no doc states this requirement.  perhaps it'd make sense to 
> break apart the common stuff to a common udelay() that does things like call 
> the watchdog and then call the implementation __udelay().  there should be at 
> least a doc/README.arch that includes these kind of details ...
> -mike
>   
>   
At this point it might be appropriate to ask if including such a reset 
in udelay() is a good idea. The way it is, no "infinite loop" in u-boot 
that contains a udelay() will ever allow the watchdog to time out. That 
restriction is somewhat non-obvious. I would argue that there are 
basically two classes of commands in u-boot, those that should execute 
so fast that there is no way the watchdog should be triggered, and 
commands that may take "a long time" during  which the watchdog may 
erroneously be triggered if there is no provision made to reset it. In 
the second case, the resetting of the watchdog must be placed where 
"measurable progress" towards command completion is being made, i.e. 
there is a certainty that we are getting closer to a command complete. 
Just because the program invokes udelay, there is no assurance that 
measurable progress is being made. The udelay may be part of a process 
that will never complete. Therefore, having a watchdog reset in udelay 
seems like a less than optimal idea in general. Maybe now would be a 
good time to look at removing it?

Bill Campbell
> 
>
> ___
> U-Boot mailing list
> U-Boot@lists.denx.de
> http://lists.denx.de/mailman/listinfo/u-boot
>   

___
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot


Re: [U-Boot] Virtual addresses, u-boot, and the MMU

2009-09-03 Thread J. William Campbell
Becky Bruce wrote:
>
> On Sep 2, 2009, at 2:59 AM, Wolfgang Denk wrote:
>
>> Dear "J. William Campbell",
>>
>> In message <4a9d99b1.1010...@comcast.net> you wrote:
>>>
>> ...
>>>> Becky then posted the summary of this discussion here:
>>>>
>>>> http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/50705
>> ...
>>> In quick summary, for the next few years, we will require that all
>>> "important" physical addresses have corresponding virtual addresses.
>>
>> ...unless there are good reasons to deviate from that rule, but these
>> cases are expected to be rare exceptions from the rule.
>>
>>
>>>> In any case, I think we should be careful not to mix things: what  we
>>>> are  discussing here are address mappings. What we are not discussing
>>>> is specific memory properties like  being  cached/uncached,  guarded/
>>>> non-guarded, etc.
>>>>
>>>> Such properties are important, too, but  they  need  to  get  handled
>>>> through a separate interface.
>>>>
>>> Here is where I am quite sure you are going to have a problem. In very
>>> many CPUs, cache control and memory management are joined at the hip.
>>> Some systems have no easy way to enable and disable (D,I) cache
>>> globally, it is only doable on a page or segment basis. The PPC 
>>> hardware
>>> has a relatively low cost way to do so, but not all architectures do.
>>
>> I am well aware of this problem, which is one of the reasons that the
>> majority of systems is running  with  data  cache  turned  off.  Even
>> PowerPC  has  DC off (at least after relocation to RAM), ARM does not
>> implement cache support yet, etc.
>>
>> When we state that U-Boot is a boot loader and thus  should  be  kept
>> simple,  this  more or less logically results in the consequence that
>> if it's difficult to enable the DC (on some systems), then just don't
>> do it, then.
>>
>> Nobody enforces you to enable caches when you find it hard to do.
>>
>>
>>> Very True. I did forget about the read being just a memory 
>>> reference. So
>>> if we desire the flash to be cached, it would have to "normally" be
>>> cached for reads to take advantage of the operation.
>>
>> ACK.
>>
>>> Thanks for looking at this. It therefore seems to me that adding an
>>> "uncache(virtual address)" operation (that may return a substitute
>>> address for the actual write to the flash) followed by a
>>> "restore_cache()" operation inside the flash driver write routine 
>>> should
>>> work. The uncache routine would do nothing if the flash is not 
>>> cached to
>>
>> This is where Detlev's comment about using the chance to define a
>> cache API comes into play.
>>
>> Yes, we probably should create a set of functions like
>>
>> enable_data_cache(address, size);
>> and
>> disable_data_cache(address, size);
>>
>> which would turn on resp. off the caching attributes on the given
>> memory range.
>>
>>> begin with, would globally turn off data cache if that is easy to 
>>> do, or
>>> would provide an alternate virtual address to be used in the write. 
>>> That
>>
>> This is where I disagree.
>>
>> I'm not really deep enough in the implementation details and thus
>> would appreciate comments for example from Becky and Stefan. In my
>> opinion, turning on or off the cache on an address range should be
>> implemented as outlined above, i. e. as an operation changing the
>> caching properties of the address range.
>
> This makes sense to me.  The disable function would need to flush the 
> range from the cache, but that's the only difficulty I forsee.  
> However, I dug up some AVR32 docs, and it looks like the whole dual 
> cacheable/CI mapping thing may be architectural.  The architecture 
> seems to specify a virtual memory map for privileged state, and part 
> of the VA range is not translated by the MMU, but has a default 
> translation.  There appear to be segments in the untranslated VA space 
> that map to the same PA, one cacheable and the other not.
Unfortunately, more than one architecture that has  problems turning off 
cache in an arbitrary region of memory. If a system has a "global" cache 
disable, that would be the easiest option. Since u-boot is single 
threaded, we don't have to worry about the cache not being turned back 
on b

Re: [U-Boot] Virtual addresses, u-boot, and the MMU

2009-09-01 Thread J. William Campbell
Wolfgang Denk wrote:
> Dear "J. William Campbell",
>
> In message <4a9d5ef2.4030...@comcast.net> you wrote:
>   
>>   I have followed the recent discussions about problems in the CFI 
>> driver caused by the need to change the attributes of the address at 
>> which the flash is mapped. This discussion has raised some questions in 
>> my mind regarding the assumptions u-boot makes regarding the behavior of 
>> the addresses used in the code. I am writing this message for comments 
>> that may correct any mis-understandings I have. First, the addresses 
>> used by the u-boot program to reference memory and code are all 
>> "virtual" addresses (VA) because they go through the MMU  to become a 
>> physical address(PA). If there is no MMU, the virtual address and the 
>> physical address are identical.
>> 
>
> Even if there is a MMU, we keep it "switched off" on most systems, or
> otherwise set it up in such a way that there is still a 1:1  mapping,
> i. e. the virtual address and physical address are identical, too.
>
> There have been several discussions concerning this topic on IRC
> (#u-boot at freenode); I'll try to summarize these here - not sure
> though if I don't miss anything: please feel free to complement what
> might be missing.
>
>   

> Becky then posted the summary of this discussion here:
>
> http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/50705
>
>
> Note that there was a general agreement among those who raised their
> voices.
>
>   
In quick summary, for the next few years, we will require that all 
"important" physical addresses have corresponding virtual addresses. 
Some limited support for mapping in "other resources" may be provided at 
an operator interface level, but it will be quite limited. OK, seems 
reasonable to me.
>   
>> The "normal", or legacy,policy for u-boot is to arrange a memory 
>> map such that all physical addresses are mapped to some virtual address. 
>> Originally, the mapping were such that the VA was actually == the PA, 
>> but today on some CPUs, this is not possible. When the size of the 
>> physical address space exceeds the size of the virtual address space, 
>> the VA may not =- the PA numerically, but there is a one-to-one 
>> correspondence. It MAY also acceptable to map the same PA so that it 
>> appears more than once in the address space (VA), but if this is done, 
>> 
>
> This may or may not be possible. It may even make sense or be needed
> on some systems, and it may be impossible to do on others.
>
> In any case, I think we should be careful not to mix things: what  we
> are  discussing here are address mappings. What we are not discussing
> is specific memory properties like  being  cached/uncached,  guarded/
> non-guarded, etc.
>
> Such properties are important, too, but  they  need  to  get  handled
> through a separate interface.
>   
Here is where I am quite sure you are going to have a problem. In very 
many CPUs, cache control and memory management are joined at the hip. 
Some systems have no easy way to enable and disable (D,I) cache 
globally, it is only doable on a page or segment basis. The PPC hardware 
has a relatively low cost way to do so, but not all architectures do.
>   
>>   Becky Bruce "re-wrote the driver to use VAs instead of PAs." I am 
>> not exactly sure what this means, but I assume it meant allowing the VA 
>> referencing the flash
>> to be distinct from the PA where the flash "lives" (and may require 36 
>> bits on a PPC system to represent the PA). Does the driver re-map 
>> 
>
> I think the information provided above sheds more light on this.
>   
Yes, it did.
>   
>> portions of the flash in order to access them? If the flash is really 
>> large, I can certainly see a need to do so. However, I assume on "medium 
>> size" flashes, it does not need to remap. In that case, don't all 
>> references just go through the MMU and get translated? The VA != PA, but 
>> from the point of view of u-boot, the VA is the only address that 
>> matters. The AVR32 certainly does not map flash dynamically, so it would 
>> not matter on that CPU.
>> 
>
> OK.
>
>   
>>   The issue with the CFI driver on the AVR32 is that it needs to 
>> disable cache on the address space of the flash memory when it is 
>> writing to the flash. This apparently is not trivial to do, but there is 
>> 
>
> Actually this is not specific to the AVR32, and so far  most  systems
> simply  do  not  enable  caches at all on the flash memor

Re: [U-Boot] Virtual addresses, u-boot, and the MMU

2009-09-01 Thread J. William Campbell
  I have followed the recent discussions about problems in the CFI 
driver caused by the need to change the attributes of the address at 
which the flash is mapped. This discussion has raised some questions in 
my mind regarding the assumptions u-boot makes regarding the behavior of 
the addresses used in the code. I am writing this message for comments 
that may correct any mis-understandings I have. First, the addresses 
used by the u-boot program to reference memory and code are all 
"virtual" addresses (VA) because they go through the MMU  to become a 
physical address(PA). If there is no MMU, the virtual address and the 
physical address are identical.
The "normal", or legacy,policy for u-boot is to arrange a memory 
map such that all physical addresses are mapped to some virtual address. 
Originally, the mapping were such that the VA was actually == the PA, 
but today on some CPUs, this is not possible. When the size of the 
physical address space exceeds the size of the virtual address space, 
the VA may not =- the PA numerically, but there is a one-to-one 
correspondence. It MAY also acceptable to map the same PA so that it 
appears more than once in the address space (VA), but if this is done, 
any references inside u-boot (such as in drivers) must be consistent, 
i.e. the flash chip must exist at exactly one virtual address from 
u-boots viewpoint. This is required because the u-boot drivers maintain 
pointers into the flash, and it is important to be able to compare these 
pointers to input virtual addresses. It is also necessary that the VA of 
the flash chip not change after the driver has been set up, or else any 
saved pointers will be wrong.
  Becky Bruce "re-wrote the driver to use VAs instead of PAs." I am 
not exactly sure what this means, but I assume it meant allowing the VA 
referencing the flash
to be distinct from the PA where the flash "lives" (and may require 36 
bits on a PPC system to represent the PA). Does the driver re-map 
portions of the flash in order to access them? If the flash is really 
large, I can certainly see a need to do so. However, I assume on "medium 
size" flashes, it does not need to remap. In that case, don't all 
references just go through the MMU and get translated? The VA != PA, but 
from the point of view of u-boot, the VA is the only address that 
matters. The AVR32 certainly does not map flash dynamically, so it would 
not matter on that CPU.
  The issue with the CFI driver on the AVR32 is that it needs to 
disable cache on the address space of the flash memory when it is 
writing to the flash. This apparently is not trivial to do, but there is 
a second mapping that does have the cache off. Wolfgang has recommended 
the creation of a function to turn off the cache for the flash area, and 
also (presumably) one to turn the flash back on when the write is 
complete. Haavard has at present a function that returns an alternate 
address with the cache already off that addresses the same memory. This 
wouldn't cause a problem if the mapping happened immediately before the 
actual copy operation took place and was used for nothing else. However, 
if it happens early on in the driver, the address will not match the 
structure set up by the rest of the flash code using the non-translated 
address.
Therefore, I have the following questions: If the map_physmem() 
macro is removed from the driver on the AVR32, does the driver work if 
it is told that the flash PA=VA = the un-cached address? If not, why 
not? Shouldn't this be just like any CFI on an un-cached PPC address? 
The driver will be somewhat slower reading but otherwise it should work. 
If/when it does work, couldn't a map_in_cache() macro be placed directly 
in front of the read code that copies data from flash to other buffers. 
The macro would return an address of the same data referenced through a 
cached address if it exists. This address would go nowhere else and 
never be stored anywhere. This would speed up the copy operation for 
situations where it matters, and is applicable to all platforms that can 
do such a thing. The most general solution would be a call to 
map_in_cache/map_in_not_cache for both reads and writes in the CFI 
driver. These routines would return a "substitute" address (or the same 
input one), and may actually add another mapping dynamically, use a 
pre-existing appropriate mapping, just turn on/turn off data cache 
globally, or do nothing). At the end of the copy, map_restore() would 
put the map back  By default, the assumption would be that the flash is 
not cached and the macros do nothing. Sounds simple to me, what have I 
overlooked?

Bill Campbell
> Haavard Skinnemoen wrote:
>
> 
>
>   
>> Right...I'm beginning to doubt that anyone is familiar enough with the
>> u-boot code, since everyone seems to have their own opinion about how
>> things are supposed to work.
>>
>> To summarize, here are the possible ways to fix the problem as I see it: