Re: Mersenne: Re: Large memory pages in Linux

2003-12-24 Thread Nick Craig-Wood
On Wed, Dec 24, 2003 at 07:46:32AM +, Brian J. Beesley wrote:
> On Tuesday 23 December 2003 20:15, Matthias Waldhauer wrote:
> >
> > Last friday I read some messages about recent kernel modifications and
> patches for version 2.6.0. There is an "implicit_large_page" patch,
> > allowing applications to use large pages without modifications. I don't
> > have the time to dig into it :(
> 
> Sure. This is a much better approach than mucking about with 
> application-specific modifications which would likely involve serious 
> security hazards (leaking kernel privileges to the application) and/or clash 
> with other applications' private large-page code and/or large page enabled 
> kernels in the future.

It's also a much more invasive approach - changing something as
fundamental as the kernel's page size (and the assumption that it's
constant) is very hard work - don't expect it in 2.6!

As for security hazards - the hugepage implementation in 2.6 has been
very well thought out.  No extra privs are required.  The super user
must :-

a) allocate some huge pages
b) mount the hugetlbfs filesystem
c) use standard unix permissions so the right user(s) can use
   hugetlbfs

The permissioned user can then allocate huge pages using mmap out of
hugetlbfs.
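As a sketch, steps a)-c) plus the mmap look like this (the mount point, pool size and file name here are assumptions, not from the original posts; huge pages on this hardware are 4 MB):

```python
import mmap
import os

# Super user setup (shell), roughly:
#   echo 32 > /proc/sys/vm/nr_hugepages        # a) allocate huge pages
#   mount -t hugetlbfs none /mnt/hugetlbfs     # b) mount the filesystem
#   chown user /mnt/hugetlbfs                  # c) normal unix permissions

HUGE_PAGE = 4 * 1024 * 1024  # assumed 4 MB pages, as on the hardware here

def round_up(n, page=HUGE_PAGE):
    # hugetlbfs only hands out whole huge pages, so round the request up
    return (n + page - 1) & ~(page - 1)

def huge_map(nbytes, mount="/mnt/hugetlbfs"):
    """Map nbytes of huge-page memory out of a hugetlbfs mount."""
    fd = os.open(os.path.join(mount, "scratch"), os.O_CREAT | os.O_RDWR, 0o600)
    try:
        return mmap.mmap(fd, round_up(nbytes))  # backed by the huge page pool
    finally:
        os.close(fd)
```

The `huge_map` call only succeeds on a machine where the super user has actually done steps a)-c).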

My particular hack uses a couple of LD_PRELOAD libraries, which
doesn't require patching the application, eg

  LD_PRELOAD="`pwd`/intercept.so `pwd`/alloc.so" ./mprime -m

So you can use huge pages with any application that is dynamically
linked.

This seems pretty secure and non-invasive to me ;-)

(I will shortly put full instructions and code on a web page - family
duties over Christmas permitting!)

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
_
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers


Re: Mersenne: Re: Large memory pages in Linux

2003-12-24 Thread Nick Craig-Wood
On Tue, Dec 23, 2003 at 09:15:46PM +0100, Matthias Waldhauer wrote:
> For a TLB aware client like Prime95 the improvements are small as 
> expected.

Kudos to George ;-)

> But I'm sure that speedups greater than ~2.5% are possible by
> modifying the FFTs to make optimal use of large pages.

I'd love to have George's opinion on this.

> For most FFT sizes currently in use the available number of 2/4MB
> TLB entries cover the complete working set. But for larger sizes it
> is again necessary to take care of TLB entries.

My PII laptop has 32 TLB entries and 4 MB pages, so that's 128 MB of
"flat" memory which is really quite a large FFT!

> Last friday I read some messages about recent kernel modifications
> and patches for version 2.6.0. There is an "implicit_large_page"
> patch, allowing applications to use large pages without
> modifications. I don't have the time to dig into it :(

Do you mean wli's superpage patches?  I think he is aiming those for
2.7.  They sound very interesting but will require extensive reworking
of the kernel's internal assumption that a page is a constant size.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]


Mersenne: Large memory pages in Linux

2003-12-23 Thread Nick Craig-Wood
On Wed, Dec 10, 2003 at 06:26:54PM +, Nick Craig-Wood wrote:
> I'm in the process of writing (not quite finished or working ;-) some
> code which you load as an LD_PRELOAD library under linux.  This gets
> its fingers into the memory allocation, and makes all malloc space
> come from hugetlbfs (how you get large pages under linux).
> 
> My primary user for this was to be mprime of course!

Well I finished the code and here are the results on my lowly laptop
running 2.6.0.

Intel(R) Pentium(R) III processor
CPU speed: 550.78 MHz
CPU features: RDTSC, CMOV, PREFETCH, MMX, SSE
L1 cache size: 16 KB
L2 cache size: 256 KB
L1 cache line size: 32 bytes
L2 cache line size: 32 bytes
TLBS: 64
Prime95 version 22.12, RdtscTiming=1

Normal
--

Best time for 256K FFT length: 80.256 ms.
Best time for 320K FFT length: 101.820 ms.
Best time for 384K FFT length: 125.191 ms.
Best time for 448K FFT length: 145.505 ms.
Best time for 512K FFT length: 161.178 ms.
Best time for 640K FFT length: 215.113 ms.
Best time for 768K FFT length: 258.055 ms.
Best time for 896K FFT length: 304.786 ms.
Best time for 1024K FFT length: 345.747 ms.
Best time for 1280K FFT length: 449.540 ms.
Best time for 1536K FFT length: 541.963 ms.
Best time for 1792K FFT length: 661.651 ms.

With all memory allocations coming from 4 MB pages
--

Best time for 256K FFT length: 79.293 ms.   1.2%
Best time for 320K FFT length: 102.032 ms. -0.2%
Best time for 384K FFT length: 124.022 ms.  0.9%
Best time for 448K FFT length: 145.492 ms.  0.0%
Best time for 512K FFT length: 161.568 ms. -0.2%
Best time for 640K FFT length: 213.311 ms.  0.8%
Best time for 768K FFT length: 254.609 ms.  1.3%
Best time for 896K FFT length: 301.911 ms.  0.9%
Best time for 1024K FFT length: 339.203 ms. 1.9%
Best time for 1280K FFT length: 439.119 ms. 2.3%
Best time for 1536K FFT length: 531.422 ms. 1.9%
Best time for 1792K FFT length: 645.350 ms. 2.5%
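(The percentage column is just the relative change against the Normal run; computed as below, using the 256K and 1792K timings from the two tables.)

```python
def improvement(normal_ms, huge_ms):
    # percentage speedup of the huge-page run relative to the normal run
    return round((normal_ms - huge_ms) / normal_ms * 100, 1)

improvement(80.256, 79.293)    # 256K FFT  -> 1.2
improvement(661.651, 645.350)  # 1792K FFT -> 2.5
```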

So consistent but small improvements in the larger FFTs.  This just
goes to show what a good job George has done in not thrashing the TLB!

I wonder if Prime95 could be made more efficient if it didn't have to
worry about the TLB?  It's obviously detecting the number of TLB slots
for this computer, which is wrong in this case - perhaps there is a way of
overriding this?

Please email me if you'd like to experiment with the code - it's quite
simple (it just took rather a lot of different approaches to get
right!).  You'll need to be running 2.6.0 with HUGETLB support if you
want to play (see hugetlbpage.txt in Documentation in the kernel
source for more info).

-- 
Nick Craig-Wood
[EMAIL PROTECTED]


Re: Mersenne: Large memory pages in Linux and Windows Server (64bit?)

2003-12-10 Thread Nick Craig-Wood
On Wed, Dec 10, 2003 at 02:50:11PM +0100, Matthias Waldhauer wrote:
> Back to the topic:
> Some time ago there was a discussion going on regarding the use of large 
> memory pages. In a mersenneforum thread I collected some info regarding 
> new linux kernels and some real world results published in a paper.
> 
> Here some extracts:
> Linux kernel versions 2.5.36+ and 2.6 include a "HugeTLBs" patch, which 
> allows an application to allocate large memory pages.
> Also 64bit Windows Server seems to support them too:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/large_page_support.asp
> 
> My thoughts about the possibilities:
> Oracle managed to get an 8% speedup by using large pages. Although I 
> have little experience in this area I think for FFTs the speedup will be 
> much larger
[snip]

I agree!  I have been thinking about the exact same thing.

I'm in the process of writing (not quite finished or working ;-) some
code which you load as an LD_PRELOAD library under linux.  This gets
its fingers into the memory allocation, and makes all malloc space
come from hugetlbfs (how you get large pages under linux).

My primary user for this was to be mprime of course!

When finished this should be an easy way of trying any application
with large pages without having to modify it.

I haven't finished the code yet, but it shouldn't take long,
especially if people are asking for it!

-- 
Nick Craig-Wood
[EMAIL PROTECTED]


Re: Mersenne: Re: mprime, linux and 2 MB pages

2002-03-19 Thread Nick Craig-Wood

On Tue, Mar 19, 2002 at 10:03:50PM +0100, Steinar H. Gunderson wrote:
> Paste from gwnum.c, Prime95 v19:
> 
> /* Well I implemented the above only to discover I had dreadful */
> /* performance in pass 1.  How can that be?  The problem is that each  */
> /* cache line in pass 1 comes from a different 4KB page.  Therefore, */
> /* pass 1 accessed 128 different pages.  This is a problem because the */
> /* Pentium chip has only 64 TLBs (translation lookaside buffers) to map */
> /* logical page addresses into physical addresses.  So we need to shuffle */
> /* the data further so that pass 1 data is on fewer pages while */
> /* pass 2 data is spread over more pages. */
> 
> So, it might be that due to TLB thrashing, George would have to
> choose a less efficient memory layout to avoid them, and thus get
> lower speed overall.

I think 2MB pages will be a win for any memory layout using < 128 MB
of RAM (64 TLB entries * 2 MB pages).

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Re: Mersenne: mprime, linux and 2 MB pages

2002-03-19 Thread Nick Craig-Wood

On Tue, Mar 19, 2002 at 08:11:54PM +, Brian J. Beesley wrote:
> Ah, but ... frequently accessing pages (virtual _or_ physical) will
> keep the TLB pages from getting too far away from the processor;
> probably at worst they will stay in the L1 cache.

Yes.

> The overhead of accessing from L1 cache is small compared with the
> overhead of accessing data from main memory, and _tiny_ compared
> with the overhead of accessing data from the page/swap file.

Yes... but a TLB miss costs one or maybe two extra memory cycles, ie
it halves the performance if you are missing on every fetch.

> Don't you _need_ to have at least enough TLB entries to map the
> whole of the processor cache? (Since without it you can't map the
> cache table entries...)

Hmm, not sure.  The processor cache isn't directly mapped into memory
so it doesn't need TLB entries.  Depending on the architecture the
cache will either cache physical addresses or logical addresses.

[snip]
> Whatever effect TLB thrashing may or may not be having, it doesn't look as 
> though it's having a dominant effect on mprime.

Very true and that is a testament to George's programming skills.  A
naively programmed FFT will demonstrate TLB thrashing admirably!

> > I think this would make a real difference to mprime - what percentage
> > I don't know - at the cost of on average 1 MB of RAM extra.
> 
> I wouldn't mind _doubling_ the memory footprint, if we got a _significant _
> performance boost as a consequence.

Me neither.  I'd like to try it but I need to persuade some friendly
kernel hacker to implement it for me ;-) I would guess it might make
at most 10% difference to the run time.

> BTW why does this argument apply only to mprime? Surely Windows has the same 
> underlying architecture - though obviously it's harder to get the Windows 
> kernel changed than linux. 

It applies exactly the same to Windows of course.  It's just that you
can chat with the Linux kernel developers on the mailing list ;-)

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Re: Mersenne: mprime, linux and 2 MB pages

2002-03-19 Thread Nick Craig-Wood

On Mon, Mar 18, 2002 at 02:12:48PM +, Brian J. Beesley wrote:
> On Monday 18 March 2002 10:21, Nick Craig-Wood wrote:
> > There has been some discussion on the linux kernel mailing list about
> > providing 2 MB pages (instead of 4kB ones) to user space for the use
> > of database or scientific calculations.
> >
> > It seems to me that prime95/mprime would benefit from this enormously
> > - it should reduce the TLB thrashing to practically zero and hence
> > speed up mprime by some unknown amount.
> >
> > Is this true?  Should we put in our plea to the developers?
> 
> Other people may and probably will disagree, but I think it will make very 
> little if any difference for most applications.

For most applications yes - however it would be configurable in a
per-process manner.  Possibly each process might have to take special
action to get 2 MB pages - say a new flag to mmap().

> The point is that mprime should normally be running on a system in a way 
> which means that all its active data pages are in memory. Having active data 
> paged out will cause a hideous performance hit.
> 
> If the active data is already memory resident, TLB thrashing is not going to 
> be an issue.

The TLB (translation lookaside buffer) has very little to do with the
virtual memory system.  The TLB is used by the processor to cache the
address translations from logical memory to physical memory.  These
have to be read from the page tables in RAM, which is expensive - hence
the cache.

When I was working on a DWT implementation for StrongARM I found that
thrashing the TLB caused a factor of two slow down.  The StrongARM
system I was using had no virtual memory.

If mprime is using 10 MB of memory, say, then each 4 kB page needs 1 TLB
entry to be used at all by the processor - ie 2560 TLB entries, which
is way bigger than the size of the TLB in the processor (I don't
remember what it is on x86 but on StrongARM it has 32 entries).  To
access each physical page the TLB has to be reloaded from the page
tables, which is an extra memory access or two.  If you use 2 MB pages
then only 5 pages are needed, so the TLB never needs to be refilled -
hence some speed gain.
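The arithmetic in that paragraph can be sketched directly (page sizes as above):

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

def pages_needed(working_set, page_size):
    # one TLB entry per mapped page, rounding up to whole pages
    return -(-working_set // page_size)  # ceiling division

working_set = 10 * 1024 * 1024               # mprime using 10 MB, as above
pages_needed(working_set, PAGE_4K)           # 2560 - dwarfs any TLB
pages_needed(working_set, PAGE_2M)           # 5 - fits with room to spare
```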

> If the page size is going to be changed at all, there is a lot to be
> said for using the same size pages as AGP hardware - 4MB I think -
> there have already been some issues on some Athlon (K7) architecture
> linux systems caused by incorrect mapping between linux virtual
> pages and AGP address space; obviously using the same page size
> removes this source of confusion.

The choice of page size is constrained by the memory management
hardware.  I think 4k and 2 MB are the only sensible choices but I may
be wrong.

> One factor with shifting to a much larger page size is a
> corresponding decrease in the number of pages available to the
> system - a 32 MByte system will have only 8 4MB pages resident in
> real memory at any one time. Since page access rules are often used
> to protect data from accidental modification by rogue pointers etc.,
> a big reduction in system physical page count is a distinctly mixed
> blessing.

The proposal was that you would be able to turn on 2 MB pages for a
given process - obviously you wouldn't want 2 MB pages for every
process unless you had > 100 GB of RAM.

I think this would make a real difference to mprime - what percentage
I don't know - at the cost of on average 1 MB of RAM extra.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Re: Mersenne: NTT faster than FFT ?

2002-03-18 Thread Nick Craig-Wood

On Sun, Mar 17, 2002 at 08:08:06PM -0500, Jason Papadopoulos wrote:
> If you want to explore your idea further, here are some primes that may
> be useful. All are of the form i*2^e+1, and have the advantage that roots
> of unity out to some small order are "simple":
> 
>iexponent   FFT radixDWT size   root
> 007d43548abc4361   320 8   2^17  67
> 04235deab8321000   512 8   2^22  19
> 0051656f7fe405c1  2048 8   2^19  29
> 07ab47a650572a01  2112 8   2^16  29
> 
> 06954fe21e3e8100   38432   N/A   37
>  290d741  179232   N/A5

These are big moduli?  Surely this would make it quite inefficient?

I spent a long time looking at all the options when I made the
StrongARM mersenne prime checker.  For that architecture a modulus of
2^64-2^32+1 was the easy winner given that StrongARM doesn't have
floating point or a divide instruction but does have a quick
32x32->64bit multiply.  More info here

  http://www.axis.demon.co.uk/armprime/

For other architectures (like Alpha) having the top bit set in the 64
bit modulus was a pain.

> My personal opinion is that NTTs will not be faster than FFTs until modular
> multiplication has the same throughput as floating point multiplication. 
> Right now we're at least a factor of 3-5 away from that.

I agree with you and this was my experience with StrongARM.  The
integer DWT was about 3 times slower than the equivalent pentium
implementation.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Mersenne: mprime, linux and 2 MB pages

2002-03-18 Thread Nick Craig-Wood

There has been some discussion on the linux kernel mailing list about
providing 2 MB pages (instead of 4kB ones) to user space for the use
of database or scientific calculations.

It seems to me that prime95/mprime would benefit from this enormously
- it should reduce the TLB thrashing to practically zero and hence
speed up mprime by some unknown amount.

Is this true?  Should we put in our plea to the developers?

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Re: Mersenne: slaying cpus with prime95

2002-01-14 Thread Nick Craig-Wood

On Mon, Jan 14, 2002 at 02:45:21PM -0500, Steve Elias wrote:
> here are some instances where i have damaged computers
> by (capriciously?) running prime95! 
> 
> 1 - i just got my wife's toshiba laptop back from toshiba warranty
> service.  running prime95 for ~6 months on it caused the fan to die,
> and then the laptop would overheat & shutdown even without prime95
> running.  apparently the heat caused lots of disk badblocks too.

Mmm, yes the fan on my laptop (Dell 38000) sounds like someone dumped
a load of grit in it - not good :-( I stopped running prime95 (mprime
actually) on it for that reason.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Re: Mersenne: Re: splitting up 10m digit primes

1999-10-18 Thread Nick Craig-Wood

On Tue, Oct 19, 1999 at 12:28:52AM -0500, Ken Kriesel wrote:
> How would you extend this concept to P-1, ECM, and factoring save files?

As far as I'm aware prime95 is the only program to do P-1, ECM or
factoring so I wasn't intending that the save files should include
these.  Maybe they should?

> Would revisions to the standard format be handled within the 8-byte
> Ascii file identifier, or in a separate field?

There was a 4 byte number for revisions at the start of the header.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/
_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Re: Mersenne: Re: splitting up 10m digit primes

1999-10-17 Thread Nick Craig-Wood

On Sun, Oct 17, 1999 at 03:21:49PM +0100, Brian J. Beesley wrote:
> On 17 Oct 99, at 8:22, Nick Craig-Wood wrote:
> > I'd like to propose for discussion a simple cross-platform save file format
> > as below.
> > 
> > Width   Value   Description
> > 
> > 8   "MersSave"  in ASCII as a file identifier
> > 4   int32   version number of file
> > 16  ASCII   space filled name of creating program
> 
> I want this longer, & to include version major/minor/release & build 
> numbers as well. 32 bytes?

Yes, sounds like a good idea.  Perhaps we should formalise the version
number like this?

1int8Major
1int8Minor
1int8Micro
1int8Patch level

> > 4   int32   exponent being tested (n)
> > 4   int32   iteration that this residue represents  
> > 4   int32   length of platform specific area x words
> > 4x  int32   platform specific
> 
> e.g. George's programs would want to store the offset here.

Ah, offset!  I wrote this before offset was introduced.  This would
need to be part of the save file I think, because a different offset =>
different save file data.  I think offset would fit into an int32?

> Would it be too much to ask to have an int32 count of "platform 
> specific" bytes following in a variable length field here? Some 
> platforms will require none (just a zero count), others might want to 
> embed a (currently unknown) vector of arbitary length at this point.

That was what I intended by the last two entries above, but I may not
have been clear enough.

4   int32   length of platform specific area x words
4x  int32   platform specific (x words of data may be 0 length)

> > len int32   binary data for the residue
> > len = ((n + 0x1F) & ~ 0x1F) >> 3
> > 4   int32   CCITT CRC-32 of the entire file 
> > 
> > All integers should be represented as little-endian in the file (the majority
> > of the clients are little-endian).  The residue should be little endian with
> > the unused 1-31 bits at the end set to 0.
> 
> Since the "majority of clients" which are "little endian" are 
> overwhelmingly IA32 processors, which have byte reversal instructions 
> specifically designed to do the job, I'd like to propose that the 
> integers be stored in "big-endian" format. Or, alternatively, that we 
> devote _two_ bits of the file version number - both the MS and LS 
> bits - to flags, set to 0 for little-endian format or 1 for big-
> endian format. Still leaves 30 bits for file version number info, and 
> allows clients to write whatever format is easiest. Write operations 
> outnumber read operations by a big margin!

This is sensible.  I like the big/little endian flag idea.  I've tried
to design the header so that it can be represented by a C struct and
just fwrite()-n straight to the file, and an endianness flag fits in
well with that.

Also it might just be better to store the actual numeric data as
int8s rather than int32s - that way the endianness of the processors
doesn't matter.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: Re: splitting up 10m digit primes

1999-10-16 Thread Nick Craig-Wood

On Sun, Oct 17, 1999 at 02:13:33AM +0200, Steinar H. Gunderson wrote:
> On Sat, Oct 16, 1999 at 03:46:30PM +0100, Chris Jefferson wrote:
> >However, proving that they has used the
> >interim file generated by this particular program might be difficult...
> 
> Not very. Save files are incompatible from program to program. Or at least
> so I think...

Some time ago I proposed a standard format for save files.  It also
had the advantage that it only saved the actual bits used so would
make the save files 1MByte rather than 1.8 MByte for an 8,000,000
exponent in prime95.

> Converting a savefile is also going to be a lot of work :-)

It needs to be done exactly right - hence probably by the program
which generated it.

Here is what I wrote :-



Reading in other threads about save files it seems to me that it would be
worth while standardising the format of save files so that they can be ported
across platforms.

At the moment I think both Prime95 and the unix programs write the floating
point values of the FFT array to a file which is inherently non-portable
across machine architectures.  It also makes the save files quite a bit
larger than necessary (an important consideration if we ever want to hold a
central repository of these files).

I'd like to propose for discussion a simple cross-platform save file format
as below.

Width   Value   Description

8   "MersSave"  in ASCII as a file identifier
4   int32   version number of file
16  ASCII   space filled name of creating program
4   int32   exponent being tested (n)
4   int32   iteration that this residue represents  
4   int32   length of platform specific area x words
4x  int32   platform specific
len int32   binary data for the residue
len = ((n + 0x1F) & ~ 0x1F) >> 3
4   int32   CCITT CRC-32 of the entire file 

All integers should be represented as little-endian in the file (the majority
of the clients are little-endian).  The residue should be little endian with
the unused 1-31 bits at the end set to 0.
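As a sketch, the header and residue length rule above pack like this with Python's struct module (field values are made-up examples; "<" means little-endian, "i" a 4-byte int32):

```python
import struct

def residue_len(n):
    # len = ((n + 0x1F) & ~0x1F) >> 3: n bits rounded up to whole
    # 32-bit words, expressed in bytes
    return ((n + 0x1F) & ~0x1F) >> 3

def pack_header(version, program, exponent, iteration, platform_words):
    fmt = "<8s i 16s i i i %di" % len(platform_words)
    return struct.pack(fmt,
                       b"MersSave",                        # file identifier
                       version,                            # format version
                       program.ljust(16).encode("ascii"),  # creating program
                       exponent,                           # exponent n
                       iteration,                          # iteration number
                       len(platform_words),                # platform area size
                       *platform_words)                    # platform specific
```

For an 8,000,000 exponent `residue_len` gives 1,000,000 bytes, which is where the "1 MByte rather than 1.8 MByte" figure comes from.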

A file format like this would have the following advantages

  1) You could take a save file from one platform/program to another (via a
 central repository at primenet maybe)
  2) You could take a save file from one FFT size to another (imagine a
 software upgrade where the FFT limits have changed)
  3) They are much smaller (eg 600k vs 2M)

It would have the disadvantages

  1) Slightly slower and more complex to read and write.  However this is an
 infrequent operation.  (That said time to write a save file may be IO
 limited in which case a shorter save file will be quicker.)
  2) We've all got to agree on it ;-)

Discuss!

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: gzipped binary expansion of the largest prime

1999-10-14 Thread Nick Craig-Wood

On Wed, Oct 13, 1999 at 11:10:29AM -0400, Darxus wrote:
> Can somebody give me the last few digits of the decimal expansion of
> (2^6972593)-1 so that I can verify my copy's intact ?

gp (also known as PARI) is a good tool for this...

$ gp
GP/PARI CALCULATOR Version 2.0.17 (beta)
[snip]
(08:21) gp > ((2^6972593)-1)%10^10
%1 = 2924193791

This took no time at all to calculate.  Whereas calculating the
complete decimal expansion took ages...
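The same check works in plain Python - modular exponentiation gives the low digits without ever forming the roughly 2.1-million-digit expansion:

```python
# last ten decimal digits of 2^6972593 - 1
last10 = pow(2, 6972593, 10**10) - 1
print(last10)  # agrees with gp's answer above
```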
-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: complaint

1999-09-16 Thread Nick Craig-Wood

On Wed, Sep 15, 1999 at 07:42:53PM +0100, Brian J. Beesley wrote:
> I'm going to be _very_ interested in how many people choose to run 
> $100,000 prize candidates using v19. There is an obvious balance 
> between the time it takes to complete a test & the enthusiasm of 
> people to participate, even if there is a substantial cash prize 
> riding on it.

If one had the data then it would be reasonably easy to calculate the
expectation (in $ per year) for a given exponent size & CPU.  You'd
need to know the iteration time for any given exponent for that CPU
and the probability distribution of mersenne primes at that exponent
size.  Unfortunately I don't know these otherwise I would have worked
it out ;-)

(Digression: Though expectation is a useful statistical measure, in
real life where very small probabilities are multiplied by very
large amounts the results are not useful practically.  Eg the UK
lottery - for your 1 pound stake your expectation is 50p.  Better to
go to the bookies, where for your pound you get an expectation of
95p...)

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: Down With The Evil Thread

1999-07-28 Thread Nick Craig-Wood

On Tue, Jul 27, 1999 at 04:28:41PM +, Henrik Olsen wrote:
> US: million 10^6, billion 10^9, trillion 10^12 ...
> 
> Non-US: million 10^6, milliard 10^9, billion 10^12,
>  billiard 10^15, trillion 10^18 ...

Actually I think most UK people would use the US definitions
nowadays (except maybe *really* old people ;-)

I certainly would...

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: Re: Mersenne Digest V1 #569

1999-06-08 Thread Nick Craig-Wood

On Mon, Jun 07, 1999 at 04:37:32PM -0500, JON STRAYER wrote:

> Am I correct in thinking that since the exponent is five times the
> size of current exponents that the LL test will take 25 times as
> long (give the same processor)?
> 
> > You need to test exponent (10000000 / log10 2) = 33219281 or above.

Time for an n bit multiply using the DWT is O(n log n), plus some fiddly
bits to do with the decreasing precision available when using longer
convolutions - I'd probably estimate this as O(log log n).  As it
takes n iterations for each test, testing an n bit number using
the Lucas-Lehmer test with the DWT takes O(n^2 . log n . log log n).

If we say that the current exponent is about 6.5 million then we need
to test one which is approx 5 times bigger.  Using the above formula I
make this 29.97 times longer for a 5 times increase in exponent size
starting at 6.5 million.

Probably ;-)
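Plugging in the numbers (a sketch assuming natural logs; 33219281 is the 10-million-digit threshold quoted above, 6.5 million the current exponent):

```python
import math

def ll_cost(n):
    # O(n^2 log n log log n) cost model for a full LL test on exponent n
    return n * n * math.log(n) * math.log(math.log(n))

ratio = ll_cost(33219281) / ll_cost(6500000)
print(round(ratio, 2))  # roughly 30, matching the estimate above
```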
-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/




Re: Mersenne: Prime95 on linux, a tech question

1999-06-07 Thread Nick Craig-Wood

On Sun, Jun 06, 1999 at 10:58:13AM -0700, [EMAIL PROTECTED] wrote:
> First of all put the following at the bottom of your /etc/rc.d/rc.local
> file:
> 
> /Your/path/to/mprime/mprime &

I'd recommend starting it as a particular user not as root.  I like to
generalise this and give the user a script file to run at boot, eg put
this at the end of rc.local

# start the users' background jobs off

if [ -f ~user/.rc.local ]; then
    su - user -c "./.rc.local" &
fi

And then something like this into ~user/.rc.local

#!/bin/sh
(cd ~/mprime ; ./mprime -d >>out.txt 2>&1 &)

> To start it from the command line do this:
> 
> nohup /Your/path/to/mprime/mprime &

Nohup shouldn't be necessary with the & if you are using a modern
shell (eg bash), though it certainly is necessary on some old /bin/sh's.

-- 
|- Nick Craig-Wood -- http://www.axis.demon.co.uk/ -- [EMAIL PROTECTED] -|




Re: Mersenne: Java Prime and lotsa questions

1999-05-05 Thread Nick Craig-Wood

In news:local.mersenne on Tue 04 May, Brian J Beesley wrote:
> I'm still "rather hazy" (read, totally baffled) on exactly how  the
> Crandall-Fagin procedure works, so I don't know if you could  adapt NTT to
> do the mod 2^p-1 automatically. But my gut feeling is  that it ought to be
> possible.

It is possible.  That is how my StrongARM code works.  The maths for the
irrational base transform all works out provided the relevant fractional
powers of 2 exist in the field you are using (ie for a transform of length n
there must exist a number x such that x^n = 2).  Whether this is the complex
field or some other Galois field doesn't matter.  This puts some
constraints on which modulus you can use...  My code uses GF(2^64-2^32+1)
which works for DWTs up to 2^26 bits (67 million).  The 'interesting' binary
representation of this number makes for lots of short cuts when taking the
modulus also.
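A sketch of the kind of shortcut meant (hypothetical code, not the actual StrongARM implementation): since 2^64 = 2^32 - 1 and 2^96 = -1 (mod 2^64-2^32+1), a 128-bit product folds down with shifts and adds rather than a division.

```python
P = 2**64 - 2**32 + 1
MASK32 = 2**32 - 1
MASK64 = 2**64 - 1

def mulmod(a, b):
    t = a * b                          # 128-bit product
    lo, hi = t & MASK64, t >> 64
    h0, h1 = hi & MASK32, hi >> 32
    # hi*2^64 = h1*2^96 + h0*2^64 = -h1 + h0*(2^32 - 1)  (mod P)
    r = lo + h0 * MASK32 - h1
    return r % P  # a real implementation finishes with conditional subtracts
```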

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/





Re: Mersenne: Integer FFT

1998-12-21 Thread Nick Craig-Wood

Bill Daly wrote:

> The usual concept of an integer FFT is to choose a prime q such that q-1
> is divisible by a reasonably large power of 2, say N = 2^n, find a
> primitive (integer) N-th root of 1 mod q, say w, then use w and
> arithmetic mod q to calculate the FFT. If it also happens that there is
> an integer N-th root of 2 mod q, then the FFT can be converted into a
> DWT suitable for implementing multiplication mod 2^p - 1 for some large
> prime p. For example, with q = 2^64 - 2^32 + 1, we have both roots for N
> up to 2^26, and for that matter we can also use an FFT/DWT of order N =
> 5*2^n up to 5*2^26. However, this requires calculating products of the
> form a*b mod q where a and b are 64-bit integers, which, while
> straightforward, will probably take more time than the corresponding
> complex floating-point multiply in the ordinary FFT/DWT. 

That depends on the CPU.  On StrongARM it is much quicker to do a 64*64
integer multiply because it doesn't have an FPU...

> It is also hard to find appropriate primes q for which this works.

Agreed.

> Richard Crandall has proposed implementing an FFT in the Galois field
> GF(q^2), where q is itself a Mersenne prime, e.g. q = 2^61 - 1. He calls
> this the fast Galois transform (FGT). The idea is that for any prime q =
> 3 mod 4, the field of Gaussian integers a + b*i mod q is isomorphic to
> GF(q^2), thus we can simply replace complex real values with complex
> integers mod q. The multiplicative order of GF(q^2) is q^2-1, thus for q
> = 2^61 - 1, q^2 - 1 will be divisible by 2^62, thus we can find a
> complex integer w such that w^2^62 = 1. Also, since the order of 2 in
> GF(q^2) is 61, there will also be a value r such that r^2^62 = 2. We can
> use r and w to implement a complex integer DWT mod q, which requires
> code to add, subtract and multiply mod q. 

That is a very interesting idea.  Do you have any further references?

> However, we still have the problem that the calculation of a*b mod q is
> likely to take more time than we would really like.
> 
> I have a suggestion. The FGT requires only a prime q = 3 mod 4, for
> which q+1 is divisible by a reasonably large power of 2. Let's consider
> primes q = k*2^n - 1 which are slightly less than 2^32. The idea here is
> that this will require only 32-bit integer arithmetic. If we use two
> such primes, q1 and q2, and we calculate separately the two sets of
> convolution sums z1[0..N-1] mod q1 and z2[0..N-1] mod q2, then we can
> use the Chinese Remainder Theorem to calculate the convolution sums
> z[0..N-1] mod q1*q2. This is less difficult than it might seem, since we
> can precalculate values u1 and u2 such that u1 = 1 mod q1, u1 = 0 mod
> q2, u2 = 0 mod q1, u2 = 1 mod q2. Then if we have, from the two FGT
> calculations, values z1[j] and z2[j] such that z[j] = z1[j] mod q1 and
> z[j] = z2[j] mod q2, we have z[j] = u1*z1[j] + u2*z2[j] mod q1*q2, which
> is easily verified to be correct. This calculation can be further
> simplified because of the fact that u1 + u2 = 1 mod q1*q2 (in fact, u1 +
> u2 = q1*q2 + 1). Thus, (u1+u2)*(z1[j]+z2[j]) = z1[j]+z2[j] mod q1*q2,
> and z[j] = u1*z1[j] + u2*z2[j] = (z1[j] + z2[j] + (u1-u2)*(z1[j]-z2[j]))
> / 2 mod q1*q2. This requires only a single multiply
> (u1-u2)*(z1[j]-z2[j]) mod q1*q2, together with some shifts and adds mod
> q1*q2.
> 
> The rationale for this is that the efficacy of an integer FFT depends on
> the size of the modulus. If the modulus is slightly less than 2^32, then
> for N in the range used by GIMPS we may be able to use only 6-bit
> coefficients, whereas with a modulus slightly less than 2^64, in this
> case q1*q2, we will be able to use 22-bit coefficients, thereby
> increasing the range of validity by a factor of nearly 4, in exchange
> for using two FGTs instead of one and the Chinese Remainder step at the
> end of each iteration.

I looked into doing something similar for the StrongARM - not with an FGT but
with 2 or 3 ordinary 32-bit integer FFTs followed by a CRT step.  I came to
the conclusion that it wasn't worth it.

The main cost of the FFT isn't the multiplies, it is the time to get things
in and out of memory.  I wouldn't expect a 32-bit FFT to be much more than
twice the speed of a 64-bit FFT, and any advantage would get chewed up by the
CRT step.  That said, I expect the CRT can be combined with other parts of
the DWT, so maybe it wouldn't be that bad.  On the Pentium it may be a
different story entirely.
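For reference, Bill's single-multiply CRT recombination works out like this
(a Python sketch with small illustrative primes, not real transform moduli):

```python
q1, q2 = 10007, 10009          # small illustrative primes, not real FGT moduli
M = q1 * q2

# Precompute u1 = 1 mod q1, 0 mod q2 and u2 = 0 mod q1, 1 mod q2.
u1 = q2 * pow(q2, -1, q1) % M
u2 = q1 * pow(q1, -1, q2) % M
assert (u1 + u2) % M == 1      # in fact u1 + u2 == M + 1

inv2 = pow(2, -1, M)           # M is odd, so 2 is invertible

def crt(z1, z2):
    # One multiply (u1-u2)*(z1-z2) plus adds, as in the message above.
    return (z1 + z2 + (u1 - u2) * (z1 - z2)) * inv2 % M

z = 123456789 % M
assert crt(z % q1, z % q2) == z
```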

One particular problem with StrongARM is not having a divide instruction it
makes the modulo arithmetic rather tricky for arbitrary q and there aren't
that many good q's below 2^32.

I like very much the idea of using the FGT with 2^61-1 though!
-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Mersenne: Integer DWT

1998-10-27 Thread Nick Craig-Wood

Bill Daly wrote:

> This is very interesting. I did some experiments using q = 2^64 - 2^32 +
> 1 

Yes that is exactly the prime I am using.

> to see how a multiply mod q would stack up against a complex floating
> point multiply, and I found that on a 266Mhz Pentium II the
> multiplication mod q was about 15% faster. Unfortunately, I have lost
> both the code and the results (my laptop was stolen), but they are
> easily reconstructed. I didn't follow up on it because I couldn't see
> how to integrate the DWT with an integer FFT, but if you have found a
> way, then maybe it is in fact possible to implement an integer version
> of LL which is competitive with prime95.

It is all to do with the properties of p.  You've already discovered that you
can only Fourier transform with lengths that divide p-1.

p-1 factors as 2^32 * 3 * 5 * 17 * 257 * 65537.

This means that in practice an FFT can be defined of length up to 2^32, with
optional factors of 3 and 5.

If our transform length is n then to do a Discrete Weighted Transform we need
an n-th root of 2.  2 has order 192 mod p (ie 2^192 mod p = 1), so we can
have an n-th root of 2 for any n dividing

(p-1)/192 = (2^58 - 2^26) / 3 = 2^26 * 5 * 17 * 257 * 65537.

This means that we can do the DWT for lengths up to 2^26, with an optional
factor of 5.

This is extremely fortunate.  I did extensive searches for primes which
satisfy both the FFT and the DWT constraints, and this was the only one with
decent properties (for the FFT, the DWT, and mod p with no division).  It
would have been nice to have a factor of 3, but you can't have everything!

Some other random facts :-

  7 is a primitive root mod p
  
  An n-th root of unity can be generated by 7^(5*(p-1)/n) mod p.
  
  An n-th root of two can be generated by 7^(5*(p-1)/192/n) mod p
  
  So a suitable 5 * 2^26-th root of 1 is 0xED41D05B78D6E286 and the 5 *
  2^26-th root of 2 is 0xC47FC73D33F80E14
  
  Note that a 64-th root of unity is 8 mod p.
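These facts are easy to verify with a few lines of Python (just a check, not
the ARM implementation):

```python
p = 2**64 - 2**32 + 1

# 7 is a primitive root: 7^((p-1)/q) != 1 for every prime q dividing p-1
assert all(pow(7, (p - 1) // q, p) != 1 for q in (2, 3, 5, 17, 257, 65537))

# consistent with 2 having order 192; 8 = 2^3 is then a 64-th root of unity
assert pow(2, 192, p) == 1 and pow(2, 96, p) != 1
assert pow(8, 64, p) == 1 and pow(8, 32, p) != 1

# 7^(5*(p-1)/n) is an n-th root of unity, eg for n = 5 * 2^26
n = 5 * 2**26
w = pow(7, 5 * (p - 1) // n, p)
assert pow(w, n, p) == 1
```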

As for how you do the DWT, once you've got your n-th roots of 1 and 2 you
just proceed exactly as in the floating point version, but working in the
mod p field rather than the complex field.

Peter-Lawrence Montgomery helped me a great deal working this all out.  He
also worked out some great short cuts for doing 128 bit and 64 bit reductions
mod p.  I turned all that into lean ARM assembler!
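I won't reproduce Peter's exact short cuts here, but the standard trick
(sketched in Python below, with my own naming) falls out of the congruences
2^64 = 2^32 - 1 and 2^96 = -1 (mod p): split a 128-bit product into 32-bit
limbs and fold the high limbs back down.

```python
p = 2**64 - 2**32 + 1
MASK32 = 2**32 - 1

def reduce128(x):
    # x < 2^128, eg the result of a 64*64 multiply.
    x0 = x & MASK32
    x1 = (x >> 32) & MASK32
    x2 = (x >> 64) & MASK32
    x3 = x >> 96
    # 2^64 = 2^32 - 1 and 2^96 = -1 (mod p), so fold the top limbs down:
    r = x0 + (x1 << 32) + x2 * MASK32 - x3
    return r % p  # a real implementation replaces this with conditional adds

a = 0x123456789ABCDEF0
b = 0xFEDCBA9876543210
assert reduce128(a * b) == (a * b) % p
```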

> The particular form of q is what makes reduction mod q efficient.
[snip]

Note that if you choose your root of unity so that the 64-th root of unity is
8 then you can convert lots of the primitive operations into shifts rather
than multiplies.  This is a big win on the StrongARM.

> It is of course fortuitous that q is prime.

Fortuitous isn't the word I would have used - it took many days of CPU time
for me to fix on this prime!

Hopefully I will have the time to tidy up the code and release it to the
world soon...

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: AMD K7 will

1998-10-26 Thread Nick Craig-Wood

Foghorn Leghorn wrote:

> > The Playstation doesn't do floating point.  It would be a very slow
> > platform for running GIMPS. :)
> 
> Question: does Prime95 use floating point because the algorithm requires 
> it, or because something about the Intel architecture makes it a good 
> choice?

The algorithm doesn't require it but on most processors it is quickest to do
a floating point discrete weighted transform.

That said, I've written (with quite a bit of help from Peter LM) a DWT for
the StrongARM which runs without using floating point (the StrongARM doesn't
have an FPU).

The individual computations are a bit more expensive because of the need to
work modulo a 64-bit prime, which makes the StrongARM routine about 1/3 of
the speed of an equivalently clocked Pentium.

An integer DWT has a few advantages though: because it has no round-off
errors it can pack a few more bits into each element and use slightly smaller
FFT sizes.

> I'm not clear about how Prime95 is able to use floating point to  perform
> something that is theoretically an all-integer computation. 

Clever isn't it!  Every iteration prime95 checks that each floating point
number is within 0.4 of an integer.  If any value is outside this range it
flags an error.
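In outline the check looks something like this (a Python sketch with a
made-up function name - prime95 really does it in assembler over the
transform output):

```python
def check_roundoff(values, limit=0.4):
    # Each element should be very close to an integer; a large
    # fractional part means the FFT has lost too much precision.
    worst = max(abs(v - round(v)) for v in values)
    if worst > limit:
        raise ValueError(f"round-off error {worst} exceeds {limit}")
    return worst

check_roundoff([3.0001, -7.9998, 12.0])   # fine
# check_roundoff([3.45])                  # would raise ValueError
```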

> Is there any platform on which an all-integer implementation would be 
> superior to one that uses floating point?

Any processor without an FPU.  Like StrongARM or Sony Playstation...

Assuming that MIPS codes equivalently to ARM, a 33 MHz MIPS (in the
Playstation) is going to come out at about the equivalent of a 10 MHz Pentium
for GIMPS use.  That probably isn't going to be worth your while,
unfortunately.
-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: Why isn't computing x**2 faster than x*y?

1998-09-25 Thread Nick Craig-Wood

Luke Welsh wrote:

> Aside from the fact the we need to do a single FFT instead
> of two, why isn't squaring inherently faster than multiplying?

You mean two transforms instead of three, but I take your point.  This is
something I've spent many hours thinking about.  Everything about squaring
screams that it ought to be easier than general multiplication, especially
the fact that mod multiplies map from and to the same set.

However, the best I've seen for squaring is a factor of two improvement over
the general multiply case.  I don't know if anyone has proved anything here.
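The factor of two is easy to see in the schoolbook case (a small Python
sketch): in the convolution for a square, every cross term a_j*a_k with
j != k appears twice, so about half of the scalar multiplies can be shared.

```python
def square_conv(a):
    # Coefficients of (sum a_j x^j)^2, exploiting symmetry:
    # about n*(n+1)/2 multiplies instead of n^2.
    n = len(a)
    c = [0] * (2 * n - 1)
    for j in range(n):
        c[2 * j] += a[j] * a[j]
        for k in range(j + 1, n):
            c[j + k] += 2 * a[j] * a[k]
    return c

assert square_conv([1, 2, 3]) == [1, 4, 10, 12, 9]  # (1 + 2x + 3x^2)^2
```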

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/



Re: Mersenne: Fast shifting on RISC

1998-09-16 Thread Nick Craig-Wood


In news:local.mersenne> on Wed 16 Sep, Bojan Antonovic wrote:
> > The PII is able to shift two numbers by n positions in one step. The PII is
> > a Post-RISC processor, ie it is able to execute 2 or 3 instructions per
> > cycle and each basic instruction takes 1 cycle. The CISC and RISC
> > architectures are both obsolete.
>
> I meant with step 1 cycle (for calculation only).

The ARM is like this.  The ARM is a 3-operand RISC, and you may shift or
rotate one of the operands left or right by any amount with no extra
execution time using the barrel shifter.  This is one of the reasons assembly
programmers like the ARM - you can be extremely clever if you like,
especially given that you can make all instructions conditional.

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
http://www.axis.demon.co.uk/