Re: Mersenne: Re: Large memory pages in Linux
On Wed, Dec 24, 2003 at 07:46:32AM +, Brian J. Beesley wrote:
> On Tuesday 23 December 2003 20:15, Matthias Waldhauer wrote:
> >
> > Last friday I read some messages about recent kernel modifications and
> > patches for version 2.6.0. There is an "implicit_large_page" patch,
> > allowing applications to use large pages without modifications. I don't
> > have the time to dig into it :(
>
> Sure. This is a much better approach than mucking about with
> application-specific modifications which would likely involve serious
> security hazards (leaking kernel privileges to the application) and/or clash
> with other applications' private large-page code and/or large page enabled
> kernels in the future.

It's also a much more invasive approach - changing something as fundamental as the kernel's page size (and the assumption that it's constant) is very hard work - don't expect it in 2.6!

As for security hazards - the hugepage implementation in 2.6 has been very well thought out. No extra privileges are required. The superuser must:

  a) allocate some huge pages
  b) mount the hugetlbfs filesystem
  c) use standard unix permissions so the right user(s) can use hugetlbfs

The permissioned user can then allocate huge pages using mmap out of hugetlbfs.

My particular hack uses a couple of LD_PRELOAD libraries, which doesn't require patching the application, eg

  LD_PRELOAD="`pwd`/intercept.so `pwd`/alloc.so" ./mprime -m

So you can use huge pages with any application that is dynamically linked. This seems pretty secure and non-invasive to me ;-)

(I will shortly put full instructions and code on a web page - family duties over christmas permitting!)

--
Nick Craig-Wood [EMAIL PROTECTED]

_
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: Re: Large memory pages in Linux
On Tue, Dec 23, 2003 at 09:15:46PM +0100, Matthias Waldhauer wrote:
> For a TLB aware client like Prime95 the improvements are small as
> expected. Kudos to George ;-)
> But I'm sure that speedups greater than ~2.5% are possible by
> modifying the FFTs to make optimal use of large pages.

I'd love to have George's opinion on this.

> For most FFT sizes currently in use the available number of 2/4MB
> TLB entries cover the complete working set. But for larger sizes it
> is again necessary to take care of TLB entries.

My PII laptop has 32 TLB entries and 4 MB pages, so that's 128 MB of "flat" memory, which is really quite a large FFT!

> Last friday I read some messages about recent kernel modifications
> and patches for version 2.6.0. There is an "implicit_large_page"
> patch, allowing applications to use large pages without
> modifications. I don't have the time to dig into it :(

Do you mean wli's superpage patches? I think he is aiming those for 2.7. They sound very interesting but will require extensive reworking of the kernel's internal assumption that a page is a constant size.

--
Nick Craig-Wood [EMAIL PROTECTED]
Mersenne: Large memory pages in Linux
On Wed, Dec 10, 2003 at 06:26:54PM +, Nick Craig-Wood wrote:
> I'm in the process of writing (not quite finished or working ;-) some
> code which you load as an LD_PRELOAD library under linux. This gets
> its fingers into the memory allocation, and makes all malloc space
> come from hugetlbfs (how you get large pages under linux).
>
> My primary user for this was to be mprime of course!

Well I finished the code and here are the results on my lowly laptop running 2.6.0.

  Intel(R) Pentium(R) III processor
  CPU speed: 550.78 MHz
  CPU features: RDTSC, CMOV, PREFETCH, MMX, SSE
  L1 cache size: 16 KB
  L2 cache size: 256 KB
  L1 cache line size: 32 bytes
  L2 cache line size: 32 bytes
  TLBS: 64
  Prime95 version 22.12, RdtscTiming=1

Normal --

  Best time for 256K FFT length:   80.256 ms.
  Best time for 320K FFT length:  101.820 ms.
  Best time for 384K FFT length:  125.191 ms.
  Best time for 448K FFT length:  145.505 ms.
  Best time for 512K FFT length:  161.178 ms.
  Best time for 640K FFT length:  215.113 ms.
  Best time for 768K FFT length:  258.055 ms.
  Best time for 896K FFT length:  304.786 ms.
  Best time for 1024K FFT length: 345.747 ms.
  Best time for 1280K FFT length: 449.540 ms.
  Best time for 1536K FFT length: 541.963 ms.
  Best time for 1792K FFT length: 661.651 ms.

With all memory allocations coming from 4 MB pages --

  Best time for 256K FFT length:   79.293 ms.   1.2%
  Best time for 320K FFT length:  102.032 ms.  -0.2%
  Best time for 384K FFT length:  124.022 ms.   0.9%
  Best time for 448K FFT length:  145.492 ms.   0.0%
  Best time for 512K FFT length:  161.568 ms.  -0.2%
  Best time for 640K FFT length:  213.311 ms.   0.8%
  Best time for 768K FFT length:  254.609 ms.   1.3%
  Best time for 896K FFT length:  301.911 ms.   0.9%
  Best time for 1024K FFT length: 339.203 ms.   1.9%
  Best time for 1280K FFT length: 439.119 ms.   2.3%
  Best time for 1536K FFT length: 531.422 ms.   1.9%
  Best time for 1792K FFT length: 645.350 ms.   2.5%

So consistent but small improvements in the larger FFTs.

This just goes to show what a good job George has done in not thrashing the TLB! I wonder if Prime95 could be made more efficient if it didn't have to worry about the TLB? It's obviously detecting the TLB slots for this computer, which is wrong in this case - perhaps there is a way of overriding this?

Please email me if you'd like to experiment with the code - it's quite simple (it just took rather a lot of different approaches to get right!). You'll need to be running 2.6.0 with HUGETLB support if you want to play (see hugetlbpage.txt in Documentation in the kernel source for more info).

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: Large memory pages in Linux and Windows Server (64bit?)
On Wed, Dec 10, 2003 at 02:50:11PM +0100, Matthias Waldhauer wrote:
> Back to the topic:
> Some time ago there was a discussion going on regarding the use of large
> memory pages. In a mersenneforum thread I collected some info regarding
> new linux kernels and some real world results published in a paper.
>
> Here some extracts:
> Linux kernel versions 2.5.36+ and 2.6 include a "HugeTLBs" patch, which
> allows an application to allocate large memory pages.
> Also 64bit Windows Server seems to support them too:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/large_page_support.asp
>
> My thoughts about the possibilities:
> Oracle managed to get an 8% speedup by using the large pages. Although I
> have little experience in this area I think for FFTs the speedup will be
> much larger
[snip]

I agree! I have been thinking about the exact same thing.

I'm in the process of writing (not quite finished or working ;-) some code which you load as an LD_PRELOAD library under linux. This gets its fingers into the memory allocation, and makes all malloc space come from hugetlbfs (how you get large pages under linux).

My primary user for this was to be mprime of course!

When finished this should be an easy way of trying any application with large pages without having to modify it.

I haven't finished the code yet, but it shouldn't take long, especially if people are asking for it!

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: Re: mprime, linux and 2 MB pages
On Tue, Mar 19, 2002 at 10:03:50PM +0100, Steinar H. Gunderson wrote:
> Paste from gwnum.c, Prime95 v19:
>
> /* Well I implemented the above only to discover I had dreadful */
> /* performance in pass 1. How can that be? The problem is that each */
> /* cache line in pass 1 comes from a different 4KB page. Therefore, */
> /* pass 1 accessed 128 different pages. This is a problem because the */
> /* Pentium chip has only 64 TLBs (translation lookaside buffers) to map */
> /* logical page addresses into physical addresses. So we need to shuffle */
> /* the data further so that pass 1 data is on fewer pages while */
> /* pass 2 data is spread over more pages. */
>
> So, it might be that due to TLB thrashing, George would have to
> choose a less efficient memory layout to avoid them, and thus get
> lower speed overall.

I think 2 MB pages will be a win for any memory layout using < 128 MB of RAM (64 TLB entries * 2 MB pages).

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: mprime, linux and 2 MB pages
On Tue, Mar 19, 2002 at 08:11:54PM +, Brian J. Beesley wrote:
> Ah, but ... frequently accessing pages (virtual _or_ physical) will
> keep the TLB pages from getting too far away from the processor;
> probably at worst they will stay in the L1 cache.

Yes.

> The overhead of accessing from L1 cache is small compared with the
> overhead of accessing data from main memory, and _tiny_ compared
> with the overhead of accessing data from the page/swap file.

Yes... but a TLB miss costs one or maybe two extra memory cycles. I.e. it halves the performance if you are missing on every fetch.

> Don't you _need_ to have at least enough TLB entries to map the
> whole of the processor cache? (Since without it you can't map the
> cache table entries...)

Hmm, not sure. The processor cache isn't directly mapped into memory, so it doesn't need TLB entries. Depending on the architecture the cache will either cache physical addresses or logical addresses.

[snip]

> Whatever effect TLB thrashing may or may not be having, it doesn't look as
> though it's having a dominant effect on mprime.

Very true, and that is a testament to George's programming skills. A naively programmed FFT will demonstrate TLB thrashing admirably!

> > I think this would make a real difference to mprime - what percentage
> > I don't know - at the cost of on average 1 MB of RAM extra.
>
> I wouldn't mind _doubling_ the memory footprint, if we got a _significant_
> performance boost as a consequence.

Me neither. I'd like to try it but I need to persuade some friendly kernel hacker to implement it for me ;-) I would guess it might make at most 10% difference to the run time.

> BTW why does this argument apply only to mprime? Surely Windows has the same
> underlying architecture - though obviously it's harder to get the Windows
> kernel changed than linux.

It applies exactly the same to Windows of course. It's just that you can chat with the Linux kernel developers on the mailing list ;-)

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: mprime, linux and 2 MB pages
On Mon, Mar 18, 2002 at 02:12:48PM +, Brian J. Beesley wrote:
> On Monday 18 March 2002 10:21, Nick Craig-Wood wrote:
> > There has been some discussion on the linux kernel mailing list about
> > providing 2 MB pages (instead of 4kB ones) to user space for the use
> > of database or scientific calculations.
> >
> > It seems to me that prime95/mprime would benefit from this enormously
> > - it should reduce the TLB thrashing to practically zero and hence
> > speed up mprime by some unknown amount.
> >
> > Is this true? Should we put in our plea to the developers?
>
> Other people may and probably will disagree, but I think it will make very
> little if any difference for most applications.

For most applications yes - however it would be configurable in a per-process manner. Possibly each process might have to take special action to get 2 MB pages - say a new flag to mmap().

> The point is that mprime should normally be running on a system in a way
> which means that all its active data pages are in memory. Having active data
> paged out will cause a hideous performance hit.
>
> If the active data is already memory resident, TLB thrashing is not going to
> be an issue.

The TLB (translation lookaside buffer) has very little to do with the virtual memory system. The TLB is used by the processor to cache the address translations from logical memory to physical memory. These have to be read from the page tables in RAM, which is expensive - hence the cache.

When I was working on a DWT implementation for StrongARM I found that thrashing the TLB caused a factor of two slowdown. The StrongARM system I was using had no virtual memory.

If mprime is using 10 MB of memory, say, then each page needs 1 TLB entry to be used at all by the processor - ie 2560 TLB entries, which is way bigger than the size of the TLB in the processor (I don't remember what it is in x86, but on StrongARM it has 32 entries). To access each physical page the TLB has to be reloaded from the page tables, which is an extra memory access or two. If you use 2 MB pages then there are only 5 pages needed, hence the TLB will never need to be refilled, and hence some speed gain.

> If the page size is going to be changed at all, there is a lot to be
> said for using the same size pages as AGP hardware - 4MB I think -
> there have already been some issues on some Athlon (K7) architecture
> linux systems caused by incorrect mapping between linux virtual
> pages and AGP address space; obviously using the same page size
> removes this source of confusion.

The choice of page size is constrained by the memory management hardware. I think 4k and 2 MB are the only sensible choices, but I may be wrong.

> One factor with shifting to a much larger page size is a
> corresponding decrease in the number of pages available to the
> system - a 32 MByte system will have only 8 4MB pages resident in
> real memory at any one time. Since page access rules are often used
> to protect data from accidental modification by rogue pointers etc.,
> a big reduction in system physical page count is a distinctly mixed
> blessing.

The proposal was that you would be able to turn on 2 MB pages for a given process - obviously you wouldn't want 2 MB pages for every process unless you had > 100 GB of RAM.

I think this would make a real difference to mprime - what percentage I don't know - at the cost of on average 1 MB of RAM extra.

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: NTT faster than FFT ?
On Sun, Mar 17, 2002 at 08:08:06PM -0500, Jason Papadopoulos wrote:
> If you want to explore your idea further, here are some primes that may
> be useful. All are of the form i*2^e+1, and have the advantage that roots
> of unity out to some small order are "simple":
>
>  i                  exponent  FFT radix  DWT size  root
>  007d43548abc4361     320         8       2^17      67
>  04235deab8321000     512         8       2^22      19
>  0051656f7fe405c1    2048         8       2^19      29
>  07ab47a650572a01    2112         8       2^16      29
>
>  06954fe21e3e8100    38432                 N/A      37
>  290d741            179232                 N/A       5

These are big moduli? Surely this would make it quite inefficient?

I spent a long time looking at all the options when I made the StrongARM mersenne prime checker. For that architecture a modulus of 2^64-2^32+1 was the easy winner, given that StrongARM doesn't have floating point or a divide instruction but does have a quick 32x32->64bit multiply. More info here:

  http://www.axis.demon.co.uk/armprime/

For other architectures (like alpha) having the top bit set in the 64 bit modulus was a pain.

> My personal opinion is that NTTs will not be faster than FFTs until modular
> multiplication has the same throughput as floating point multiplication.
> Right now we're at least a factor of 3-5 away from that.

I agree with you, and this was my experience with StrongARM. The integer DWT was about 3 times slower than the equivalent pentium implementation.

--
Nick Craig-Wood [EMAIL PROTECTED]
Mersenne: mprime, linux and 2 MB pages
There has been some discussion on the linux kernel mailing list about providing 2 MB pages (instead of 4kB ones) to user space for the use of database or scientific calculations.

It seems to me that prime95/mprime would benefit from this enormously - it should reduce the TLB thrashing to practically zero and hence speed up mprime by some unknown amount.

Is this true? Should we put in our plea to the developers?

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: slaying cpus with prime95
On Mon, Jan 14, 2002 at 02:45:21PM -0500, Steve Elias wrote:
> here are some instances where i have damaged computers
> by (capriciously?) running prime95!
>
> 1 - i just got my wife's toshiba laptop back from toshiba warranty
> service. running prime95 for ~6 months on it caused the fan to die,
> and then the laptop would overheat & shutdown even without prime95
> running. apparently the heat caused lots of disk badblocks too.

Mmm, yes, the fan on my laptop (Dell 38000) sounds like someone dumped a load of grit in it - not good :-( I stopped running prime95 (mprime actually) on it for that reason.

--
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: Re: splitting up 10m digit primes
On Tue, Oct 19, 1999 at 12:28:52AM -0500, Ken Kriesel wrote:
> How would you extend this concept to P-1, ECM, and factoring save files?

As far as I'm aware prime95 is the only program to do P-1, ECM or factoring, so I wasn't intending that the save files should include these. Maybe they should?

> Would revisions to the standard format be handled within the 8-byte
> Ascii file identifier, or in a separate field?

There was a 4 byte number for revisions at the start of the header.

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/

_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: Re: splitting up 10m digit primes
On Sun, Oct 17, 1999 at 03:21:49PM +0100, Brian J. Beesley wrote:
> On 17 Oct 99, at 8:22, Nick Craig-Wood wrote:
> > I'd like to propose for discussion a simple cross-platform save file format
> > as below.
> >
> > Width  Value       Description
> >
> > 8      "MersSave"  in ASCII as a file identifier
> > 4      int32       version number of file
> > 16     ASCII       space filled name of creating program
>
> I want this longer, & to include version major/minor/release & build
> numbers as well. 32 bytes?

Yes, sounds like a good idea. Perhaps we should formalise the version number like this?

  1  int8  Major
  1  int8  Minor
  1  int8  Micro
  1  int8  Patch level

> > 4      int32       exponent being tested (n)
> > 4      int32       iteration that this residue represents
> > 4      int32       length of platform specific area x words
> > 4x     int32       platform specific
>
> e.g. George's programs would want to store the offset here.

Ah, offset! I wrote this before offset was introduced. This would need to be part of the save file I think, because different offset => different save file data. I think offset would fit into an int32?

> Would it be too much to ask to have an int32 count of "platform
> specific" bytes following in a variable length field here? Some
> platforms will require none (just a zero count), others might want to
> embed a (currently unknown) vector of arbitrary length at this point.

That was what I intended by the last two entries above, but I may not have been clear enough.

  4   int32  length of platform specific area x words
  4x  int32  platform specific (x words of data may be 0 length)

> > len    int32       binary data for the residue
> >                    len = ((n + 0x1F) & ~0x1F) >> 3
> > 4      int32       CCITT CRC-32 of the entire file
> >
> > All integers should be represented as little-endian in the file (the majority
> > of the clients are little-endian). The residue should be little endian with
> > the unused 1-31 bits at the end set to 0.
>
> Since the "majority of clients" which are "little endian" are
> overwhelmingly IA32 processors, which have byte reversal instructions
> specifically designed to do the job, I'd like to propose that the
> integers be stored in "big-endian" format. Or, alternatively, that we
> devote _two_ bits of the file version number - both the MS and LS
> bits - to flags, set to 0 for little-endian format or 1 for big-
> endian format. Still leaves 30 bits for file version number info, and
> allows clients to write whatever format is easiest. Write operations
> outnumber read operations by a big margin!

This is sensible. I like the big/little endian flag idea.

I've tried to design the header so that it can be represented by a C struct and just fwrite()n straight to the file. Having a big/little endian flag is an excellent idea here.

Also it might just be better to store the actual numeric data as int8s rather than int32s - this way the endianness of the processors doesn't matter.

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: Re: splitting up 10m digit primes
On Sun, Oct 17, 1999 at 02:13:33AM +0200, Steinar H. Gunderson wrote:
> On Sat, Oct 16, 1999 at 03:46:30PM +0100, Chris Jefferson wrote:
> > However, proving that they have used the
> > interim file generated by this particular program might be difficult...
>
> Not very. Save files are incompatible from program to program. Or at least
> so I think...

Some time ago I proposed a standard format for save files. It also had the advantage that it only saved the actual bits used, so would make the save files 1 MByte rather than 1.8 MByte for an 8,000,000 exponent in prime95.

> Converting a savefile is also going to be a lot of work :-)

It needs to be done exactly right - hence by the program which generated it, probably.

Here is what I wrote:

Reading in other threads about save files, it seems to me that it would be worth while standardising the format of save files so that they can be ported across platforms.

At the moment I think both Prime95 and the unix programs write the floating point values of the FFT array to a file, which is inherently non-portable across machine architectures. It also makes the save files quite a bit larger than necessary (an important consideration if we ever want to hold a central repository of these files).

I'd like to propose for discussion a simple cross-platform save file format as below.

  Width  Value       Description
  8      "MersSave"  in ASCII as a file identifier
  4      int32       version number of file
  16     ASCII       space filled name of creating program
  4      int32       exponent being tested (n)
  4      int32       iteration that this residue represents
  4      int32       length of platform specific area x words
  4x     int32       platform specific
  len    int32       binary data for the residue
                     len = ((n + 0x1F) & ~0x1F) >> 3
  4      int32       CCITT CRC-32 of the entire file

All integers should be represented as little-endian in the file (the majority of the clients are little-endian). The residue should be little endian with the unused 1-31 bits at the end set to 0.

A file format like this would have the following advantages:

 1) You could take a save file from one platform/program to another (via a central repository at primenet maybe)
 2) You could take a save file from one FFT size to another (imagine a software upgrade where the FFT limits have changed)
 3) They are much smaller (eg 600k vs 2M)

It would have the disadvantages:

 1) Slightly slower and more complex to read and write. However this is an infrequent operation. (That said, time to write a save file may be IO limited, in which case a shorter save file will be quicker.)
 2) We've all got to agree on it ;-)

Discuss!

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: gzipped binary expansion of the largest prime
On Wed, Oct 13, 1999 at 11:10:29AM -0400, Darxus wrote:
> Can somebody give me the last few digits of the decimal expansion of
> (2^6972593)-1 so that I can verify my copy's intact ?

gp (also known as pari) is a good tool for this...

  $ gp
  GP/PARI CALCULATOR Version 2.0.17 (beta)
  [snip]
  (08:21) gp > ((2^6972593)-1)%10^10
  %1 = 2924193791

This took no time at all to calculate. Whereas calculating the complete decimal expansion took ages...

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: complaint
On Wed, Sep 15, 1999 at 07:42:53PM +0100, Brian J. Beesley wrote:
> I'm going to be _very_ interested in how many people choose to run
> $100,000 prize candidates using v19. There is an obvious balance
> between the time it takes to complete a test & the enthusiasm of
> people to participate, even if there is a substantial cash prize
> riding on it.

If one had the data then it would be reasonably easy to calculate the expectation (in $ per year) for a given exponent size & CPU. You'd need to know the iteration time for any given exponent for that CPU and the probability distribution of mersenne primes at that exponent size. Unfortunately I don't know these, otherwise I would have worked it out ;-)

(Digression: though expectation is a useful statistical measure, in real life where very small probabilities are multiplied by very large amounts the results are not useful practically. Eg the UK lottery - for your 1 pound stake your expectation is 50p. Better to go to the bookies, where for your pound you get an expectation of 95p...)

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: Down With The Evil Thread
On Tue, Jul 27, 1999 at 04:28:41PM +, Henrik Olsen wrote:
> US:     million 10^6, billion 10^9, trillion 10^12 ...
>
> Non-US: million 10^6, milliard 10^9, billion 10^12,
>         billiard 10^15, trillion 10^18 ...

Actually I think most UK people would use the US definitions nowadays (except maybe *really* old people ;-) I certainly would...

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: Re: Mersenne Digest V1 #569
On Mon, Jun 07, 1999 at 04:37:32PM -0500, JON STRAYER wrote:
> Am I correct in thinking that since the exponent is five times the
> size of current exponents that the LL test will take 25 times as
> long (given the same processor)?
>
> > You need to test exponent (10^7 / log 2) = 33219281 or above.

Time for an n bit multiply using DWT is O(n log n), plus some fiddly bits to do with decreasing precision available when using longer convolutions - I'd probably estimate this as O(log log n). As it takes n iterations for each test, testing an n bit number using the Lucas-Lehmer test with DWT takes O(n^2 . log n . log log n).

If we say that the current exponent is about 6.5 million, then we need to test one which is approx 5 times bigger. Using the above formula I make this 29.97 times longer for a 5 times increase in exponent size starting at 6.5 million.

Probably ;-)

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: Prime95 on linux, a tech question
On Sun, Jun 06, 1999 at 10:58:13AM -0700, [EMAIL PROTECTED] wrote:
> First of all put the following at the bottom of your /etc/rc.d/rc.local
> file:
>
>   /Your/path/to/mprime/mprime &

I'd recommend starting it as a particular user, not as root. I like to generalise this and give the user a script file to run at boot, eg put this at the end of rc.local:

  # start the users' background jobs off
  if [ -f ~user/.rc.local ]; then
      su - user -c ./.rc.local &
  fi

And then something like this into ~user/.rc.local:

  #!/bin/sh
  (cd ~/mprime ; ./mprime -d >>out.txt 2>&1 &)

> To start it from the command line do this:
>
>   nohup /Your/path/to/mprime/mprime &

nohup shouldn't be necessary with the & if you are using a modern shell (eg bash), though it certainly is necessary on some old /bin/sh's.

--
|- Nick Craig-Wood -- http://www.axis.demon.co.uk/ -- [EMAIL PROTECTED] -|
Re: Mersenne: Java Prime and lotsa questions
In news:local.mersenne on Tue 04 May, Brian J Beesley wrote:
> I'm still "rather hazy" (read, totally baffled) on exactly how the
> Crandall-Fagin procedure works, so I don't know if you could adapt NTT to
> do the mod 2^p-1 automatically. But my gut feeling is that it ought to be
> possible.

It is possible. That is how my StrongARM code works.

The maths for the irrational base transform all works out provided the relevant fractional powers of 2 exist in the field you are using (ie for a transform of length n there must exist a number x such that x^n = 2). Whether this is the complex field or some other Galois field doesn't matter. This puts some constraints on which modulus you can use...

My code uses GF(2^64-2^32+1), which works for DWTs up to 2^26 bits (67 million). The 'interesting' binary representation of this number also makes for lots of short cuts when taking the modulus.

--
Nick Craig-Wood [EMAIL PROTECTED] http://www.axis.demon.co.uk/
Re: Mersenne: Integer FFT
Bill Daly wrote: > The usual concept of an integer FFT is to choose a prime q such that q-1 > is divisible by a reasonably large power of 2, say N = 2^n, find a > primitive (integer) N-th root of 1 mod q, say w, then use w and > arithmetic mod q to calculate the FFT. If it also happens that there is > an integer N-th root of 2 mod q, then the FFT can be converted into a > DWT suitable for implementing multiplication mod 2^p - 1 for some large > prime p. For example, with q = 2^64 - 2^32 + 1, we have both roots for N > up to 2^26, and for that matter we can also use an FFT/DWT of order N = > 5*2^n up to 5*2^26. However, this requires calculating products of the > form a*b mod q where a and b are 64-bit integers, which, while > straightforward, will probably take more time than the corresponding > complex floating-point multiply in the ordinary FFT/DWT. That depends on the CPU. On StrongARM it is much quicker to do a 64*64 integer multiply because it doesn't have an FPU... > It is also hard to find appropriate primes q for which this works. Agreed. > Richard Crandall has proposed implementing an FFT in the Galois field > GF(q^2), where q is itself a Mersenne prime, e.g. q = 2^61 - 1. He calls > this the fast Galois transform (FGT). The idea is that for any prime q = > 3 mod 4, the field of Gaussian integers a + b*i mod q is isomorphic to > GF(q^2), thus we can simply replace complex real values with complex > integers mod q. The multiplicative order of GF(q^2) is q^2-1, thus for q > = 2^61 - 1, q^2 - 1 will be divisible by 2^62, thus we can find a > complex integer w such that w^2^62 = 1. Also, since the order of 2 in > GF(q^2) is 61, there will also be a value r such that r^2^62 = 2. We can > use r and w to implement a complex integer DWT mod q, which requires > code to add, subtract and multiply mod q. That is a very interesting idea. Do you have any further references? 
> However, we still have the problem that the calculation of a*b mod q is > likely to take more time than we would really like. > > I have a suggestion. The FGT requires only a prime q = 3 mod 4, for > which q+1 is divisible by a reasonably large power of 2. Let's consider > primes q = k*2^n - 1 which are slightly less than 2^32. The idea here is > that this will require only 32-bit integer arithmetic. If we use two > such primes, q1 and q2, and we calculate separately the two sets of > convolution sums z1[0..N-1] mod q1 and z2[0..N-1] mod q2, then we can > use the Chinese Remainder Theorem to calculate the convolution sums > z[0..N-1] mod q1*q2. This is less difficult than it might seem, since we > can precalculate values u1 and u2 such that u1 = 1 mod q1, u1 = 0 mod > q2, u2 = 0 mod q1, u2 = 1 mod q2. Then if we have, from the two FGT > calculations, values z1[j] and z2[j] such that z[j] = z1[j] mod q1 and > z[j] = z2[j] mod q2, we have z[j] = u1*z1[j] + u2*z2[j] mod q1*q2, which > is easily verified to be correct. This calculation can be further > simplified because of the fact that u1 + u2 = 1 mod q1*q2 (in fact, u1 + > u2 = q1*q2 + 1). Thus, (u1+u2)*(z1[j]+z2[j]) = z1[j]+z2[j] mod q1*q2, > and z[j] = u1*z1[j] + u2*z2[j] = (z1[j] + z2[j] + (u1-u2)*(z1[j]-z2[j])) > / 2 mod q1*q2. This requires only a single multiply > (u1-u2)*(z1[j]-z2[j]) mod q1*q2, together with some shifts and adds mod > q1*q2. > > The rationale for this is that the efficacy of an integer FFT depends on > the size of the modulus. If the modulus is slightly less than 2^32, then > for N in the range used by GIMPS we may be able to use only 6-bit > coefficients, whereas with a modulus slightly less than 2^64, in this > case q1*q2, we will be able to use 22-bit coefficients, thereby > increasing the range of validity by a factor of nearly 4, in exchange > for using two FGTs instead of one and the Chinese Remainder step at the > end of each iteration. 
I looked into doing something similar for the StrongARM, not with an FGT but with 2 or 3 ordinary 32-bit integer FFTs followed by a CRT step. I came to the conclusion that it wasn't worth it. The main cost of the FFT isn't the time to multiply; it is the time to get things in and out of memory. I wouldn't have thought a 32-bit FFT would be much more than twice the speed of a 64-bit FFT, and I would expect any advantage to get chewed up by the CRT. That said, I expect you can combine the CRT with other bits of the DWT, so maybe it wouldn't be that bad. On the Pentium it may be a different story entirely.

One particular problem with the StrongARM is not having a divide instruction, which makes the modulo arithmetic rather tricky for arbitrary q, and there aren't that many good q's below 2^32.

I like very much the idea of using the FGT with 2^61-1 though!

-- 
Nick Craig-Wood [EMAIL PROTECTED]
http://www.axis.demon.co.uk/
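For what it's worth, the recombination step Bill describes is easy to prototype. The Python sketch below (my own illustration; the two primes are an arbitrary choice near 2^31, not a recommendation) builds the idempotents u1, u2 and checks the one-multiply identity z = (z1 + z2 + (u1-u2)*(z1-z2)) / 2 mod q1*q2.

```python
# CRT recombination of residues mod two odd primes, using the
# single-multiply identity quoted above.  Division by 2 becomes a
# multiply by the precomputed inverse of 2 mod q1*q2.
# Requires Python 3.8+ for pow(x, -1, m) modular inverses.

q1, q2 = 2147483647, 2147483629   # both prime; q1 = 2^31 - 1
M = q1 * q2

u1 = q2 * pow(q2, -1, q1) % M     # u1 = 1 mod q1, u1 = 0 mod q2
u2 = q1 * pow(q1, -1, q2) % M     # u2 = 0 mod q1, u2 = 1 mod q2
inv2 = pow(2, -1, M)

def crt(z1, z2):
    """Recover z mod q1*q2 from z1 = z mod q1 and z2 = z mod q2 with
    one full-width multiply: z = (z1 + z2 + (u1-u2)*(z1-z2)) / 2."""
    return (z1 + z2 + (u1 - u2) * (z1 - z2)) * inv2 % M
```

The identity follows from z = u1*z1 + u2*z2 and u1 + u2 = 1 mod q1*q2, exactly as derived in the quoted message.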
Mersenne: Integer DWT
Bill Daly wrote:
> This is very interesting. I did some experiments using q = 2^64 - 2^32 + 1

Yes, that is exactly the prime I am using.

> to see how a multiply mod q would stack up against a complex floating
> point multiply, and I found that on a 266MHz Pentium II the
> multiplication mod q was about 15% faster. Unfortunately, I have lost
> both the code and the results (my laptop was stolen), but they are
> easily reconstructed. I didn't follow up on it because I couldn't see
> how to integrate the DWT with an integer FFT, but if you have found a
> way, then maybe it is in fact possible to implement an integer version
> of LL which is competitive with prime95.

It is all to do with the properties of p. You've discovered already that you can only Fourier transform with lengths that divide p-1. The factors of p-1 are

  2^32 * 3 * 5 * 17 * 257 * 65537

This means that in practice an FFT can be defined of any length up to 2^32, with optional factors of 3 and 5.

If our transform length is n, then to do a Discrete Weighted Transform we need an n-th root of 2. 2 has order 192 mod p (ie 2^192 mod p = 1), so we can have a

  (p-1)/192 = (2^58 - 2^26)/3 = 2^26 * 5 * 17 * 257 * 65537

-th root of 2. This means that we can do the DWT for lengths up to 2^26, with an optional factor of 5.

This is extremely fortunate. I did extensive searches looking for primes which satisfy both the FFT and the DWT constraints, and this was the only one with decent properties (for FFT, DWT and mod p with no division). It would have been nice to have a factor of 3 as well, but you can't have everything!

Some other random facts:

  7 is a primitive root mod p.
  An n-th root of unity can be generated by 7^(5*(p-1)/n) mod p.
  An n-th root of two can be generated by 7^(5*(p-1)/(192*n)) mod p.

So a suitable 5*2^26-th root of 1 is 0xED41D05B78D6E286, and the 5*2^26-th root of 2 is 0xC47FC73D33F80E14.

Note that a 64-th root of unity is 8 mod p.
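Most of the facts above are one-liners to sanity-check with Python's arbitrary-precision integers. A few of them (my own checks, not part of the original post):

```python
# Sanity checks of the stated properties of p = 2^64 - 2^32 + 1.

p = 2**64 - 2**32 + 1

# p - 1 = 2^32 * 3 * 5 * 17 * 257 * 65537
assert p - 1 == 2**32 * 3 * 5 * 17 * 257 * 65537

# 2 has order 192 mod p: 2^96 = -1, hence 2^192 = 1, and
# 8 = 2^3 is a 64-th root of unity.
assert pow(2, 96, p) == p - 1
assert pow(2, 192, p) == 1
assert pow(8, 64, p) == 1

# With 7 a primitive root, 7^(5*(p-1)/n) is a primitive n-th root of
# unity for n a power of two dividing p - 1; take n = 2^26.
n = 2**26
w = pow(7, 5 * (p - 1) // n, p)
assert pow(w, n, p) == 1 and pow(w, n // 2, p) == p - 1
```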
As for how you do the DWT: once you've got your n-th roots of 1 and 2, you just proceed exactly as per the floating point version, only in the mod p field rather than the complex field.

Peter-Lawrence Montgomery helped me a great deal in working this all out. He also worked out some great short cuts for doing 128-bit and 64-bit reductions mod p. I turned all that into lean ARM assembler!

> The particular form of q is what makes reduction mod q efficient.
[snip]

Note that if you choose your root of unity so that the 64-th root of unity is 8, then you can convert many of the primitive operations into shifts rather than multiplies. This is a big win on the StrongARM.

> It is of course fortuitous that q is prime.

Fortuitous isn't the word I would have used; it took many days of CPU time for me to fix on this prime!

Hopefully I will have the time to tidy up the code and release it to the world soon...

-- 
Nick Craig-Wood [EMAIL PROTECTED]
http://www.axis.demon.co.uk/
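To give a flavour of what such a shortcut can look like: the congruences 2^64 = 2^32 - 1 and 2^96 = -1 (mod p) let a 128-bit product be folded down with shifts and adds. This Python sketch is my own illustration of the idea, not necessarily the exact reduction Montgomery derived:

```python
# Reduce a 128-bit value mod p = 2^64 - 2^32 + 1 using its special form.
# Split x = hi*2^96 + mid*2^64 + lo (lo is 64 bits; hi, mid 32 bits);
# then 2^64 = 2^32 - 1 and 2^96 = -1 (mod p) give
#   x = lo + mid*(2^32 - 1) - hi  (mod p)

p = 2**64 - 2**32 + 1

def reduce128(x):
    """Fold a 128-bit x down to x mod p with shifts, adds and one
    small correction at the end."""
    lo = x & (2**64 - 1)
    mid = (x >> 64) & (2**32 - 1)
    hi = x >> 96
    t = lo + (mid << 32) - mid - hi   # lo + mid*(2^32 - 1) - hi
    return t % p                      # final correction into [0, p)
```

In assembler the trailing `% p` would be a couple of conditional subtracts, since t is already close to the right range.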
Re: Mersenne: AMD K7 will
Foghorn Leghorn wrote:
> > The Playstation doesn't do floating point. It would be a very slow
> > platform for running GIMPS. :)
>
> Question: does Prime95 use floating point because the algorithm
> requires it, or because something about the Intel architecture makes
> it a good choice?

The algorithm doesn't require it, but on most processors it is quickest to do a floating point discrete weighted transform. That said, I've written (with quite a bit of help from Peter LM) a DWT for StrongARM which runs without any use of floating point (the StrongARM doesn't have an FPU). The individual computations are a bit more expensive because of the need to work modulo a 64-bit prime, which makes the StrongARM routine about 1/3 of the speed of an equivalently clocked Pentium. An integer DWT has a few advantages though: because it has no round-off errors it can pack a few more bits into each element and use slightly smaller FFT sizes.

> I'm not clear about how Prime95 is able to use floating point to
> perform something that is theoretically an all-integer computation.

Clever, isn't it! Every iteration prime95 checks each floating point number to see whether it is within +/- 0.4 of an integer. If it is outside this range then it flags an error.

> Is there any platform on which an all-integer implementation would be
> superior to one that uses floating point?

Any processor without an FPU, like the StrongARM or the Sony Playstation... Assuming that MIPS code works out equivalently to ARM code, a 33 MHz MIPS (in the Playstation) is going to come out at about a 10 MHz Pentium equivalent for GIMPS use. That probably isn't going to be worth your while, unfortunately.

-- 
Nick Craig-Wood [EMAIL PROTECTED]
http://www.axis.demon.co.uk/
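A minimal sketch of that round-off check (a hypothetical helper of my own, not prime95's actual code): round each floating-point element to the nearest integer and treat the distance as the round-off error, flagging anything at or beyond the threshold.

```python
# Round-off error check after a floating-point transform: every
# coefficient should be very close to an integer; its distance from
# the nearest integer is the accumulated round-off error.

def roundoff_error(coeffs, threshold=0.4):
    """Return the maximum round-off error, raising if any coefficient
    is at least `threshold` away from the nearest integer."""
    max_err = 0.0
    for x in coeffs:
        err = abs(x - round(x))
        max_err = max(max_err, err)
        if err >= threshold:
            raise ValueError(f"excessive round-off error {err}")
    return max_err
```

Real clients track this maximum error across iterations and switch to a larger FFT size when it creeps up.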
Re: Mersenne: Why isn't computing x**2 faster than x*y?
Luke Welsh wrote:
> Aside from the fact that we need to do a single FFT instead
> of two, why isn't squaring inherently faster than multiplying?

You mean two instead of three, but I take your point.

This is something I've spent many hours thinking about. Everything about squaring screams that it ought to be easier than general multiplication, especially the fact that mod multiplies map from and to the same set. However, the best I've seen for squaring is a factor of two improvement over the general multiply case. I don't know if anyone has proved anything here.

-- 
Nick Craig-Wood [EMAIL PROTECTED]
http://www.axis.demon.co.uk/
Re: Mersenne: Fast shifting on RISC
In news:local.mersenne> on Wed 16 Sep, Bojan Antonovic wrote:
> > The PII is able to shift two numbers by n positions in one step. The
> > PII is a Post-RISC processor, ie it is able to execute 2 or 3
> > instructions per cycle and each basic instruction takes 1 cycle. The
> > CISC and RISC architectures are both obsolete.
>
> I meant with step 1 cycle (for calculation only).

The ARM is like this. The ARM is a 3-operand RISC, and you may shift one of the operands left/right/rotate by any amount with no extra execution time, using the barrel shifter. This is one of the reasons assembly programmers like the ARM: you can be extremely clever if you like, especially given the fact that you can make all instructions conditional.

-- 
Nick Craig-Wood [EMAIL PROTECTED]
http://www.axis.demon.co.uk/