Re: Mersenne: Factoring
Chris Nash proposed:

> > We could let prime95 decide the next election.
> > Give everybody a different prime number. Multiply the primes for
> > candidate A together, likewise for B.
>
> And like in this election, unless we can completely factor the results,
> we wouldn't know who won, either.

No need to factor during primary elections.

_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: Factoring
Hi folks,

Off-topic I know, but...

> We could let prime95 decide the next election.
> Give everybody a different prime number. Multiply the primes for
> candidate A together, likewise for B.

And like in this election, unless we can completely factor the results, we wouldn't know who won, either. :)
Re: Mersenne: P4
Jason Papadopoulos wrote:

> > The alpha was already at least 5x faster than a PIII for multiprecision
> > arithmetic at the same clock speed; with the P4 it will only get worse.

Brian Beesley replied:

> Are you sure about this? I think, with Alpha, you have to execute the
> instruction twice to get a double-precision multiplication - you can
> store either the low half or the high half of a product in one
> instruction, but not both.

True, you need two instructions to get both halves of a 64x64 ==> 128-bit product on the Alpha. It always seemed wasteful to me to have an instruction (e.g. Alpha umulh) to get the upper half of a product, which simply discards the lower half - by way of contrast, MIPS dmult calculates the full 128-bit product in one shot, returning the result in *two* 64-bit integer registers - but apparently the Alpha designers wanted all instructions to be in two-input, one-output-register format, no exceptions. You might say they felt it would be less than RISCy to do otherwise.

On the Alpha 21064, integer muls are unpipelined and slow, so getting a 128-bit product takes around 30 cycles. On the 21164, imuls are partially pipelined - one can start at most every 8 cycles - so a double-width product needs around 16 cycles, very close to what the MIPS needs using its unpipelined dmultu instruction. On the 21264, imuls still have fairly long latency (I believe around 6-8 cycles) but are *fully* pipelined, so even with the need to use 2 instructions, one can effectively get a 128-bit product every 2 cycles, which really screams.

The IA-64 will (from what I've read) actually be very much like the 21264 in this respect: 2 instructions for the separate halves of a double-wide integer product (the IA-64 analog of Alpha umulh is called xma.hu), but here the muls will use the x86-style floating multiplier, whose 64-bit mantissa is of convenient length for 64-bit integer multiply.
Thus, imuls will benefit from the fact that the fmul unit is already designed to have fairly low latency and, more importantly, is highly pipelined.

How is any of this relevant to Mersenne testing? Well, fast 64-bit integer ops should make a well-designed all-integer convolution algorithm competitive with a floating-point-FFT-based one. However, neither of these two is truly satisfying, since each basically uses just half the functionality (integer or floating-point) of the processor. We really need an algorithm that does floating and integer convolutions in parallel and uses some method (e.g. the Chinese remainder theorem or some other form of error correction) to recorrelate the two result vectors at the end of each convolution step. Such methods are feasible, but highly nontrivial to implement, and even more challenging to implement efficiently.

Stay tuned...

-ernst
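[Editorial illustration, not from the original thread.] The two-instruction hi/lo split Ernst describes can be mimicked in Python, whose unbounded integers make the full 128-bit product explicit; the masking models 64-bit registers:

```python
# Sketch of a 64x64 -> 128-bit multiply split into two halves,
# mirroring the Alpha mulq/umulh instruction pair described above.
# Python integers are unbounded, so we mask to model 64-bit registers.

MASK64 = (1 << 64) - 1

def mulq(a, b):
    """Low 64 bits of the product (the Alpha mulq instruction)."""
    return ((a & MASK64) * (b & MASK64)) & MASK64

def umulh(a, b):
    """High 64 bits of the product (the Alpha umulh instruction)."""
    return ((a & MASK64) * (b & MASK64)) >> 64

a = 0xFEDCBA9876543210
b = 0x0F0F0F0F0F0F0F0F
lo, hi = mulq(a, b), umulh(a, b)

# Recombining the two halves recovers the full 128-bit product:
assert (hi << 64) | lo == a * b
```

MIPS dmultu delivers both halves in one instruction (into the HI/LO register pair); on Alpha and IA-64 the two calls above correspond to two separate instructions, which is why pipelining the multiplier matters so much for throughput.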
Mersenne Digest V1 #796
Mersenne Digest    Thursday, November 30 2000    Volume 01 : Number 796

------------------------------

Date: Tue, 28 Nov 2000 15:48:11 -0500
From: "Willmore, David (VS Central)" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: P4 - a correction

George wrote:

> One correction to my previous post. I said that the latency to
> access the L1 data cache was 2 clocks. This is correct for integer
> instructions only. For floating point and SSE2 instructions the latency
> is 6 clocks! Interestingly, the L2 cache latency is 7 clocks for both
> integer and floating point instructions.

Look at the coupling that the FPU has to the cache for one reason. I would expect that the FPU(s) have more ports on the L1 than the integer units do. Also, if you look at the sensitivity of different types of code to load latency, integer code is by far more sensitive than floating point. The floating-point pipeline is pretty long to start with, so you're going to *have* to unroll your code to take advantage of the pipeline - so you might as well cover the additional load-to-use latency the same way. With enough rename registers, it's all good. :)

Cheers, David

------------------------------

Date: Tue, 28 Nov 2000 16:48:13 -0500
From: George Woltman <[EMAIL PROTECTED]>
Subject: Mersenne: Re: P4 - a correction

Hi,

At 07:45 AM 11/28/00 +, Brian J. Beesley wrote:

> Surely the latency is much less relevant than the throughput,
> provided that there are sufficient registers and pipeline entries.

True. The manuals are unclear as to how many XMM registers are available. There are 8 user-visible ones and an unknown number of internal ones that you use via the P4's register renaming feature.

> The idea is to keep the execution units busy ...

I just finished designing the new code for one of the easier problems - the rounding and carry propagation step after the inverse FFT.
It looks like the P4 can do this in 6.5 clocks per double vs. about 19 clocks per double on the P3. This doesn't count any benefits from prefetching data. It also seems that the multiply units are idle enough in the new code that I can continue to use the compact two-to-fractional-power arrays, rather than two full 5MB arrays. Thus, yesterday's question of 33MB max is probably 23MB max for a 640K FFT. That's good, because I'd feel guilty using up 33MB. Heck, I feel guilty using 23MB.

I don't expect the above savings (3 times fewer clocks) to hold for the FFT code. The old FFT code was much better pipelined than the old carry propagation code.

> Bear in mind that there may be a severe penalty for switching between
> SSE2 and x86 FPU formats.

There is no penalty on the P4. The XMM registers do not overlap the FPU registers in the same way that MMX registers do.

Having fun,
George

------------------------------

Date: Tue, 28 Nov 2000 14:06:41 -0800
From: "Stephan T. Lavavej" <[EMAIL PROTECTED]>
Subject: Mersenne: More on Compression

> This 'Wonderful' compression technology may be "Awesome"; however, MY main
> objection or perhaps philosophy towards all of this is that Prime95 is
> not a large piece of code. It takes a relatively small amount of time to
> download over a modem compared to other software items that we modem users
> may download in a week's period. Maybe if Prime95 was ... say ... a 20MB
> download, then perhaps shaking things up to save some download time would
> be a good idea, but as things stand now we are only talking about saving
> 20 or so seconds. Perhaps that is the reason people may find it
> 'objectionable'. Maybe it's just not worth the hassle at the moment for
> this particular application.

Actually, the download time wouldn't change all that much. (The package would go from 400K to 300K.)
But the executable itself, once unzipped, would only be some 200K, as compared to over 1MB before. It's just more efficient. The question is: if compression involves a one-time, five-minute cost on the part of the developer and saves everyone a few seconds of download time and a few K of HD space, then why not? Why have bloated code? I sure like looking at 200K executables instead of megabyte-and-larger things. Makes me think of my old 486 where everything fit in 120MB.

> P.S. The website for the compression program never resolved for me; I'd
> like to take a look at it, if someone would send me the IP.

http://upx.tsx.org is a redirection URL for http://wildsau.idv.uni-linz.ac.at/mfx/upx.html.

> The P4 at 1.5 GHz is ju
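[Editorial illustration, not George's code.] The rounding-and-carry step George describes a few messages above - taken after the inverse FFT - can be sketched in Python under simplifying assumptions (a single fixed digit base, where real Lucas-Lehmer codes use variable per-position bases): round each floating-point convolution output to the nearest integer, keep the low digit, and carry the excess into the next position.

```python
# Hypothetical sketch of the rounding and carry propagation step after
# an inverse FFT. Assumes one fixed digit base for simplicity; actual
# Mersenne codes use a variable base and wrap the final carry around.

def round_and_carry(x, base):
    """Round floating-point FFT outputs and propagate carries.

    x    -- list of floats produced by the inverse FFT
    base -- the digit base (assumed fixed here)
    """
    digits = []
    carry = 0
    for v in x:
        n = int(round(v)) + carry      # round to integer, absorb incoming carry
        digits.append(n % base)        # low digit stays in this position
        carry = n // base              # excess propagates to the next position
    return digits, carry               # final carry wraps around mod 2^p - 1

digits, carry = round_and_carry([1000000.3, 2.1, 999999.8], 10**6)
```

The serial carry chain is exactly why George's measured clocks-per-double figure matters: this loop touches every element of the result array once per iteration of the Lucas-Lehmer test.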
Mersenne: Factoring
We could let prime95 decide the next election. Give everybody a different prime number. Multiply the primes for candidate A together, likewise for B.

If the factorization of the composite is not square-free, then we know that 1) someone voted twice for the same candidate, and 2) we know who that is. If the two composites contain the same factor, then we know that 1) someone voted for both candidates, and 2) again we know who.

No lawsuits, no recounts, no chads.

Frank
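[Editorial illustration; the voter names and helper functions below are invented.] Frank's scheme is easy to prototype: a double vote shows up as a square factor p^2 in one candidate's product, and a vote for both candidates shows up as a common factor of the two products, recoverable with a gcd:

```python
# Sketch of Frank's prime-number voting scheme. Each voter gets a
# distinct prime; a candidate's tally is the product of the primes
# of everyone who voted for that candidate.
from math import gcd

voter_prime = {"alice": 2, "bob": 3, "carol": 5, "dave": 7}

def tally(votes):
    """Multiply together the primes of the voters in `votes`."""
    product = 1
    for voter in votes:
        product *= voter_prime[voter]
    return product

def double_voters(product):
    """Voters whose prime squared divides the tally (voted twice)."""
    return [v for v, p in voter_prime.items() if product % (p * p) == 0]

def cross_voters(prod_a, prod_b):
    """Voters whose prime divides both tallies (voted for both candidates)."""
    g = gcd(prod_a, prod_b)
    return [v for v, p in voter_prime.items() if g % p == 0]

a = tally(["alice", "bob", "bob"])   # bob votes twice for candidate A
b = tally(["carol", "alice"])        # alice also votes for candidate B
```

Note the toy shortcut: with the voter-to-prime table in hand, divisibility tests replace the intractable factoring that the earlier replies joked about - which is precisely why knowing who won would be easy for the election board but hard for anyone else.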
Re: Mersenne: P4
On 29 Nov 00, at 20:31, Jason Stratos Papadopoulos wrote:

> > Humm, too much slowdown for single Ia32 instructions, Intel engineers
> > will know the reasons.

Presumably something to do with not being a parallel architecture; there's only one set of flags, so you can't start an ADC until the previous instruction has completed and the flags register has been retired. In IA32, there's nothing to tell you _which_ carry flag to use in an instruction which uses carry as an input or output.

> This and the 14-18 clock multiply are extremely depressing. For all the
> marketing hype about Intel catering to the internet and the next
> generation of "active media", they're making it awfully difficult to
> implement the cryptography that the internet needs.

Well, you're not _forced_ to buy Intel - even if you stick to Windoze! I get the impression that, with the Athlon architecture, AMD are making serious inroads into the market - especially at the lower end. Now it just so happens that the Athlon architecture runs Prime95 rather well.

> The alpha was already at least 5x faster than a PIII for multiprecision
> arithmetic at the same clock speed; with the P4 it will only get worse.

Are you sure about this? I think, with Alpha, you have to execute the instruction twice to get a double-precision multiplication - you can store either the low half or the high half of a product in one instruction, but not both. Of course, the Alpha's single-precision integer arithmetic is 64 bits, not 32. This does help somewhat :)

It's also easy to get confused between the latency (time between loading the opcode and finishing execution) and the throughput (the inverse of the longest time an instruction spends in any particular unit, times the number of the critical units involved in the design). The fact remains that the P4, like all other commercial processors, is a compromise.
This is what makes it so darned complicated, and also indicates why the designers have to make performance tradeoffs, some of which don't suit our particular application. It would be possible to optimize the design so that it executed the instructions _we_ find useful much more quickly, but whether such a processor would be capable of running standard commercial benchmarking programs at a reasonable speed - or indeed at all - is open to question ... we'd probably want to reduce the inherent complexity by junking the 16-bit x86 legacy, for a start...

Regards
Brian Beesley
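[Editorial illustration, not from the thread.] The ADC serialization Brian describes - each add-with-carry waiting on the previous instruction's carry-out through the single flags register - is the backbone of multiprecision addition. A sketch modeling 64-bit limbs in Python:

```python
# Sketch of multiprecision addition as a chain of add-with-carry steps,
# the pattern that serializes on the single x86 flags register: each
# limb's ADC must wait for the previous limb's carry-out.

MASK64 = (1 << 64) - 1

def mp_add(a, b):
    """Add two equal-length little-endian lists of 64-bit limbs."""
    result = []
    carry = 0
    for x, y in zip(a, b):
        s = x + y + carry          # the ADC: limb sum plus incoming carry
        result.append(s & MASK64)  # low 64 bits stay in this limb
        carry = s >> 64            # carry-out feeds the *next* ADC
    if carry:
        result.append(carry)       # overflow grows the number by one limb
    return result

# 2^64 - 1 plus 1 carries all the way into the next limb:
assert mp_add([MASK64, 0], [1, 0]) == [0, 1]
```

Because `carry` threads through every iteration, no two limb additions can overlap - which is exactly the dependency that a multi-flag (or carry-free, redundant-representation) design would break.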