Dear William, I am adding a somewhat more detailed performance report below, comparing my own FLINT-based C code, MAGMA, SAGE 4.1.2.alpha2 and SAGE 4.1.2.alpha2 with the patch from trac ticket #4000.
To give an impression of the results: for very small examples MAGMA marginally beats my C implementation, but I am very confident that I could turn this around using a lazy evaluation approach for changing values away from 0/1 (with 0 and 1 then implemented as NULL/NULL). Already at this stage, the regular (resp. patched) SAGE code is out by a factor of 100 (resp. 10). In the last example, with degrees 100 and 140, the C code beats MAGMA by a factor of close to 10, and the regular SAGE code is a factor of around 150,000 slower (for addition; I did not wait for the multiplication to finish), that is to say, essentially unusable for repeated calculations. Fortunately, at this stage the patched SAGE code is less than a factor of 2 slower than the C code and clearly beats the MAGMA code.

By the way, I wouldn't want to say that the C code is heavily optimised at the moment. When writing it, I focussed on writing a clean interface while adding as little overhead as possible, thereby taking advantage of the speed that FLINT provides. Nonetheless, apart from using the lazy evaluation approach mentioned above and trying to improve the combined operation addmul (used for matrix operations), I do not have any ideas for improvement at the moment. But of course, I'd be happy to incorporate any suggestions.

As a final comment, I want to add that the performance of the patched SAGE code shouldn't be a surprise. As part of the work on trac ticket #4000, I had already written a class "UnivariateRationalRationalFunction(FractionFieldElement)" to represent a rational function over the rationals, based on Z[t], which in turn is already based on FLINT. It is thus to be expected that it runs at a similar speed, apart from a small constant factor.

Kind regards,

Sebastian

[Performance comparison]

Performance comparison between the FLINT-based C code, MAGMA (Magma V2.15-12), SAGE (4.1.2.alpha2), and SAGE (4.1.2.alpha2 with the changes from trac ticket #4000).
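As an aside before the figures, the lazy evaluation idea for 0/1 mentioned in the letter could look roughly as follows. This is only a sketch: all names are hypothetical, Python's Fraction stands in for the rational-function type, and the C version would use a NULL/NULL numerator/denominator pair instead of None.

```python
from fractions import Fraction

class LazyRat:
    """Value where the common cases 0 and 1 are stored lazily as sentinels.

    num is None encodes the special values (zero=True -> 0, zero=False -> 1);
    in C this would be NULL numerator/denominator pointers plus a flag,
    so that no allocation happens until the value changes away from 0/1.
    """
    def __init__(self, num=None, zero=True):
        self.num = num      # a Fraction, or None for the lazy 0/1 case
        self.zero = zero    # which special value None encodes

    def value(self):
        # Materialise the value only when actually needed.
        if self.num is None:
            return Fraction(0) if self.zero else Fraction(1)
        return self.num

    def mul(self, other):
        # Fast paths: 0*x = 0 and 1*x = x, with no arithmetic performed.
        if self.num is None:
            return LazyRat() if self.zero else other
        if other.num is None:
            return LazyRat() if other.zero else self
        return LazyRat(self.num * other.num)

one = LazyRat(zero=False)
x = LazyRat(Fraction(22, 7))
print(one.mul(x).value())        # 22/7, returned without touching x's data
print(LazyRat().mul(x).value())  # 0
```

The point is that multiplying by the identity returns the other operand unchanged, which is why the small-example overhead could plausibly be removed.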
MAGMA was run on the departmental machine, with eight CPUs with the following specifications

    model name : Dual-Core AMD Opteron(tm) Processor 8220
    cpu MHz    : 1000.000
    cache size : 1024 KB

and 32 GB of memory. The other three programs were run on my laptop, with two CPUs with the following specifications

    model name : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
    cpu MHz    : 800.000
    cache size : 3072 KB

and 2 GB of memory.

We only compare addition and multiplication, based on three examples of different sizes. The number of repetitions only applies to the FLINT-based C code and MAGMA. For SAGE, we give the output of "timeit" and scale the number up by the number of repetitions to make it easily comparable. The C timings are done using the setup in the file qfmpz_poly-testp. The MAGMA code is executed as follows:

    F := FieldOfFractions(PolynomialRing(Rationals()));
    t := F.1;
    b := (3*t^2+2*t+1)/(t+10);
    c := (5*t+7)/13;
    t0 := Cputime();
    for i := 1 to 1000000 do
        a := b+c;
    end for;
    t1 := Cputime(t0);
    t1;

The SAGE code is timed with

    F.<t> = Frac(PolynomialRing(QQ, 't'))
    b = (3*t^2+2*t+1)/(t+10)
    c = (5*t+7)/13
    timeit('a = b+c')

In each case, we give the timings in the same order as the names of the programs at the top of this text.
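To make the scaled SAGE figures easy to check, the conversion is simply the timeit per-loop time multiplied by N. A sketch of the arithmetic only, using the medium-case numbers from the tables below:

```python
def scale(per_loop_seconds, repetitions):
    """Scale a timeit per-loop figure up to N repetitions, in seconds."""
    return per_loop_seconds * repetitions

# Medium case, N = 100,000: 252 ms per loop for plain SAGE addition,
# 270 µs per loop with the trac #4000 patch applied.
print(round(scale(252e-3, 100_000)))  # 25200 seconds, as in the table
print(round(scale(270e-6, 100_000)))  # 27 seconds, as in the table
```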
In the small case, we use N = 1,000,000 repetitions and

b = (3*t^2+2*t+1)/(t+10)
c = (5*t+7)/13

ADD    3.41     3.010    263 (625 loops, best of 3: 263 µs per loop)    50 (625 loops, best of 3: 50 µs per loop)
MUL    3.74     4.340    391 (625 loops, best of 3: 391 µs per loop)    58.2 (625 loops, best of 3: 58.2 µs per loop)

In the medium case, we use N = 100,000 repetitions and

b = (-19836522420*t^29+79346089680*t^28+57706247040*t^27+238038269040*t^26+158692179360*t^25+555422627760*t^24+5119102560*t^23-714114807120*t^22+52897393120*t^21+79346089680*t^20-79346089680*t^19-1983652242000*t^16-39673044840*t^14+39673044840*t^13+4959130605*t^12+79346089680*t^11+359032080*t^10-39673044840*t^6-119019134520*t^5+15869217936*t-26448696560)/
(158692179360*t^30+7934608968*t^29-793460896800*t^28+13224348280*t^27-1017257560*t^25+741552240*t^24+52897393120*t^23-79346089680*t^21-46285218980*t^20+29754783630*t^19+158692179360*t^18-555422627760*t^17-158692179360*t^15-39673044840*t^14-7213280880*t^11-39673044840*t^10+79346089680*t^9-449627841520*t^8-79346089680*t^7-5805811440*t^5-79346089680*t^4+79346089680*t^3-26448696560*t-634768717440)
c = (91560*t^30-183120*t^29+45780*t^27-30520*t^26-6226080*t^24-91560*t^22+366240*t^21+91560*t^20+45780*t^19+91560*t^17+45780*t^16-91560*t^15+137340*t^14-30520*t^13-30520*t^12-45780*t^11+274680*t^10+1007160*t^8-9156*t^7-274680*t^6-11445*t^4+9240*t^3-412020*t^2-39240*t+366240)/
(45780*t^30-91560*t^29-91560*t^28-91560*t^27-45780*t^25+22890*t^24+91560*t^23+91560*t^22+183120*t^21+183120*t^20+10071600*t^19-18312*t^18+549360*t^16-45780*t^14+549360*t^13-91560*t^12+274680*t^10-45780*t^9-91560*t^8+549360*t^7-45780*t^6+503580*t^5+91560*t^3-137340*t^2+457800*t+91560)

ADD    16.4     30.890    25200 (5 loops, best of 3: 252 ms per loop)      27 (625 loops, best of 3: 270 µs per loop)
MUL    19.43    47.670    3370000 (1 loops, best of 1: 33.7 s per loop)    34.4 (625 loops, best of 3: 344 µs per loop)

In the large case, we use N = 100,000 repetitions and

b = (t^99-5*t^98+4*t^96+t^95-t^94+4*t^93-8*t^92-310*t^91-
t^89+21*t^86-10*t^85+333*t^84-28*t^81+t^79+5*t^78-4*t^77+5*t^76-5*t^75-5*t^74- t^73+t^72-2*t^71+2*t^69+t^68+8*t^67- t^66-2*t^65+6*t^64+t^63+t^62+2*t^61+4*t^60+t^59+6*t^57-2*t^56+t^55-5*t^52-63*t^51-2*t^49- t^48+t^47-t^46+2*t^45+t^44- t^43+5*t^42-2*t^41+15*t^40+t^39-2*t^38+t^37-55*t^36+t^35+t^34-7*t^33- t^31+4*t^30-2*t^29+t^27+2*t^26-3*t^25+t^24-2*t^23+2*t^21+2*t^20+t^19+t^18- t^17-94*t^16+4*t^15+2*t^14-3*t^13-t^12+4*t^11+t^10-t^9+t^7+2*t^6+t^5- t^4)/(2*t^100- t^99-2*t^98+2*t^97+2*t^95+2*t^94+11*t^93+2*t^91+t^90+t^89+t^88- t^87+2*t^86+12*t^85-2*t^83+3*t^81+9*t^80+t^79-9*t^78- t^76+2*t^75+50*t^74-t^73+t^72-2*t^71-2*t^70- t^65+t^64+t^63+3*t^62-2*t^60+t^58-5*t^57-2*t^55-5*t^54+t^53-t^52-t^51- t^50-2*t^49+7*t^48+2*t^46+3*t^44+2*t^43+60*t^41- t^38+t^37+7*t^36+7*t^34-t^32+6*t^30-3*t^28+t^27+29*t^26-t^25- t^24+7*t^23+17*t^21-4*t^20+10*t^19+t^18+t^16-t^15+t^14+t^13- t^12+t^11+t^10+t^9+t^8-t^7+14*t^6-t^5-t^4+t^3-t^2-t +1) c = (-t^137-5*t^136-t^135-t^134-2*t^133+t^131+8*t^130-t^129- t^128-10*t^127+t^125+t^124+9*t^123-t^122-16*t^121+2*t^120-t^119- t^118+2*t^117-2*t^116-t^114-8*t^111+t^110+t^108+t^107- t^106+10*t^104-6*t^103- t^102+t^101+5*t^99-81*t^98+t^97-2*t^96+2*t^95-5*t^94+t^93-2*t^92+5*t^91+t^90- t^89+t^87-7*t^86-9*t^85-6*t^84-t^83+23*t^82-2*t^81-t^80+2*t^78- t^77-2*t^75-3*t^74+t^73+19*t^71-20*t^70+2*t^68-29*t^67-t^66-2*t^65- t^63+133*t^62+t^61+8*t^60+5*t^59- t^58-2*t^57+5*t^56+t^55-2*t^54-9*t^51-4*t^50+14*t^49-t^48- t^47+6*t^46+t^45+5*t^43+3*t^42-9*t^41+t^40-3*t^39+t^38+4*t^37+2*t^36+3*t^35-2*t^34- t^33-t^31-t^30-3*t^28-t^27+t^25+2*t^24-4*t^23+t^22+t^21-t^20- t^19+3*t^18+4*t^17+t^16-51*t^15-2*t^14-t^13- t^12-2*t^11+3*t^10+t^9-2*t^8+t^7-3*t^6+4*t^4+3*t^3-t^2)/ (t^137+t^136-3*t^135-t^134+2*t^133- t^132+20*t^131-3*t^129+t^128-2*t^126-t^125-t^124-2*t^123-t^121+2*t^120- t^119+t^118-77*t^116+t^115-t^113-4*t^110-2*t^109- t^108+t^106+4*t^105-2*t^103+t^102+t^101+10*t^100-23*t^99- t^98-22*t^97-2*t^94-7*t^93+t^91+2*t^90-2*t^89+2*t^87-6*t^86- t^85+t^84-2*t^82-28*t^81- 
t^79-176*t^78-9*t^77+t^76-6*t^75-3*t^74+t^73-23*t^72-6*t^71+10*t^70+t^69+2*t^68-2*t^67-4*t^66+37*t^64-t^63+2*t^62-2*t^61+t^60+t^59-2*t^58+67*t^56-10*t^55-8*t^54+25*t^53+2*t^52-5*t^51+t^50-2*t^49-t^48-t^47-2*t^46-5*t^45-t^43-t^42-t^41+5*t^39+2*t^35-4*t^34+t^33+3*t^32-t^30-3*t^29+t^28+2*t^27+2*t^26+t^25+t^24-t^23+t^22+3*t^21-t^20+t^19-3*t^17+4*t^15-t^14+4*t^13+2*t^11+t^9-t^8+t^7-t^4-t^3-t^2-12*t-17)

ADD    27.62    180.040    3980000 (1 loops, best of 1: 39.8 s per loop)    36.4 (625 loops, best of 3: 364 µs per loop)
MUL    35.79    285.680    NA (manually terminated before completion)       57 (625 loops, best of 3: 570 µs per loop)

[/Performance comparison]

On Nov 15, 2:05 am, William Stein <wst...@gmail.com> wrote:
> On Sat, Nov 14, 2009 at 5:59 PM, Sebastian Pancratz <s...@pancratz.org> wrote:
>
> > Dear all,
>
> > Some might remember that in September I put a lot of effort into
> > rewriting Q[t] to use FLINT.  While the patch (see trac #4000) was
> > very usable in practice already, despite the help of many people
> > there remained a few doctest failures throughout SAGE.
>
> Could you add something to this email about performance comparisons
> with Magma?  It's one thing to say "Sage is slow compared to my custom
> optimized C code" (no surprise), and another to say "Sage is 1000
> times slower than some easy-to-write interpreter code in Magma" (ouch,
> and is the case here, right?).
>
> William
>
> > In short, while using SAGE with my own PhD work (on computing
> > connection matrices) I came across a couple other issues with a high
> > overhead involved in using SAGE and so I decided to write plain C.
> > Currently, I am occupied with preparing my first year transfer
> > thesis, which I think serves the same purpose as the qualifying
> > exams in the US, but I am hoping that afterwards I can find the time
> > to finish the above ticket, if no-one else closes it before that.
> > In the meantime, I wanted to briefly describe the problems I ran
> > into using SAGE and let people know about some of the C code I have
> > written to overcome these problems (at least with regards to my own
> > work), in case someone else is interested.
>
> > The first problem concerns the rational function field over Q.
> > Using this via
>
> > F.<t> = Frac(PolynomialRing(QQ, 't'))
>
> > was slow without applying the patch at trac #4000, but even with
> > that applied there seemed to still be a very noticeable overhead due
> > to some category related code.  In fairness, I have to admit that I
> > did not go through too much trouble to chase down the exact problem.
> > But in any case I have written *much* faster C code based on FLINT
> > --- I include a link to this at the very end of this post.  I should
> > also point to my post on flint-devel
> >
> > http://groups.google.com/group/flint-devel/browse_thread/thread/43a57...
>
> > I think that this code is set up quite well and that other people
> > might be interested in this.
>
> > The second problem had to do with monomials in multivariate
> > polynomial rings.  Here perhaps I should first say that I have only
> > been working with three to five variables, that is to say, there is
> > no notion of asymptotic performance as the number of variables
> > increases.  My main motivation to look at this was that, on some
> > examples, my code for computing connection matrices made 5--100
> > million calls to monomial comparisons alone.  In SAGE, there was a
> > very noticeable improvement in performance at each step when moving
> > from using MPolynomial_polydict objects in a multivariate polynomial
> > ring to working directly with ETuple objects to finally working with
> > a Cython wrapper of a C array of ints.  In my C code, I have now
> > taken this a step further, representing an entire monomial as a
> > primitive type in a 64-bit word.
> > Benefits are the greatly simplified data structure and that a
> > couple of operations (multiplication, division, inverse
> > lexicographical comparison, to name the most prominent ones) reduce
> > to a simple operation on single words ("+", "-" and "<",
> > respectively).  There are a few downsides, in particular with
> > regards to generality.  For example, if one wants to use up to eight
> > variables and have a fixed number of bits per exponent regardless of
> > the number of variables (so that one can have implicit coercion
> > between $R[X_0, X_1] \subset \dotsb \subset R[X_0, \dotsc, X_7]$),
> > then one needs to guarantee that no computation involves individual
> > exponents greater than 255.  On the other hand, if one is working
> > with curves in three homogeneous variables, then this approach
> > allows exponents up to 65535, which will probably suffice for most
> > practical purposes.
>
> > Of course, this implementation is much faster than my previous SAGE
> > code.  But it is much more limited and I don't quite see it being of
> > much use in a general framework.  Nonetheless, in case someone is
> > interested in taking a look at this, I have uploaded it to my
> > domain, too.
>
> > I've uploaded these two implementations, together with full online
> > code documentation generated by doxygen, to my domain at
> >
> > http://www.pancratz.org/code/
>
> > Please don't be surprised by the simplistic design of the website;
> > so far I am mostly using it for private purposes.
> > Kind regards,
>
> > Sebastian
>
> --
> William Stein
> Associate Professor of Mathematics
> University of Washington
> http://wstein.org
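As a postscript on the packed-monomial representation described in the quoted message above, the idea can be sketched in a few lines of Python. This is a sketch under assumed parameters only (4 variables with 16 bits per exponent is one hypothetical layout); the actual C code operates on a single 64-bit word.

```python
BITS = 16                 # bits per exponent field; 4 variables fit in 64 bits
MASK = (1 << BITS) - 1

def pack(exponents):
    """Pack small non-negative exponents into one integer word.

    Variable i occupies bits [BITS*i, BITS*(i+1)), so the highest-index
    variable is the most significant field."""
    word = 0
    for i, e in enumerate(exponents):
        assert 0 <= e <= MASK     # no overflow into the neighbouring field
        word |= e << (BITS * i)
    return word

def unpack(word, nvars=4):
    """Recover the exponent tuple from a packed word."""
    return tuple((word >> (BITS * i)) & MASK for i in range(nvars))

# Monomial multiplication and division become single word additions and
# subtractions, since the 16-bit fields cannot carry into each other as
# long as no individual exponent overflows its field.
a = pack((1, 2, 0, 3))    # x0^1 * x1^2 * x3^3
b = pack((2, 0, 1, 0))    # x0^2 * x2^1
print(unpack(a + b))      # (3, 2, 1, 3): the exponent vector of the product
```

Comparing two packed words with "<" likewise gives a monomial order in one machine instruction, with the highest-index variable weighted most heavily in this layout.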