To answer your question: I special-case unicode (strings), ints, and floats. I am working on special-casing tuples (the elements can even be of different types, as long as each column is homogeneous). The best speedups will be for tuples of floats: there it bypasses three layers of useless checks.
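To make the float case concrete, here is a minimal sketch of the idea (illustrative names only, not the actual patch): do one cheap pre-pass to check that every element is exactly a float, and if so, swap in a compare with no type checks left in it at all:

    /* Sketch only; assumes the pre-pass below has succeeded.  The
     * exact-type check matters: a float subclass could override
     * __lt__, so anything that isn't exactly a PyFloat has to fall
     * back to the generic compare. */
    static int
    all_floats(PyObject **items, Py_ssize_t n)
    {
        for (Py_ssize_t i = 0; i < n; i++) {
            if (!PyFloat_CheckExact(items[i]))
                return 0;   /* heterogeneous: use the generic path */
        }
        return 1;
    }

    /* Under that assumption, every type check in the generic path is
     * known in advance to be 1 or 0, so the whole comparison collapses
     * to two loads and a compare. */
    static int
    unsafe_float_lt(PyObject *v, PyObject *w)
    {
        return PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w);
    }

The pre-pass is a single O(n) scan in front of an O(n log n) sort, so it's essentially free.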
If I run it without special-casing floats (just calling through tp_richcompare), I only get something like a 5-10% speedup. I'm working on rigorous benchmarks for all this stuff and will post a pdf along with the patch once it's all done. But it's certainly <10%. Part of this is because my special case is really low-level; for strings I've actually found the opposite: going through tp_richcompare gives me almost the same results as my special-case compare, since it still has to PyUnicode_READY the strings.

Regarding generalization: the general technique for special-casing is to substitute every type check with 1 or 0 by applying the type assumption you're making. That's the only way to guarantee it's safe and compliant.

Elliot

On Tue, Oct 11, 2016, 5:19 PM Jim J. Jewett <jimjjew...@gmail.com> wrote:

> Excellent.
>
> I'm surprised cache didn't save more, but less surprised than I was ... I
> hadn't realized that you were skipping the verifications in
> PyFloat_RichCompare as well. Does that generalize to other element types
> without exposing too much of the per-type internals to list.sort?
>
> Oh ... and I appreciate your not quoting private email as a general
> courtesy, but I hereby give you permission if it was mine that was
> private. [Though I think your summary was better than a quote anyhow.]
>
> -jJ
>
> On Oct 11, 2016 4:58 PM, "Elliot Gorokhovsky"
> <elliot.gorokhov...@gmail.com> wrote:
>
> So I got excited here. And the reason why is that I got those numbers *on
> Tim's benchmark*. When I got these kinds of numbers on my benchmarks, I
> figured there was probably a problem with the way I was timing, and
> certainly the gains couldn't be as extreme as they suggested. But this is
> on a benchmark that's already in the codebase!
>
> Here is a detailed explanation of how to reproduce my results, and the
> circumstances that would lead them to be invalid:
>
> ******************************************
>
> To reproduce, just activate a virtualenv, and then clone
> https://github.com/embg/python-fast-listsort.git. Then python setup.py
> install and python sortperf.py.
>
> Now let's look at what sortperf.py does and how it relates to Tim's
> benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I
> made three changes:
>
> 1. I added an import, "import fastlist". This obviously would not make
> sorting twice as fast.
>
> 2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" +
> " %7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again,
> irrelevant.
>
> 3. I changed the timing function
>
> #from this
>
> def doit(L):
>     t0 = time.perf_counter()
>     L.sort()
>     t1 = time.perf_counter()
>     print("%6.2f" % (t1-t0), end=' ')
>     flush()
>
> #to this
>
> def doit(L):
>     F = FastList(L)
>     f0 = time.perf_counter()
>     F.fastsort()
>     f1 = time.perf_counter()
>     F = FastList(L)
>     t0 = time.perf_counter()
>     F.sort()
>     t1 = time.perf_counter()
>     print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
>     flush()
>
> *******************************************
>
> So what we've shown is that (1) if you trust the existing sorting
> benchmark and (2) if my modification to doit() doesn't mess anything up
> (I leave this up to you to judge), then the measurements are valid. Which
> is a pretty big deal (50%!!!!!!!), hence my overexcitement.
>
> ****************************************
>
> Now I'd like to respond to responses I've gotten (the one I'm thinking of
> was off-list, so I don't want to quote it) questioning how it could be
> possible for such a small optimization, bypassing the type checks, to
> have such a large effect, even in theory. Here's my answer:
>
> Let's ignore branch prediction and cache for now and just look at a high
> level. The cost of sorting is related to the cost of a single comparison,
> because the vast majority of our time (let's say certainly at least 90%,
> depending on the list) is spent in comparisons. So let's look at the cost
> of a comparison.
>
> Without my optimization, comparisons for floats (that's what this
> benchmark looks at) go roughly like this:
>
> 1. Test the types of left and right for PyObject_RichCompare (which costs
> two pointer dereferences) and compare them. "3 ops" (quotes because
> counting ops like this is pretty hand-wavy). "2 memory accesses".
>
> 2. Get the address of the float compare method from
> PyFloat_Type->tp_richcompare. "1 op". "1 memory access".
>
> 3. Call the function whose address we just got. "1 op". "Basically 0
> memory accesses, because we count the stack stuff in that 1 op".
>
> 4. Test the types of left and right again in PyFloat_RichCompare and
> compare both of them to PyFloat_Type. "4 ops". "2 memory accesses".
>
> 5. Get the doubles out of the PyObject*s with PyFloat_AS_DOUBLE. "2 ops".
> "2 memory accesses".
>
> 6. Compare the floats and return. "2 ops".
>
> Now let's tally the "cost" (sorry for the quotes; I'm just trying to
> emphasize that this is an intuitive, theoretical explanation for the
> numbers, one that doesn't take the hardware into account):
> "13 ops, 7 memory accesses".
>
> Here's what it looks like in my code:
>
> 1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory
> accesses".
>
> 2. Compare the floats and return. "2 ops".
>
> Tally: "4 ops, 2 memory accesses".
>
> Now you can argue that branch prediction alleviates a lot of this cost,
> since we're taking the same branches every time. But note that, branch
> prediction or not, we still have to do all of those memory accesses, and
> since they're pointers to places all over memory, they miss the cache
> basically every time (correct me if I'm wrong). So memory-wise, we really
> are doing something like a 7:2 ratio, and op-wise, perhaps not as bad
> because of branch prediction, but still, 13:4 is probably bad no matter
> what's going on in the hardware.
>
> Now consider that something like 90% of our time is spent in those steps.
> Are my numbers really that unbelievable?
>
> Thanks for everything. Looking forward to writing this up as a nice LaTeX
> doc with graphs and perf benchmarks and all the other rigorous goodies,
> as well as a special-case compare function for homogeneous tuples and a
> simple patch file,
>
> Elliot
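For reference, here is roughly what steps 1-6 above look like in code: a heavily simplified sketch of the PyObject_RichCompare path, with the error handling, the NotImplemented fallback, and the subclass-priority logic all omitted:

    static int
    generic_float_lt(PyObject *v, PyObject *w)
    {
        /* Steps 1-3: dispatch on the type.  Py_TYPE(v) and Py_TYPE(w)
         * are the two pointer dereferences (the real code also compares
         * the two types to pick a dispatch order), fetching
         * tp_richcompare is another memory access, and then we call
         * through the function pointer. */
        richcmpfunc cmp = Py_TYPE(v)->tp_richcompare;
        PyObject *res = cmp(v, w, Py_LT);

        /* Steps 4-6 happen inside float_richcompare: both arguments are
         * checked against PyFloat_Type all over again, and only then
         * are the doubles extracted and compared. */
        int lt = (res == Py_True);
        Py_DECREF(res);
        return lt;
    }

Every one of those memory accesses disappears in the special-cased compare, which is where the 7:2 ratio in the quoted tally comes from.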