Hi,

I use libxml xpath engine on quite large (and mostly "flat") xml files. It 
seems that Shellsort, that is used in xmlXPathNodeSetSort is a performance 
bottleneck for my case. I have read some posts about sorting in libxml in the 
libxml archive, but I agree that qsort was not the way to go. I experimented 
with Timsort instead and my results were good for me. For about 10000 nodes, my 
test was about 5x faster with Timsort, for 1000 nodes about 10% faster, for 
small data files, the difference was not measurable.

The patch is attached. It is not very polished, but it can be done on demand. I 
took Timsort implementation from https://github.com/swenson/sort (it has MIT 
Public License) The implementation uses its own BINARY_INSERTION_SORT if the 
size of sorted data is less than 64. I had to make it compile on msvc (msvc 
does not support C99, I had to backport it to C89). I have also removed some 
unused sorting algorithms.

Timsort is able to utilize sorted runs of data (that is the reason why 
Shellsort is used in libxml as far as I know). It is a widely used algorithm, 
e.g. in python. The only disadvantage I know is that it uses O(n) memory. It is 
not a problem for me, but I don't know about others. If this is not acceptable, 
I could try to find some Smoothsort implementation. Smoothsort has similar 
properties like Timsort, but uses only O(1) memory. From what I have read, it 
is quite tricky to implement correctly though.

Are there any performance tests for libxml I could run?

Vojtech

Attachment: SortOptimization.patch
Description: SortOptimization.patch

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml

Reply via email to