And I forgot to attach the relevant code (though it is also in my fork)...

On Sun, May 13, 2012 at 6:28 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> Hello All,
>
> This week, while doing some optimization, I found that np.fromstring()
> is significantly slower than many alternatives out there. This function
> basically does two things: (1) it splits the string and (2) it converts
> the data to the desired type.
>
> There isn't much we can do about the conversion/casting, so what I
> mean is that the *string splitting implementation is slow*.
>
> To simplify the discussion, I will just talk about string to 1d float64
> arrays. I have also issued pull request #279 [1] to numpy with some
> sample code. Timings can be seen in the ipython notebook here.
>
> It turns out that using str.split() and np.array() is 20 - 35% faster,
> which was non-intuitive to me. That is to say:
>
>     rawdata = s.split()
>     data = np.array(rawdata, dtype=float)
>
> is faster than
>
>     data = np.fromstring(s, sep=" ", dtype=float)
>
> The next thing to try, naturally, was Cython. This did not change the
> timings much for these two strategies. However, being in Cython
> allows us to call atof() directly. My implementation is based on a
> previous thread on this topic [2]. However, in the example in [2], the
> string was hard coded, contained only one data value, and did not need
> to be split. Thus they saw a dramatic 10x speed boost. To deal with the
> more realistic case, I first just continued to use str.split(). This
> took 35 - 50% less time than np.fromstring().
>
> Finally, using the strtok() function in the C standard library to call
> atof() while we tokenize the string cuts the run time further, by
> 50 - 60% relative to the baseline np.fromstring() time.
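
For anyone who wants to try the first comparison without building the
Cython module, here is a minimal pure-Python sketch of the two strategies
quoted above. The function names, the sample string, and the timing loop
are placeholders for illustration and are not taken from fromstr.pyx;
the numbers it prints will depend on the machine and the numpy version.

```python
import timeit

import numpy as np


def split_and_array(s):
    """Split on whitespace in Python, then let NumPy convert the tokens."""
    rawdata = s.split()
    return np.array(rawdata, dtype=float)


def fromstring_baseline(s):
    """Let np.fromstring() do both the splitting and the conversion."""
    return np.fromstring(s, sep=" ", dtype=float)


if __name__ == "__main__":
    # Hypothetical sample data: 100000 space-separated floats.
    s = " ".join(["100.0"] * 100000)
    for func in (fromstring_baseline, split_and_array):
        # Best of 3 runs of 10 calls each, reported per call.
        best = min(timeit.repeat(lambda: func(s), number=10, repeat=3)) / 10
        print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```

Both functions should return the same float64 array for well-formed
input, so the only thing being compared is how the string gets split.
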
>
> Timings
> ------------
> In [1]: import fromstr
>
> In [2]: s = "100.0 " * 100000
>
> In [3]: timeit fromstr.fromstring(s)
> 10 loops, best of 3: 20.7 ms per loop
>
> In [4]: timeit fromstr.split_and_array(s)
> 100 loops, best of 3: 16.1 ms per loop
>
> In [6]: timeit fromstr.split_atof(s)
> 100 loops, best of 3: 13.5 ms per loop
>
> In [7]: timeit fromstr.token_atof(s)
> 100 loops, best of 3: 8.35 ms per loop
>
> Possible Explanation
> ----------------------------------
> Numpy's fromstring() function may be found here [3]. This code is a bit
> hard to follow, but it uses the array_from_text() function [4]. On the
> other hand, str.split() [5] uses a macro function SPLIT_ADD(). I believe
> the difference between these is that str.split() over-allocates the
> size of the list in a more aggressive way than array_from_text(). This
> leads to fewer resizes and thus fewer memory copies.
>
> This would also explain why the tokenize implementation is the fastest,
> since it pre-allocates the maximum possible array size and then slices
> it down. No resizes are present in this function, though it requires
> more memory up front.
>
> Summary (tl;dr)
> ------------------------
> np.fromstring() is slow in the mechanism it chooses to split strings.
> This is likely due to how many resize operations it must perform. While
> it need not be the *fastest* thing out there, it should probably be at
> least as fast as Python string splitting.
>
> No pull-request 'fixing' this issue was provided because I wanted to see
> what people thought and if / which option is worth pursuing.
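
To make the "pre-allocate the maximum possible size, then slice it down"
idea concrete, here is a rough pure-Python sketch. It is only a stand-in
for the strtok()/atof() version: the function name is hypothetical, the
scanning loop is written by hand rather than copied from fromstr.pyx, and
it will be far slower than the Cython code. The point is just that the
output buffer is allocated once and never resized.

```python
import numpy as np


def token_float_prealloc(s):
    """Parse whitespace-separated floats into a pre-allocated array.

    Hypothetical helper illustrating the allocation strategy only.
    """
    # k tokens need at least k characters plus k - 1 separators, so a
    # string of length n can hold at most (n + 1) // 2 tokens.
    max_tokens = (len(s) + 1) // 2
    out = np.empty(max_tokens, dtype=np.float64)

    n = 0  # number of values written so far
    i, length = 0, len(s)
    while i < length:
        # Skip any run of whitespace separators.
        while i < length and s[i].isspace():
            i += 1
        start = i
        # Advance to the end of the current token.
        while i < length and not s[i].isspace():
            i += 1
        if i > start:
            out[n] = float(s[start:i])
            n += 1

    # Slice down to the part that was filled. This is a view, so the
    # full buffer stays allocated for as long as the result is alive.
    return out[:n]
```

The trade-off is the one mentioned above: no resizes and no copies while
parsing, at the cost of holding a buffer sized for the worst case.
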
>
> Be Well
> Anthony
>
> [1] https://github.com/numpy/numpy/pull/279
> [2] http://comments.gmane.org/gmane.comp.python.numeric.general/41504
> [3] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3699
> [4] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3418
> [5] http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup

Attachments:
  fromstr.pyx
  setup.py
  fromstr.ipynb
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion