Hello All,

This week, while doing some optimization, I found that np.fromstring() is significantly slower than many alternatives out there. This function basically does two things: (1) it splits the string and (2) it converts the data to the desired type.
There isn't much we can do about the conversion/casting, so what I mean is that the *string splitting implementation is slow*. To simplify the discussion, I will just talk about converting strings to 1d float64 arrays. I have also issued pull request #279 [1] to numpy with some sample code. Timings can be seen in the ipython notebook here.

It turns out that using str.split() and np.array() together is 20 - 35% faster, which was non-intuitive to me. That is to say:

    rawdata = s.split()
    data = np.array(rawdata, dtype=float)

is faster than

    data = np.fromstring(s, sep=" ", dtype=float)

The next thing to try, naturally, was Cython. This did not change the timings much for these two strategies. However, being in Cython allows us to call atof() directly. My implementation is based on a previous thread on this topic [2]. In the example in [2], however, the string was hard coded, contained only one data value, and did not need to be split, which is why they saw a dramatic 10x speed boost.

To deal with the more realistic case, I first just continued to use str.split() and called atof() on each piece. This took 35 - 50% less time than np.fromstring(). Finally, using the strtok() function in the C standard library to call atof() while we tokenize the string cuts the run time further, to 50 - 60% less than the baseline np.fromstring() time.

Timings
------------
In [1]: import fromstr

In [2]: s = "100.0 " * 100000

In [3]: timeit fromstr.fromstring(s)
10 loops, best of 3: 20.7 ms per loop

In [4]: timeit fromstr.split_and_array(s)
100 loops, best of 3: 16.1 ms per loop

In [6]: timeit fromstr.split_atof(s)
100 loops, best of 3: 13.5 ms per loop

In [7]: timeit fromstr.token_atof(s)
100 loops, best of 3: 8.35 ms per loop

Possible Explanation
----------------------------------
Numpy's fromstring() function may be found here [3]. This code is a bit hard to follow, but it uses the array_from_text() function [4]. On the other hand, str.split() [5] uses a macro function SPLIT_ADD().
The difference between these, I believe, is that str.split() over-allocates the size of the list in a more aggressive way than array_from_text() does. This leads to fewer resizes and thus fewer memory copies. This would also explain why the tokenize implementation is the fastest, since it pre-allocates the maximum possible array size and then slices it down. No resizes are present in this function, though it requires more memory up front.

Summary (tl;dr)
------------------------
np.fromstring() is slow because of the mechanism it uses to split strings. This is likely due to how many resize operations it must perform. While it need not be the *fastest* thing out there, it should probably be at least as fast as Python string splitting.

No pull request 'fixing' this issue was provided because I wanted to see what people thought and if / which option is worth pursuing.

Be Well
Anthony

[1] https://github.com/numpy/numpy/pull/279
[2] http://comments.gmane.org/gmane.comp.python.numeric.general/41504
[3] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3699
[4] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3418
[5] http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup
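
P.S. To make the resize argument under Possible Explanation a bit more concrete, here is a rough pure-Python sketch that counts how many reallocations, and how many element copies, it takes to append n items one at a time under different growth policies. Everything in it is an assumption made for illustration: the names simulate_appends, fixed_chunk and geometric are hypothetical, the constants (a 4096-element chunk, doubling from 16) are not taken from the numpy or CPython sources, and the copy count assumes the worst case where every resize moves all existing elements.

```python
def simulate_appends(n_items, grow):
    """Append n_items one at a time into a buffer whose capacity is
    enlarged by grow(capacity) whenever it is full.

    Returns (number of resizes, number of elements copied), assuming
    each resize copies every element already stored (worst case).
    """
    capacity = 0
    resizes = 0
    copied = 0
    for size in range(1, n_items + 1):
        if size > capacity:
            copied += size - 1          # elements already in the buffer
            capacity = grow(capacity)   # must return at least capacity + 1
            resizes += 1
    return resizes, copied


def fixed_chunk(capacity, chunk=4096):
    # Hypothetical conservative policy: grow by a constant amount.
    return capacity + chunk


def geometric(capacity, factor=2, initial=16):
    # Hypothetical aggressive policy: multiply the capacity each time.
    return capacity * factor if capacity else initial


if __name__ == "__main__":
    n = 100000  # same number of values as in the timings
    policies = [
        ("fixed chunk", fixed_chunk),
        ("geometric", geometric),
        ("pre-allocate", lambda capacity: n),  # one allocation up front
    ]
    for name, policy in policies:
        resizes, copied = simulate_appends(n, policy)
        print("%-12s resizes=%d copied=%d" % (name, resizes, copied))
```

With n = 100000, the fixed-chunk policy needs 25 resizes and copies 1228800 elements, doubling needs 14 resizes and copies 131056 elements, and pre-allocating needs a single allocation with no copies. This only models the bookkeeping; a real realloc() can often extend a block in place, so the actual cost gap may be smaller.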
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion