> Hello All,
> This week, while doing some optimization, I found that np.fromstring()
> is significantly slower than many alternatives out there.  This function
> basically does two things: (1) it splits the string and (2) it converts the
> data to the desired type.
> There isn't much we can do about the conversion/casting so what I
> mean is that the *string splitting implementation is slow*.
> To simplify the discussion, I will just talk about string to 1d float64
> arrays.
> I have also issued pull request #279 [1] to numpy with some sample code.
> Timings can be seen in the ipython notebook here.
> It turns out that using str.split() followed by np.array() is 20 - 35%
> faster, which was non-intuitive to me.  That is to say:
> rawdata = s.split()
> data = np.array(rawdata, dtype=float)
> is faster than
> data = np.fromstring(s, sep=" ", dtype=float)
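> For anyone who wants to reproduce the comparison without the pull
> request code, here is a minimal sketch.  The function names, the input
> size, and the timeit settings are my own choices for illustration and
> are not taken from the fromstr module timed below.

```python
import timeit

import numpy as np

# Test input: 10,000 whitespace-separated floats (size chosen arbitrarily).
s = "100.0 " * 10000


def split_and_array(text):
    """Split in Python first, then let NumPy convert the list of strings."""
    return np.array(text.split(), dtype=float)


def fromstring(text):
    """Let NumPy both split and convert the text."""
    return np.fromstring(text, sep=" ", dtype=float)


def report():
    """Print the best per-call time for each strategy."""
    for func in (fromstring, split_and_array):
        best = min(timeit.repeat(lambda: func(s), number=10, repeat=3)) / 10
        print(f"{func.__name__}: {best * 1e3:.3f} ms per call")


if __name__ == "__main__":
    report()
```

> Both functions should return the same float64 array; the absolute
> numbers will depend on the machine and the numpy version.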
> The next thing to try, naturally, was Cython.  This did not change the
> timings much for these two strategies.  However, being in Cython
> allows us to call atof() directly.  My implementation is based on a
> previous thread on this topic [2].  However, in the example in [2], the
> string was hard coded, contained only one data value, and did not need to
> be split.  Thus they saw a dramatic 10x speed boost.  To deal with the
> more realistic case, I first just continued to use str.split().  This
> took 35 - 50% less time than np.fromstring().
> Finally, using the strtok() function in the C standard library to call
> atof() while we tokenize the string reduces the run time further, to
> 50 - 60% less than the baseline np.fromstring() time.
> Timings
> ------------
> In [1]: import fromstr
> In [2]: s = "100.0 " * 100000
> In [3]: timeit fromstr.fromstring(s)
> 10 loops, best of 3: 20.7 ms per loop
> In [4]: timeit fromstr.split_and_array(s)
> 100 loops, best of 3: 16.1 ms per loop
> In [6]: timeit fromstr.split_atof(s)
> 100 loops, best of 3: 13.5 ms per loop
> In [7]: timeit fromstr.token_atof(s)
> 100 loops, best of 3: 8.35 ms per loop
> Possible Explanation
> ----------------------------------
> Numpy's fromstring() function may be found here [3].  This code is a bit
> hard to follow, but it uses the array_from_text() function [4].  On the
> other hand, str.split() [5] uses a macro function, SPLIT_ADD().  The
> difference between these, I believe, is that str.split() over-allocates
> the size of the list more aggressively than array_from_text() does.  This
> leads to fewer resizes and thus fewer memory copies.
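> To make the resize argument concrete, here is a toy model that counts
> how many reallocations two growth policies need when items are appended
> one at a time.  Both policies are hypothetical and only illustrate the
> effect; I am not claiming they match what array_from_text() or
> str.split() actually do.

```python
def count_resizes(n_items, grow):
    """Count reallocations needed to append n_items one at a time.

    `grow` maps the current capacity to the next capacity and is called
    whenever the buffer is full.
    """
    capacity = 0
    resizes = 0
    for size in range(1, n_items + 1):
        if size > capacity:
            capacity = grow(capacity)
            resizes += 1
    return resizes


def fixed_step(capacity):
    # Hypothetical conservative policy: add a fixed 4096 slots each time.
    return capacity + 4096


def geometric(capacity):
    # Hypothetical aggressive policy: double the capacity each time.
    return max(1, capacity * 2)


if __name__ == "__main__":
    for policy in (fixed_step, geometric):
        print(policy.__name__, count_resizes(100000, policy))
```

> In the fixed-step model the number of resizes grows linearly with the
> input, while in the geometric model it grows only logarithmically.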
> This would also explain why the tokenize implementation is the fastest,
> since it pre-allocates the maximum possible array size and then slices it
> down.  No resizes are present in this function, though it requires more
> memory up front.
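> The pre-allocate-then-slice idea can be sketched in pure Python as
> follows.  This is only an illustration of the strategy: it is not the
> Cython strtok()/atof() code, the name token_parse is made up, and it
> uses the standard library array module instead of a numpy array.

```python
from array import array


def token_parse(text):
    """Parse whitespace-separated floats with a single up-front allocation."""
    # k tokens need at least 2k - 1 characters (k tokens, k - 1 separators),
    # so a string of length n holds at most (n + 1) // 2 tokens.
    max_tokens = (len(text) + 1) // 2
    out = array("d", [0.0]) * max_tokens  # zero-filled, never resized
    count = 0
    start = None  # index where the current token began, if inside one
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                out[count] = float(text[start:i])
                count += 1
                start = None
        elif start is None:
            start = i
    if start is not None:  # final token with no trailing separator
        out[count] = float(text[start:])
        count += 1
    return out[:count]  # slice down to the values actually parsed
```

> The buffer is allocated once at the worst-case size, so the memory cost
> is paid up front in exchange for never resizing.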
> Summary (tl;dr)
> ------------------------
> np.fromstring() is slow because of the mechanism it uses to split strings.
> This is likely due to how many resize operations it must perform.  While
> it need not be the *fastest* thing out there, it should probably be at
> least as fast as Python string splitting.
> No pull-request 'fixing' this issue was provided because I wanted to see
> what people thought and if / which option is worth pursuing.
> Be Well
> Anthony
> [1]
> [2]
> [3]
> [4]
> [5]

