Anthony,

Thanks for looking into this. A few other notes about fromstring() (and fromfile()).
Frankly, they haven't gotten much love -- they are, as you have seen, less
than optimized, and kind of buggy (actually, not really buggy, but not
robust in the face of malformed input -- and they give results that are
wrong in some cases, rather than raising an error, for instance). So they
really do need some attention. On the other hand, folks are working on
various ways to optimize reading data from text files (and maybe strings),
so that may be a better way to go.

If you google "fromstring barker numpy" you'll find a thread or two with
what I learned, and pointers to a couple of tickets. What I do remember:

The use of atof and friends is complicated because there are Python
versions that extend the C lib versions, and numpy versions that extend
those (for better NaN handling, for instance).

The lack of robustness stems from the fact that the error checking is not
done right when calling atof and friends -- i.e. you need to check whether
the pointer was incremented to see if a value was successfully read. With
the layered calls to the numpy and Python versions, I found it very hard
to fix this.

Profile carefully to check your theory that limited over-allocation of
memory is the source of the performance issues -- when I've tested similar
code, it made little difference; allocating and copying memory is actually
pretty fast. If you re-allocate and copy on every single append, it's
slow, yes, but I found virtually no difference between over-allocating by,
say, 10% or 50% (I'm not sure what the lowest reasonable value was).

Good luck,
-Chris

On Sun, May 13, 2012 at 4:28 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> Hello All,
>
> This week, while doing some optimization, I found that np.fromstring()
> is significantly slower than many alternatives out there. This function
> basically does two things: (1) it splits the string and (2) it converts
> the data to the desired type.
>
> There isn't much we can do about the conversion/casting, so what I mean
> is that the string splitting implementation is slow.
>
> To simplify the discussion, I will just talk about string to 1d float64
> arrays. I have also issued pull request #279 [1] to numpy with some
> sample code. Timings can be seen in the ipython notebook here.
>
> It turns out that using str.split() and np.array() is 20 - 35% faster,
> which was non-intuitive to me. That is to say:
>
>     rawdata = s.split()
>     data = np.array(rawdata, dtype=float)
>
> is faster than
>
>     data = np.fromstring(s, sep=" ", dtype=float)
>
> The next thing to try, naturally, was Cython. This did not change the
> timings much for these two strategies. However, being in Cython allows
> us to call atof() directly. My implementation is based on a previous
> thread on this topic [2]. However, in the example in [2], the string
> was hard coded, contained only one data value, and did not need to be
> split. Thus they saw a dramatic 10x speed boost. To deal with the more
> realistic case, I first just continued to use str.split(). This took
> 35 - 50% less time than np.fromstring().
>
> Finally, using the strtok() function in the C standard library to call
> atof() while we tokenize the string reduces the time by a further
> 50 - 60% relative to the baseline np.fromstring() time.
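Interjecting for anyone following along: here is a minimal Cython sketch
of what I understand that strtok()/atof() approach to look like. This is
my own reconstruction of the technique, not the code from Anthony's pull
request, so the names and details are illustrative only:

    # fromstr.pyx -- illustrative sketch, not the PR code
    import numpy as np
    from libc.stdlib cimport atof, malloc, free
    from libc.string cimport memcpy, strtok

    def token_atof(bytes s):
        """Parse a space-separated byte string into a float64 array."""
        cdef Py_ssize_t slen = len(s)
        # strtok() writes NUL bytes into its argument, so work on a
        # private copy rather than the immutable bytes buffer.
        cdef char* buf = <char*> malloc(slen + 1)
        if buf == NULL:
            raise MemoryError()
        memcpy(buf, <char*> s, slen)
        buf[slen] = 0
        # Pre-allocate the worst case: every other byte starts a token.
        cdef double[::1] out = np.empty(slen // 2 + 1, dtype=np.float64)
        cdef Py_ssize_t n = 0
        cdef char* tok = strtok(buf, " ")
        while tok != NULL:
            out[n] = atof(tok)
            n += 1
            tok = strtok(NULL, " ")
        free(buf)
        # Slice down to the values actually parsed.
        return np.asarray(out[:n]).copy()

Note the pre-allocation to the worst-case token count and the final
slice-down -- that's the "no resizes" behavior described in the
explanation below. It also illustrates the robustness trap I mentioned
above: atof() silently returns 0.0 for any token it can't parse.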
>
> Timings
> ------------
>
> In [1]: import fromstr
>
> In [2]: s = "100.0 " * 100000
>
> In [3]: timeit fromstr.fromstring(s)
> 10 loops, best of 3: 20.7 ms per loop
>
> In [4]: timeit fromstr.split_and_array(s)
> 100 loops, best of 3: 16.1 ms per loop
>
> In [6]: timeit fromstr.split_atof(s)
> 100 loops, best of 3: 13.5 ms per loop
>
> In [7]: timeit fromstr.token_atof(s)
> 100 loops, best of 3: 8.35 ms per loop
>
> Possible Explanation
> ----------------------------------
>
> Numpy's fromstring() function may be found here [3]. This code is a bit
> hard to follow, but it uses the array_from_text() function [4]. On the
> other hand, str.split() [5] uses a macro function SPLIT_ADD(). The
> difference between these is that I believe str.split() over-allocates
> the size of the list more aggressively than array_from_text() does.
> This leads to fewer resizes and thus fewer memory copies.
>
> This would also explain why the tokenize implementation is the fastest,
> since it pre-allocates the maximum possible array size and then slices
> it down. No resizes are present in this function, though it requires
> more memory up front.
>
> Summary (tl;dr)
> ------------------------
>
> np.fromstring() is slow because of the mechanism it uses to split
> strings. This is likely due to how many resize operations it must
> perform. While it need not be the *fastest* thing out there, it should
> probably be at least as fast as Python string splitting.
>
> No pull request 'fixing' this issue was provided because I wanted to
> see what people thought, and whether / which option is worth pursuing.
>
> Be Well
> Anthony
>
> [1] https://github.com/numpy/numpy/pull/279
> [2] http://comments.gmane.org/gmane.comp.python.numeric.general/41504
> [3] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3699
> [4] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3418
> [5] http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
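P.S. To make the robustness point above concrete: atof() returns 0.0 for
garbage input with no way to detect the failure, while strtod() reports
through its end pointer whether it actually consumed anything. Here's a
minimal Cython sketch of the check I mean (the helper name is mine, for
illustration only):

    from libc.stdlib cimport strtod

    cdef double checked_strtod(char* p) except? -1.0:
        # strtod() sets end to the first character it did not consume;
        # if end never advanced past p, no value was parsed.
        cdef char* end = NULL
        cdef double val = strtod(p, &end)
        if end == p:
            raise ValueError("no parsable float at this position")
        return val

As far as I could tell, it's a check like this that gets lost somewhere
in the layered numpy/Python wrappers around atof().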