Hi Paul,

> ascii file is an input format (and the only one I can deal with)
> 
> HDF5 one might be an export one (it's one of the options) in order to speed 
> up the post-processing stage
> 
> 
> 
> Paul
> 
> 
> 
> 
> 
> On 2017-07-05 20:19, Thomas Caswell wrote:
> 
>> Are you tied to ASCII files?   HDF5 (via h5py or pytables) might be a better 
>> storage format for what you are describing.
>>  
>> Tom
>> 
>> On Wed, Jul 5, 2017 at 8:42 AM <paul.carr...@free.fr> wrote:
>> Dear all
>> 
>> 
>> 
>> I'm sorry if my question is too basic (it is not fully about NumPy itself, 
>> though it is about building matrices to work with NumPy afterward), but I'm 
>> spending a lot of time and effort trying to find a way to read data from an 
>> ASCII file and assign it into a matrix/array ... unsuccessfully!
>> 
>> 
>> 
>> The only way I found is to use the 'append()' instruction, which involves 
>> dynamic memory allocation. :-(
>> 
>> 
>> 
>> From my current experience under Scilab (a Matlab-like scientific solver), 
>> the usual workflow is well known:
>> 
>>      • Step 1: initialize the matrix, like 'np.zeros((n, n))'
>>      • Step 2: record the data
>>      • Step 3: write it into the matrix
>> 
>> 
>> I'm obviously influenced by my current experience, but I'm interested in 
>> moving to Python and its packages
>> 
>> 
>> 
>> For huge ASCII files (involving tens of millions of lines), my strategy is 
>> to work by 'blocks', as follows:
>> 
>>      • Find the line indices of the beginning and the end of one block (this 
>> implies that the file is read once)
>>      • Read the block
>>      • (repeat the process on the other blocks)
>> 
>> 
>> I tried different codes such as the one below, but each time Python tells me 
>> I cannot mix iteration and the record method.
>> 

If you are indeed tied to using ASCII input data, you will of course have to 
deal with significant
performance handicaps, but there are at least some gains to be had by using an 
input parser
that does not do all the conversions at the Python level, but uses a compiled 
(C) reader instead - either
pandas, as Tom already mentioned, or astropy - see e.g. 
https://github.com/dhomeier/astropy-notebooks/blob/master/io/ascii/ascii_read_bench.ipynb
for the almost one order of magnitude speed gain you may get.
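As a small illustration of the difference (the data here is generated in 
memory and the column layout is just an assumption, not your actual file):

```python
# Parsing the same whitespace-separated table with NumPy's Python-level
# reader vs pandas' compiled C reader; both return the same values, but
# the C parser is typically much faster on large files.
import io

import numpy as np
import pandas as pd

text = "\n".join("%d %d %d" % (i, i * 2, i * 3) for i in range(1000))

# Python-level parsing: flexible, but slow for tens of millions of lines
a = np.genfromtxt(io.StringIO(text))

# Compiled C parser: same result, roughly an order of magnitude faster
b = pd.read_csv(io.StringIO(text), sep=" ", header=None).to_numpy()
```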

In your example it is not clear what “record” method you were trying to use 
that raised the errors
you mention - we would certainly need a full traceback of the error to find out 
more.

In principle your approach of allocating the numpy matrix first and reading the 
data in chunks
makes sense, as it will avoid the much larger temporary lists created during 
read-in.
But it might be more convenient to just read each block into a list of lines 
and pass that to a
higher-level reader like np.genfromtxt, or the faster astropy.io.ascii.read or 
pandas.read_csv,
to speed up the parsing of the numbers themselves.
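A minimal sketch of that block-wise approach (sizes and file layout are 
illustrative assumptions; an in-memory StringIO stands in for the real file):

```python
# Preallocate the full array once, then read the file in blocks of lines
# and let np.loadtxt parse each block - no growing Python lists involved.
import io
import itertools

import numpy as np

n_rows, n_cols, block = 10_000, 3, 2_048
# stand-in for the big ASCII file
f = io.StringIO("\n".join("1.0 2.0 3.0" for _ in range(n_rows)))

out = np.zeros((n_rows, n_cols))              # step 1: allocate once
row = 0
while True:
    lines = list(itertools.islice(f, block))  # step 2: read one block of lines
    if not lines:
        break
    chunk = np.loadtxt(lines, ndmin=2)        # step 3: parse it and store it
    out[row:row + len(chunk)] = chunk
    row += len(chunk)
```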
That said, on most systems these readers should still be able to handle files 
of up to a few times 10^8
items (expect ~25-55 bytes of memory per input number to be allocated for 
temporary lists),
so if saving memory is not an absolute priority, directly reading the entire 
file might still be the
best choice (and would also save the first pass over the file).
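For completeness, the whole-file variant is a one-liner once a compiled 
parser is used (again with made-up in-memory data standing in for the file):

```python
# Read the entire file in one go with pandas' C parser and hand the values
# to NumPy; the temporary parsing overhead goes away once the final array
# exists, which holds just 8 bytes per float64 value.
import io

import numpy as np
import pandas as pd

f = io.StringIO("\n".join("%g %g" % (i, i * 0.5) for i in range(100_000)))
arr = pd.read_csv(f, sep=" ", header=None).to_numpy()
```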

Cheers,
                                        Derek

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion