OK, you have two performance "issues":

1) Memory use: if you need to read a file to build a numpy array, and don't know how big it is when you start, you have to accumulate the values first and then make an array out of them. And numpy arrays are fixed size, so they cannot efficiently accumulate values.

The usual way to handle this is to read the data into a list with .append() or the like, and then make an array from it. This is quite fast -- lists are fast and efficient to append to. However, you are then storing (at least) a pointer and a Python float object for each value, which is a lot more memory than a single float in a numpy array; and when you make the array from the list, you have the full list, all its Python floats, AND the array in memory at once. Frankly, computers have a lot of memory these days, so this is a non-issue in most cases.
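The basic pattern is something like this (a minimal sketch -- the filename and the one-value-per-line format are just placeholders):

    import numpy as np

    values = []
    with open("data.txt") as f:  # made-up filename; assumes one float per line
        for line in f:
            values.append(float(line))
    arr = np.array(values, dtype=np.float64)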
Nonetheless, a while back I wrote an extendable numpy array object to address just this issue. You can find the code on GitHub here:

https://github.com/PythonCHB/NumpyExtras/blob/master/numpy_extras/accumulator.py

I have not tested it with recent numpys, but I expect it still works fine. It's also py2, but it wouldn't take much to port. In practice it uses less memory than the "build a list, then make it into an array" approach, but it isn't any faster unless you add (.extend) a bunch of values at once rather than one at a time (if you do it one at a time, the Python-float-to-numpy-float conversion and the function call overhead take just as long). But it will generally be as fast or faster than using a list, and use less memory, so it's a fine basis for a big ascii file reader.
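The core trick is plain over-allocation: keep a buffer larger than the data and double it whenever it fills, so appends are amortized O(1). A stripped-down sketch of the idea (just the gist, not the code at the link above):

    import numpy as np

    class GrowableArray:
        """A 1-D array you can extend: over-allocate, double when full."""

        def __init__(self, dtype=np.float64):
            self._buf = np.empty(16, dtype=dtype)
            self._len = 0

        def extend(self, values):
            values = np.asarray(values, dtype=self._buf.dtype)
            end = self._len + len(values)
            while end > len(self._buf):
                # double the buffer; np.resize returns a new, larger array
                self._buf = np.resize(self._buf, len(self._buf) * 2)
            self._buf[self._len:end] = values
            self._len = end

        def to_array(self):
            # trim the unused tail
            return self._buf[:self._len].copy()

For example, with the first sub-bloc of values from your file:

    acc = GrowableArray(dtype=np.int32)
    acc.extend([12, 47, 2, 46, 3, 51])
    arr = acc.to_array()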
However, it looks like while your files may be huge, they hold a number of arrays, so each individual array may not be large enough to bother with any of this.

2) Parsing and converting overhead: for the most part, Python/numpy text-file-reading code reads the text into a Python string, converts it to Python number objects, and then puts those in a list or converts them to native numbers in an array. This whole process is a bit slow (though reading files is slow anyway, so it's usually not worth worrying about, which is why the built-in file reading methods do it this way). To improve on it, you need code that reads the file and parses it in C, and puts the values straight into a numpy array without passing through Python. This is what the pandas (and, I assume, astropy) text file readers do. But if you don't want those dependencies, there is the fromfile() function in numpy -- it is not very robust, but if your files are well formed, it is quite fast. So your code would look something like:

    with open(the_filename) as infile:
        while True:
            line = infile.readline()
            if not line:
                break
            # work with line to figure out the next block
            if ready_to_read_a_block:
                arr = np.fromfile(infile, dtype=np.int32, count=num_values, sep=' ')
                # sep specifies that you are reading text, not binary!
                arr.shape = the_shape_it_should_be
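For your sub-bloc format (the sub-bloc number, then a count, then that many values), the per-bloc read might look something like this (untested sketch; read_sub_bloc is just a made-up name, and it assumes the file is already positioned at the start of a sub-bloc):

    import numpy as np

    def read_sub_bloc(infile):
        # header is two ints: the sub-bloc number and how many values follow
        bloc_num, num_values = np.fromfile(infile, dtype=np.int32, count=2, sep=' ')
        # sep=' ' treats any whitespace, including newlines, as a separator
        values = np.fromfile(infile, dtype=np.int32, count=num_values, sep=' ')
        return bloc_num, values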
But Robert is right -- get it to work with the "usual" methods -- i.e. put the numbers in a list, then make an array out of it -- first, and then worry about making it faster.

-CHB

On Thu, Jul 6, 2017 at 1:49 AM, <paul.carr...@free.fr> wrote:

> Dear All,
>
> First of all, thanks for the answers and the information (I'll dig into it). Let me try to add some comments on what I want to do:
>
> 1. My ASCII file mainly contains data (float and int) in a single column
> 2. (it is not always the case, but I can easily manage it -- as well, I saw I can use the 'split' instruction if necessary)
> 3. Comments/text indicate the beginning of a bloc, immediately followed by the number of sub-blocs
> 4. So I need to read/record all the values in order to build a matrix before working on it (using Numpy & vectorization)
>
> - Columns 2 and 3 have been added for further treatments
> - The '0' values will be specifically treated afterward
>
> Numpy won't be a problem, I guess (I did some basic tests and I'm quite confident about how to proceed), but I'm really blocked on recording the data ... I'm trying to find a way to efficiently read and record the data in a matrix:
>
> - avoiding dynamic memory allocation (here using 'append' in the Python sense, not np),
> - dealing with huge ASCII files: the latest file I got contains more than *60 million lines*
>
> Please find in attachment an extract of the input format ('example_of_input'), and the matrix I'm trying to create and manage with Numpy.
>
> Thanks again for your time
>
> Paul
>
> #######################################
>
> ##BEGIN *-> line number x in the original file*
> 42 *-> indicates the number of sub-blocs*
> 1 *-> number of the 1st sub-bloc*
> 6 *-> gives how many values belong to the sub-bloc*
> 12
> 47
> 2
> 46
> 3
> 51
> ….
> 13 *-> another type of sub-bloc, with 25 values*
> 25
> 15
> 88
> 21
> 42
> 22
> 76
> 19
> 89
> 0
> 18
> 80
> 23
> 38
> 24
> 73
> 20
> 81
> 0
> 90
> 0
> 41
> 0
> 39
> 0
> 77
> …
> 42 *-> another type of sub-bloc, with 2 values*
> 2
> 115
> 109
>
> #######################################
>
> *The matrix result*
>
> 1 0 0 6 12 47 2 46 3 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 2 0 0 6 3 50 11 70 12 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 3 0 0 8 11 50 3 49 4 54 5 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 4 0 0 8 12 70 11 66 9 65 10 68 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 5 0 0 8 2 47 12 68 10 44 1 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 6 0 0 8 5 56 6 58 7 61 11 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 7 0 0 8 11 61 7 60 8 63 9 66 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 8 0 0 19 12 47 2 46 3 51 0 13 97 14 92 15 96 0 72 0 48 0 52 0 0 0 0 0 0
> 9 0 0 19 13 97 14 92 15 96 0 16 86 17 82 18 85 0 95 0 91 0 90 0 0 0 0 0 0
> 10 0 0 19 3 50 11 70 12 51 0 15 89 19 94 13 96 0 52 0 71 0 72 0 0 0 0 0 0
> 11 0 0 19 15 89 19 94 13 96 0 18 81 20 84 16 85 0 90 0 77 0 95 0 0 0 0 0 0
> 12 0 0 25 3 49 4 54 5 57 11 50 0 15 88 21 42 22 76 19 89 0 52 0 53 0 55 0 71
> 13 0 0 25 15 88 21 42 22 76 19 89 0 18 80 23 38 24 73 20 81 0 90 0 41 0 39 0 77
> 14 0 0 25 11 66 9 65 10 68 12 70 0 19 78 25 99 26 98 13 94 0 71 0 67 0 69 0 72
> ….
>
> #######################################
>
> *An example of the code I started to write*
>
> # -*- coding: utf-8 -*-
> import time, sys, os, re
> import itertools
> import numpy as np
>
> PATH = str(os.path.abspath(''))
>
> input_file_name = '/example_of_input.txt'
>
> ## check if the file exists, then if it's empty or not
> if os.path.isfile(PATH + input_file_name):
>     if os.stat(PATH + input_file_name).st_size > 0:
>
>         ## go through the file in order to find specific sentences
>         ## specific blocks will be defined afterward
>         Block_position = []
>         j = 0
>         with open(PATH + input_file_name, "r") as data:
>             for line in data:
>                 if '##BEGIN' in line:
>                     Block_position.append(j)
>                 j = j + 1
>
>         ## just tests to get all the values
>         # i = 0
>         # data = np.zeros( (505), dtype=np.int )
>         # with open(PATH + input_file_name, "r") as f:
>         #     for i in range(0, 505):
>         #         data[i] = int(f.read(Block_position[0]+1+i))
>         #         print ("i = ", i)
>         #     for line in itertools.islice(f, Block_position[0], 516):
>         #         data[i] = f.read(0+i)
>         #         i = i + 1
>
>     else:
>         print "The file %s is empty : post-processing cannot be performed !!!\n" % input_file_name
> else:
>     print "Error : the file %s does not exist: post-processing stops !!!\n" % input_file_name

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion