On 7 Jul 2017, at 1:59 am, Chris Barker <chris.bar...@noaa.gov> wrote:
>
> On Thu, Jul 6, 2017 at 10:55 AM, <paul.carr...@free.fr> wrote:
>> It's just a reflection, but for huge files one solution might be to
>> first split/write/build the array in a dedicated file (2x O(n)
>> iterations - one to identify the block sizes - an additional one to get
>> and write), and then to load it in memory and work with numpy -
>
> I may have your use case confused, but if you have a huge file with
> multiple "blocks" in it, there shouldn't be any problem with loading it
> in one go -- start at the top of the file and load one block at a time
> (accumulating in a list) -- then you only have the memory overhead issues
> for one block at a time, should be no problem.
>
>> at this stage the dimension is known and some packages will be faster
>> and better adapted (pandas or astropy as suggested).
>
> pandas at least is designed to read variations of CSV files, not sure
> whether you could use the optimized part to read an array out of part of
> an open file from a particular point or not.
>
The fragmented structure would indeed probably be the biggest challenge,
although astropy, while it cannot read from an open file handle, should at
least be able to directly parse a block of input lines, e.g. collected
with readline() in a list. Guess pandas could do the same.

Alternatively, the line positions of the blocks could be passed directly
to the data_start and data_end keywords, but that would require opening
and at least partially reading the file multiple times.

In fact, if the blocks are relatively small, the overhead may be too large
to make it worth using the faster parsers - if you look at the timing
notebooks I had linked to earlier, it takes at least ~100 input lines
before they show any speed gains over genfromtxt, and ~1000 to see roughly
linear scaling. In that case writing your own customised reader could be
the best option after all.
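
A minimal sketch of the one-block-at-a-time idea, for illustration only.
It assumes the blocks are separated by blank lines (the thread does not say
how they are actually delimited), and the names iter_blocks and load_blocks
are made up for this example. Each block is collected as a list of lines
and handed to np.loadtxt, which accepts a list of strings; the same list
could presumably be passed to one of the faster parsers instead.

```python
import io

import numpy as np


def iter_blocks(fh):
    """Yield each block of an open text file as a list of lines.

    Assumption: blocks are separated by one or more blank lines.
    """
    block = []
    for line in fh:
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current block.
            yield block
            block = []
    if block:
        # Last block when the file does not end with a blank line.
        yield block


def load_blocks(fh):
    """Parse every block into its own 2-D array.

    Only the text of one block is held in memory at a time; the parsed
    arrays are accumulated in a list.
    """
    return [np.loadtxt(block, ndmin=2) for block in iter_blocks(fh)]


# Hypothetical sample data standing in for a large multi-block file.
sample = io.StringIO("1 2 3\n4 5 6\n\n7 8\n9 10\n11 12\n")
arrays = load_blocks(sample)
shapes = [a.shape for a in arrays]
```

With a real file, `sample` would be replaced by an open file handle, e.g.
inside `with open(path) as fh:`. For small blocks this stays close to the
customised-reader option, since the per-call overhead of the parser is paid
once per block.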

Cheers,
Derek

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion