On 7 Jul 2017, at 1:59 am, Chris Barker <chris.bar...@noaa.gov> wrote:
> 
> On Thu, Jul 6, 2017 at 10:55 AM,  <paul.carr...@free.fr> wrote:
> It's just a thought, but for huge files one solution might be to 
> split/write/build the array first in a dedicated file (2x O(n) iterations - 
> one to identify the block sizes - an additional one to read and write), and 
> then to load it in memory and work with numpy - 
> 
> 
> I may have your use case confused, but if you have a huge file with multiple 
> "blocks" in it, there shouldn't be any problem with loading it in one go -- 
> start at the top of the file and load one block at a time (accumulating in a 
> list) -- then you only have the memory overhead issues for one block at a 
> time, should be no problem.
> 
> at this stage the dimensions are known and some packages will be faster and 
> better suited (pandas or astropy as suggested).
> 
> pandas at least is designed to read variations of CSV files, not sure you 
> could use the optimized part to read an array out of part of an open file 
> from a particular point or not.
> 
The fragmented structure indeed would probably be the biggest challenge,
although astropy, while it cannot read from an open file handle, at least
should be able to directly parse a block of input lines, e.g. collected with
readline() in a list. Guess pandas could do the same.
Alternatively the line positions of the blocks could be directly passed to
the data_start and data_end keywords, but that would require opening and at
least partially reading the file multiple times. In fact, if the blocks are
relatively small, the overhead may be too large to make it worth using the
faster parsers - if you look at the timing notebooks I had linked to earlier,
it takes at least ~100 input lines before they show any speed gains over
genfromtxt, and ~1000 to see roughly linear scaling. In that case writing
your own customised reader could be the best option after all.
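
As a rough illustration of the "customised reader" idea, here is a minimal
sketch of a single-pass reader that collects each block as a list of lines.
It assumes (this is not from the thread) that blocks are separated by blank
lines and hold whitespace-separated numbers; iter_blocks and parse_block are
hypothetical helper names, and the StringIO object stands in for an open file.

```python
import io


def iter_blocks(handle):
    """Yield each run of consecutive non-blank lines as a list of strings.

    Assumption: blocks are separated by one or more blank lines.
    """
    block = []
    for line in handle:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            # A blank line closes the current block.
            yield block
            block = []
    if block:
        # The last block may not be followed by a blank line.
        yield block


def parse_block(lines):
    """Convert whitespace-separated numeric lines to a list of float rows."""
    return [[float(field) for field in line.split()] for line in lines]


# Stand-in for an open file containing two blocks of different shapes.
sample = io.StringIO("1 2 3\n4 5 6\n\n7 8\n9 10\n11 12\n")

# Only one block of raw lines is held at a time while parsing.
blocks = [parse_block(lines) for lines in iter_blocks(sample)]
print([(len(rows), len(rows[0])) for rows in blocks])
```

Each list yielded by iter_blocks could also be handed to np.genfromtxt, which
accepts a list of strings as input, in place of the hand-written parse_block;
whether that pays off depends on the block size, as noted above.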

Cheers,
                                        Derek
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion