[Numpy-discussion] Trying to read 500M txt file using numpy.genfromtxt within ipython shell
Dear all,

I received a file from someone which contains ~30 million lines and is ~500 MB in size. I tried reading it with numpy.genfromtxt in the ipython interactive shell, and ipython crashed. The columns are lon,lat,year,area_burned, with the year ranging from 1001 to 2006. In the end I want to write the data out to netcdf files split by year and feed them into the model. I guess I need a better way to do this? Any idea is highly appreciated.

lon,lat,year,area_burned
-180.0,65.0,1001,0
-180.0,65.0,1002,0
-180.0,65.0,1003,0
-180.0,65.0,1004,0
-180.0,65.0,1005,0
-180.0,65.0,1006,0
-180.0,65.0,1007,0

thanks and cheers,

Chao

--
***
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax: 01.69.08.77.16

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Trying to read 500M txt file using numpy.genfromtxt within ipython shell
Hi,

I think writing a Python script that converts your txt file to one netcdf file, reading the txt file one line at a time, and then using the netcdf file normally would be a good solution!

Best,
David

Excerpts from Chao YUE's message of Tue., March 20, 13:33:56 +0100 2012:
> Dear all, I received a file from others which contains ~30 million lines and in size of ~500M. I try read it with numpy.genfromtxt in ipython interactive mode. Then ipython crashed. [...]
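[A minimal sketch of the line-at-a-time reading David suggests, assuming the four-column layout shown in the original message. The function name and the per-record tuple layout are illustrative, not from the thread; the caller would write each record, or buffered batches of records, into the netCDF file (e.g. with the netCDF4 package) instead of collecting them.]

```python
def iter_records(path):
    # Yield one parsed (lon, lat, year, area_burned) record at a time,
    # so the full ~500 MB file never has to be held in memory at once.
    with open(path) as f:
        next(f)  # skip the 'lon,lat,year,area_burned' header line
        for line in f:
            lon, lat, year, area = line.strip().split(',')
            yield float(lon), float(lat), int(year), float(area)
```

A generator like this keeps peak memory at one line regardless of file size; the cost is that the file can only be traversed forward, which is fine for a one-pass conversion to netCDF.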
Re: [Numpy-discussion] Trying to read 500M txt file using numpy.genfromtxt within ipython shell
Agreed, thanks! I used gawk to split the file into many files by year; that way it will be easier to handle. Anyway, it's not good practice to produce such huge txt files.

Chao

2012/3/20 David Froger david.fro...@gmail.com
> Hi, I think writing a Python script that convert your txt file to one netcdf file, reading the txt file one line at a time, and then use the netcdf file normally would be a good solution! [...]
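[The same per-year split Chao did with gawk can be sketched in Python; the function name and `burned_` filename prefix here are hypothetical, not from the thread.]

```python
import csv

def split_by_year(path, prefix='burned_'):
    # Stream the big CSV once, appending each row to a per-year file;
    # the header line is repeated at the top of every output file.
    handles = {}  # year string -> open file object
    with open(path) as f:
        reader = csv.reader(f)
        header = next(reader)  # 'lon,lat,year,area_burned'
        for row in reader:
            year = row[2]
            if year not in handles:
                handles[year] = open('%s%s.csv' % (prefix, year), 'w')
                handles[year].write(','.join(header) + '\n')
            handles[year].write(','.join(row) + '\n')
    for h in handles.values():
        h.close()
```

One caveat for this data: years 1001-2006 mean roughly a thousand simultaneously open files, which can bump into the OS file-descriptor limit (gawk has the same issue); if that happens, the handles would need to be opened in append mode on demand and closed eagerly.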
Re: [Numpy-discussion] Trying to read 500M txt file using numpy.genfromtxt within ipython shell
On 20 Mar 2012, at 14:40, Chao YUE wrote:

> I would be in agree. thanks! I use gawk to separate the file into many files by year, then it would be easier to handle. anyway, it's not a good practice to produce such huge line txt files

Indeed it's not, but it's also not good practice to load the entire content of text files into memory as Python lists, as unfortunately all the numpy readers are still doing. But this has been discussed on this list and improvements are under way. For your problem at hand, the textreader Warren Weckesser recently made known - can't find the post right now, but you can find it at

https://github.com/WarrenWeckesser/textreader

might be helpful. It is still under construction, but for a plain csv file such as yours it should be working already. And since the text parsing is implemented in C, it should also give you a huge speedup for your 1/2 GB!

Additionally, similar to what David suggested, it would certainly be a good idea to read in smaller chunks of the file and write them directly to the netCDF file. Note that you can already read single lines at a time with the likes of

from StringIO import StringIO
f = open('file.txt', 'r')
np.genfromtxt(StringIO(f.next()), delimiter=',')

but I don't think it would work this way with textreader, and iterating such a small loop over lines in Python would defeat the point of using a fast reader. As your actual data would be ~1 GB in numpy, memory usage with textreader should also not be critical yet.

Cheers,
Derek

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
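[A rough sketch of the chunked reading Derek describes, using plain np.genfromtxt rather than textreader; the function name and the default chunk size are made up for illustration. Each yielded array could be appended to the netCDF file before the next chunk is parsed.]

```python
from io import StringIO
from itertools import islice
import numpy as np

def iter_chunks(path, chunksize=100000):
    # Parse the CSV in blocks of `chunksize` lines, so at most one
    # block's worth of parsed data sits in memory at any time.
    with open(path) as f:
        next(f)  # skip the header line
        while True:
            block = ''.join(islice(f, chunksize))
            if not block:
                break
            # atleast_2d keeps the shape (n, 4) even for a 1-line chunk,
            # where genfromtxt would otherwise return a 1-D array
            yield np.atleast_2d(np.genfromtxt(StringIO(block), delimiter=','))
```

This amortizes genfromtxt's per-call overhead over many lines, avoiding the tiny-loop problem Derek mentions, while still bounding memory use by the chunk size rather than the file size.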