[Pytables-users] Pytables file reading
Hello all,

I'm managing a file close to 26 GB in size. Its main structure is a table with a bit more than 8 million rows. The table has four columns: the first two store names, the third has a 53-item array in each cell, and the last has a 133x6 matrix in each cell.

I usually work on a Linux workstation with 24 GB of memory. My usual way of working with the file is to retrieve, from each cell in the 4th column of the table, the same row of the 133x6 matrix. I store the information in a NumPy array with shape 8e6x6. In this process I use almost the whole workstation memory.

Is there any way to optimize the memory usage? If not, I have been thinking about splitting the file.

Thank you,

Juanma

___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
Re: [Pytables-users] Pytables file reading
Hi Juan Manuel,

On 04/08/2012 01:55, Juan Manuel Vázquez Tovar wrote:
> Is there any way to optimize the memory usage?

I'm not sure I understand. My impression is that you do not actually need to have the entire 8e6x6 matrix in memory at once, is that correct?

In that case you could simply try to load less data at a time, using something like

    data = table.read(0, 50000, field='name of the 4-th field')
    process(data)
    data = table.read(50000, 100000, field='name of the 4-th field')
    process(data)

See also [1] and [2].

Does that make sense to you?

[1] http://pytables.github.com/usersguide/libref.html#table-methods-reading
[2] http://pytables.github.com/usersguide/libref.html#tables.Table.read

cheers

--
Antonio Valentino
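A minimal sketch of the chunked read suggested above, extended to loop over the whole table and keep only one matrix row per cell. The function name `extract_matrix_row`, the chunk size, and the `FakeTable` class are assumptions made for illustration; `FakeTable` only mimics the two members of a PyTables `Table` that the function uses (`nrows` and `read(start, stop, field=...)`) so that the example runs without a 26 GB file. With float64 data, a chunk of 10,000 cells of shape 133x6 would occupy roughly 64 MB.

```python
import numpy as np


def extract_matrix_row(table, field, i, chunk=10_000):
    """Collect row `i` of the matrix stored in `field` for every table row.

    Only `chunk` full matrices are held in memory at any time.
    Returns None for an empty table.
    """
    nrows = int(table.nrows)
    out = None
    for start in range(0, nrows, chunk):
        stop = min(start + chunk, nrows)
        # Shape (stop - start, 133, 6) for the table described in the thread.
        block = table.read(start, stop, field=field)
        if out is None:
            # Preallocate one contiguous result array, e.g. (nrows, 6).
            out = np.empty((nrows,) + block.shape[2:], dtype=block.dtype)
        out[start:stop] = block[:, i]
    return out


class FakeTable:
    """Hypothetical stand-in for tables.Table, for demonstration only."""

    def __init__(self, records):
        self._records = records
        self.nrows = len(records)

    def read(self, start, stop, field=None):
        return self._records[start:stop][field]


# Tiny demo data: 10 rows, each with a 133x6 matrix filled with the row index.
records = np.zeros(10, dtype=[("name", "S16"), ("loads", "f8", (133, 6))])
records["loads"] = np.arange(10)[:, None, None]

result = extract_matrix_row(FakeTable(records), "loads", i=5, chunk=4)
print(result.shape)
```

With a real file, the same function would be called with the `Table` node itself in place of `FakeTable(records)`; the field name "loads" is the one Juanma mentions later in the thread.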
Re: [Pytables-users] Pytables file reading
Hi Antonio,

You are right, I don't need to load the entire table into memory. The fourth column has multidimensional cells, and when I read a single row from every cell in the column, I almost fill the workstation memory. I didn't expect that process to use so much memory, but it does. Maybe I didn't explain it very well last time.

Thank you,

Juanma
Re: [Pytables-users] Pytables file reading
Hi Juan Manuel,

Sorry, I still don't understand. Can you please post a short code snippet that shows exactly how you read the data into your program?

My impression is that somewhere you use an instruction that triggers loading of unnecessary data into memory.

--
Antonio Valentino
Re: [Pytables-users] Pytables file reading
Hi Antonio,

This is the piece of code I use to read the part of the table I need:

    data = [case['loads'][i] for case in table]

where i is the index of the row that I need to read from the matrix (133x6) stored in each cell of the column "loads".

Juanma
Re: [Pytables-users] Pytables file reading
Hi Juan Manuel,

On 05/08/2012 22:52, Juan Manuel Vázquez Tovar wrote:
> data = [case['loads'][i] for case in table]

That looks perfectly fine to me. I have no idea what the issue could be :/

You can perform partial reads using Table.iterrows:

    data = [case['loads'][i] for case in table.iterrows(start, stop)]

Please also consider that using a single np.array with 8e6 rows instead of a list of arrays will save you the memory overhead of 8e6 array objects. Considering that 6 doubles are 48 bytes while an empty np.array takes 80 bytes

    In [64]: sys.getsizeof(np.zeros((0,)))
    Out[64]: 80

you should be able to reduce the memory footprint by well over half.

cheers

--
Antonio Valentino
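The arithmetic behind that estimate can be sketched as follows. This is a rough back-of-the-envelope calculation, not a measurement: it assumes float64 data and the approximate row count given in the thread, counts one list slot (a pointer) per element, and ignores allocator overhead. The per-array header size is whatever `sys.getsizeof` reports on the machine running it; the thread shows 80 bytes, and other NumPy versions report different values.

```python
import struct
import sys

import numpy as np

n_rows = 8_000_000                               # approximate, from the thread
row_bytes = 6 * np.dtype(np.float64).itemsize    # 6 doubles = 48 bytes

# One contiguous (n_rows, 6) array: payload only, plus a single small header.
contiguous_bytes = n_rows * row_bytes

# A list of n_rows small arrays: every element also carries an ndarray
# object header and occupies one pointer-sized slot in the list.
header_bytes = sys.getsizeof(np.zeros((0,)))     # version and platform dependent
pointer_bytes = struct.calcsize("P")
list_bytes = n_rows * (row_bytes + header_bytes + pointer_bytes)

print(f"contiguous array: {contiguous_bytes / 1e6:.0f} MB")
print(f"list of arrays:   {list_bytes / 1e6:.0f} MB (estimate)")
```

Under these assumptions the contiguous array needs 384 MB, which is far below the 24 GB of the workstation, so the object overhead alone would not seem to account for the memory use reported.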
Re: [Pytables-users] Pytables file reading
Thank you, Antonio. I will try that.

Cheers,

Juanma
Re: [Pytables-users] Pytables file reading
Hi Antonio,

One last question about this: from PyTables' point of view, and based on your experience, is it better to manage a table with 3 million rows and multidimensional cells, or a table with 300 million rows and plain cells?

Thank you,

Juanma