Re: Memory explodes loading CSV into hash
Stas Bekman wrote:
> Ideally when such a situation happens, and you must load all the data
> into memory, which is in short supply, your best bet is to rewrite the
> data storage layer in XS/C and use a tie interface to make it
> transparent to your Perl code. So you will still use the hash, but the
> refs to arrays will actually be C arrays.

Sorry, I'm not familiar with C(hinese) - but if someone could develop an
XS/Pascal interface ;-))

> > Another thing I found is that Apache::Status does not always seem to
> > report complete values. Therefore I recorded the sizes from top, too.
>
> Were you running a single process? If you aren't, Apache::Status could
> have shown you a different process.

Running httpd -X shows the same results.

I will use the named %index structure for now. Thanks to modular OO Perl
I can re-code my data package later, if the memory explosion hits me
again ;-))

Ernest

--
Ernest Lergon, VIRTUALITAS Inc. - http://www.virtualitas.net
PGP-Key: http://www.virtualitas.net/Ernest_Lergon.asc
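For illustration, a pure-Perl sketch of the tie idea Stas describes (the
class name and the packed-string storage are made up; a real XS version
would keep actual C arrays behind the same interface):

    package PackedRows;
    # Pure-Perl stand-in for an XS storage layer: each record lives as
    # one tab-joined string and becomes a Perl array only in FETCH.
    use strict;

    sub TIEHASH  { bless { rows => {} }, shift }
    sub STORE    { my ($s, $k, $aref) = @_;
                   $s->{rows}{$k} = join "\t", @$aref }
    sub FETCH    { my ($s, $k) = @_;
                   defined $s->{rows}{$k}
                       ? [ split /\t/, $s->{rows}{$k} ] : undef }
    sub EXISTS   { exists $_[0]->{rows}{$_[1]} }
    sub DELETE   { delete $_[0]->{rows}{$_[1]} }
    sub CLEAR    { %{ $_[0]->{rows} } = () }
    sub FIRSTKEY { my $s = shift;
                   keys %{ $s->{rows} };    # reset iterator
                   each %{ $s->{rows} } }
    sub NEXTKEY  { each %{ $_[0]->{rows} } }

    package main;
    tie my %data, 'PackedRows';
    $data{42} = [ 42, 'r1v1', 'r1v2' ];    # stored as one string
    my @record = @{ $data{42} };           # split happens only here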
Re: Memory explodes loading CSV into hash
Hi Stas,

having a look at Apache::Status and playing around with your tips on
http://www.apacheweek.com/features/mod_perl11 I found some interesting
results and a compromise solution:

In a module I load a CSV file as class data into different structures
and compared the output of Apache::Status with top. Enclosed you'll
find a test report. The code below 'building' shows how the lines are
put into the structures. The lines below 'perl-status' show the output
of Apache::Status. The line below 'top' shows the output of top.

Examples for the tested structures are:

  $buffer = "1\tr1v1\tr1v2\tr1v3\n2\tr2v1\tr2v2\tr2v3\n" ...

  @lines = ( "1\tr1v1\tr1v2\tr1v3",
             "2\tr2v1\tr2v2\tr2v3", ... )

  %data = ( 1 => [ 1, 'r1v1', 'r1v2', 'r1v3' ],
            2 => [ 2, 'r2v1', 'r2v2', 'r2v3' ], ... )

  $pack = { 1 => [ 1, 'r1v1', 'r1v2', 'r1v3' ],
            2 => [ 2, 'r2v1', 'r2v2', 'r2v3' ], ... }

  %index = ( 1 => "1\tr1v1\tr1v2\tr1v3",
             2 => "2\tr2v1\tr2v2\tr2v3", ... )

One thing I realized using Devel::Peek is that in a hash of array refs,
each item in the array carries the full-blown Perl flags etc. That
seems to be the reason for the 'memory explosion'.

Another thing I found is that Apache::Status does not always seem to
report complete values. Therefore I recorded the sizes from top, too.
Especially for the hash of array refs (%data) and the hash ref of array
refs ($pack), perl-status reports only a part of the used memory: for
$pack only the pointer (16 bytes), for %data only the keys (?).

As a compromise I'll use the %index structure. It is small enough while
still providing fast access. A further optimization will be to remove
the redundant key field from each line.

Success: a reduction from 26 MB to 7 MB - what I estimated in my first
mail.

A last word from perldebguts.pod:

|| Perl is a profligate wastrel when it comes to memory use. There is a
|| saying that to estimate memory usage of Perl, assume a reasonable
|| algorithm for memory allocation, multiply that estimate by 10, and
|| while you still may miss the mark, at least you won't be quite so
|| astonished. This is not absolutely true, but may provide a good
|| grasp of what happens.
||
|| [...]
||
|| Anecdotal estimates of source-to-compiled code bloat suggest an
|| eightfold increase.

Perhaps my experiences could be added to the long row of anecdotes ;-))

Thank you all again for escorting me on this deep dive.

Ernest

TEST REPORT
===========

CSV file: 14350 records
  CSV     2151045 bytes = 2101 KBytes
  CSV_2   2136695 bytes = 2086 KBytes (w/o CR)

1) all empty
------------
building:     none
perl-status:  *buffer{SCALAR}        25 bytes
              *lines{ARRAY}          56 bytes
              *data{HASH}           228 bytes
              *pack{SCALAR}          16 bytes
              *index{HASH}          228 bytes
top:          12992 12M 12844   base

2) buffer
---------
building:     $buffer .= $_ . "\n";
perl-status:  *buffer{SCALAR}   2151069 bytes = CSV + 24 bytes
              *lines{ARRAY}          56 bytes
              *data{HASH}           228 bytes
              *pack{SCALAR}          16 bytes
              *index{HASH}          228 bytes
top:          17200 16M 17040   base + 4208 KBytes = CSV + 2107 KBytes

3) lines
--------
building:     push @lines, $_;
perl-status:  *buffer{SCALAR}        25 bytes
              *lines{ARRAY}     2519860 bytes = CSV_2 + 383165 bytes
                                                (approx. 27 * 14350)
              *data{HASH}           228 bytes
              *pack{SCALAR}          16 bytes
              *index{HASH}          228 bytes
top:          18220 17M 18076   base + 5228 KBytes = CSV_2 + 3142 KBytes

4) data
-------
building:     @record = split( "\t", $_ );
              $key = 0 + $record[0];
              $data{$key} = [ @record ];
perl-status:  *buffer{SCALAR}        25 bytes
              *lines{ARRAY}          56 bytes
              *data{HASH}        723302 bytes = approx. 50 * 14350
                                                (key + ref - where is
                                                the data?)
              *pack{SCALAR}          16 bytes
              *index{HASH}          228 bytes
top:          40488 38M 39208   base + 27566 KBytes = CSV_2 + 25480 KBytes
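For illustration, a sketch of how a find over the %index structure might
look once the redundant key field is stripped from each line (method and
variable names here are assumptions, not from the thread):

    # %index maps the numeric key to the raw tab-separated line.
    # A record is inflated to an array ref only on demand.
    sub find {
        my ( $self, $key ) = @_;
        my $line = $index{$key};
        return undef unless defined $line;
        # the key field was removed from the stored line, so put it back
        return [ $key, split( "\t", $line ) ];
    }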
Re: Memory explodes loading CSV into hash
Ernest Lergon wrote:
> having a look at Apache::Status and playing around with your tips on
> http://www.apacheweek.com/features/mod_perl11 I found some interesting
> results and a compromise solution:

Glad to hear that Apache::Status was of help to you.

Ideally when such a situation happens, and you must load all the data
into memory, which is in short supply, your best bet is to rewrite the
data storage layer in XS/C and use a tie interface to make it
transparent to your Perl code. So you will still use the hash, but the
refs to arrays will actually be C arrays.

> Another thing I found is that Apache::Status does not always seem to
> report complete values. Therefore I recorded the sizes from top, too.

Were you running a single process? If you aren't, Apache::Status could
have shown you a different process.

Also you can use GTop, if you have libgtop on your system, which gives
you a Perl interface to the proc's memory usage. See the guide for many
examples.

> Success: a reduction from 26 MB to 7 MB - what I estimated in my first
> mail.

:)

--
Stas Bekman -- JAm_pH, Just Another mod_perl Hacker
http://stason.org/  mod_perl Guide: http://perl.apache.org
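For instance, a minimal snippet along the lines of the guide's examples
(the GTop calls are as documented; the before/after framing is just
illustration):

    use GTop ();

    my $gtop   = GTop->new;
    my $before = $gtop->proc_mem($$)->size;    # process size in bytes

    # ... load the CSV into the hash here ...

    my $after = $gtop->proc_mem($$)->size;
    printf "loading grew the process by %d KBytes\n",
           ( $after - $before ) / 1024;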
Re: Memory explodes loading CSV into hash
Kee Hinckley wrote:
> > At 17:18 28.04.2002, Ernest Lergon wrote:
> > > Now I'm scared about the memory consumption: The CSV file has
> > > 14.000 records with 18 fields and a size of 2 MB (approx. 150
> > > Bytes per record).
> >
> > Now a question I would like to ask: do you *need* to read the whole
> > CSV info into memory? There are ways to overcome this. For example,
> > looking at
>
> When I have a csv to play with and it's not up to being transferred
> to a real database, I use the DBD::CSV module, which puts a nice sql
> wrapper around it.

I've installed DBD::CSV and tested it with my data:

  $dbh = DBI->connect("DBI:CSV:csv_sep_char=\t;csv_eol=\n;csv_escape_char=");
  $dbh->{'csv_tables'}->{'foo'} = { 'file' => 'foo.data' };

3 MB memory used.

  $sth = $dbh->prepare("SELECT * FROM foo");

3 MB memory used.

  $sth->execute();

16 MB memory used!

If I do it record by record like

  $sth = $dbh->prepare("SELECT * FROM foo WHERE id=?");

then memory usage will grow query by query due to caching. Moreover it
becomes VERY slow, because every query reads the whole file again; an
index can't be created/used.

No win :-(

Ernest
Re: Memory explodes loading CSV into hash
Perrin Harkins wrote:
> > $foo->{$i} = [ @record ];
>
> You're creating 14000 arrays, and references to them (refs take up
> space too!). That's where the memory is going. See if you can use a
> more efficient data structure. For example, it takes less space to
> make 4 arrays with 14000 entries in each than to make 14000 arrays
> with 4 entries each.

So I turned it around: $col now holds 18 arrays with 14000 entries each
and prints the correct results:

  #!/usr/bin/perl -w

  $col = {};

  $line  = "AAAA\tBBBB\tCCCC\tDDDD";  # 4 string fields (4 chars)
  $line .= "\t10.99" x 9;             # 9 float fields (5 chars)
  $line .= "\t" . 'A' x 17;           # 5 string fields (rest)
  $line .= "\t" . 'B' x 17;           #
  $line .= "\t" . 'C' x 17;           #
  $line .= "\t" . 'D' x 17;           #
  $line .= "\t" . 'E' x 17;           #

  @record = split "\t", $line;

  foreach $j ( 0 .. $#record ) {
      $col->{$j} = [];
  }

  for ( $i = 0; $i < 14000; $i++ ) {
      map { $_++ } @record;
      foreach $j ( 0 .. $#record ) {
          push @{ $col->{$j} }, $record[$j];
      }
      print "$i\t$col->{0}->[$i],$col->{5}->[$i]\n" unless $i % 1000;
  }

  <STDIN>;    # wait, so the size can be read from top
  1;

and gives:

  SIZE   RSS  SHARE
  12364  12M   1044

Wow, 2 MB saved ;-))

I think a reference is a pointer of 8 bytes, so 14.000 * 8 = approx.
112 KBytes - right? This doesn't explain the difference of 7 MB
calculated and 14 MB measured.

Ernest
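If a whole record is ever needed again, it can be reassembled on the
fly; a hypothetical accessor for this column-oriented layout (the name
and return shape are made up for illustration):

    # Rebuild record $i from the 18 column arrays in $col.
    sub record {
        my ( $col, $i ) = @_;
        return [ map { $col->{$_}[$i] } sort { $a <=> $b } keys %$col ];
    }

    my $rec = record( $col, 4711 );    # [ field0, field1, ... field17 ]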
Re: Memory explodes loading CSV into hash
Hi,

thank you all for your hints, BUT (with capital letters ;-) I think
it's a question of speed: if I hold my data in a hash in memory, access
should be faster than using any kind of external database.

What makes me wonder is the extremely blown-up size (mod_)perl uses for
data structures.

Ernest
Re: Memory explodes loading CSV into hash
Have you tried DBD::AnyData? It's pure Perl, so it might not be as fast
- but you never know.

--
Simon Oliver
Re: Memory explodes loading CSV into hash
Ernest Lergon wrote:
> So I turned it around: $col now holds 18 arrays with 14000 entries
> each and prints the correct results:
> ...
> and gives:
>   SIZE   RSS  SHARE
>   12364  12M   1044
> Wow, 2 MB saved ;-))

That's pretty good, but obviously not what you were after. I tried
using the pre-sized array syntax ($#array = 14000), but it didn't help
any. Incidentally, that map statement in your script isn't doing
anything that I can see.

> I think a reference is a pointer of 8 bytes, so 14.000 * 8 = approx.
> 112 KBytes - right?

Probably more. Perl data types are complex. They hold a lot of metadata
(is the ref blessed, for example).

> This doesn't explain the difference of 7 MB calculated and 14 MB
> measured.

The explanation of this is that perl uses a lot of memory. For one
thing, it allocates RAM in buckets. When you hit the limit of the
allocated memory, it grabs more, and I believe it grabs an amount in
proportion to what you've already used. That means that as your
structures get bigger, it grabs bigger chunks. The whole 12MB may not
be in use, although perl has reserved it for possible use. (Grabbing
memory byte by byte would be less wasteful, but much too slow.)

The stuff in perldebguts is the best reference on this, and you've
already looked at that. I think your original calculation failed to
account for the fact that the numbers given there for scalars are
minimums (i.e. scalars with something in them won't be that small) and
that you are accessing many of these in more than one way (i.e. as
string, float, and integer), which increases their size.

You can try playing with compile options (your choice of malloc affects
this a little), but at this point it's probably not worth it. There's
nothing wrong with 12MB of shared memory, as long as it stays shared.
If that doesn't work for you, your only choice will be to trade some
speed for reduced memory usage, by using a disk-based structure.

At any rate, mod_perl doesn't seem to be at fault here. It's just a
general perl issue.

- Perrin
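The string/float/integer point is easy to watch with Devel::Peek (a
small demo, not from the thread):

    use Devel::Peek;

    my $x = "10.99";
    Dump($x);          # SV = PV: only the string is stored

    my $y = $x + 0;    # use $x as a number ...
    Dump($x);          # SV = PVNV: the double is now cached alongside
                       # the string, so the scalar got bigger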
Re: Memory explodes loading CSV into hash
Perrin Harkins wrote:
[snip]
> Incidentally, that map statement in your script isn't doing anything
> that I can see.

It simulates different values for each record - e.g.:

  $line = "AAAA\tBBBB\t1000\t10.99";
  @record = split "\t", $line;

  for ( $i = 0; $i < 14000; $i++ ) {
      map { $_++ } @record;
      # $i=0  @record = ( 'AAAB', 'BBBC', 1001, 11.99 );
      # $i=1  @record = ( 'AAAC', 'BBBD', 1002, 12.99 );
      # $i=2  @record = ( 'AAAD', 'BBBE', 1003, 13.99 );
      # etc.
  }

[snip]

Thanks for your explanations about perl's memory usage.

> At any rate, mod_perl doesn't seem to be at fault here. It's just a
> general perl issue.

I think so, too.

Ernest
Re: Memory explodes loading CSV into hash
Ernest Lergon wrote:
> Hi,
> thank you all for your hints, BUT (with capital letters ;-) I think
> it's a question of speed: if I hold my data in a hash in memory,
> access should be faster than using any kind of external database.
>
> What makes me wonder is the extremely blown-up size (mod_)perl uses
> for data structures.

Looks like you've skipped over my suggestion to use Apache::Status. It
uses B::Size and B::TerseSize to show you *exactly* how much memory
each variable, opcode and whatnot uses. No need to guess. You can use
the B:: modules directly, but since you say that outside of mod_perl
the memory usage pattern is different, I'd suggest using Apache::Status.

Stas
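The setup, roughly as the guide describes it (directive names as
documented for Apache::Status; the /perl-status location is the
conventional choice):

    # httpd.conf
    PerlModule B::TerseSize

    <Location /perl-status>
        SetHandler perl-script
        PerlHandler Apache::Status
    </Location>

    # enable the memory-measuring options
    PerlSetVar StatusOptionsAll On
    PerlSetVar StatusTerse On
    PerlSetVar StatusTerseSize On
    PerlSetVar StatusTerseSizeMainSummary On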
Re: Memory explodes loading CSV into hash
Ernest Lergon wrote:
> Hi,
>
> in a mod_perl package I load a CSV file on apache startup into a
> simple hash as read-only class data to be shared by all children.
>
> A loading routine reads the file line by line and uses one numeric
> field as hash entry (error checks etc. omitted):
>
>   package Data;
>
>   my $class_data = {};
>
>   ReadFile( 'data.txt', $class_data, 4 );
>
>   sub ReadFile {
>       my $filename  = shift;    # path string
>       my $data      = shift;    # ref to hash
>       my $ndx_field = shift;    # field number
>
>       my ( @record, $num_key );
>       local $_;
>
>       open( INFILE, $filename );
>       while ( <INFILE> ) {
>           chomp;
>           @record  = split "\t";
>           $num_key = $record[$ndx_field];
>           $data->{$num_key} = [ @record ];
>       }
>       close( INFILE );
>   }
>
> sub new... creates an object for searching the data, last result,
> errors etc.
>
> sub find... method with something like:
>   if exists $class_data->{$key} return... etc.
>
> Now I'm scared about the memory consumption: The CSV file has 14.000
> records with 18 fields and a size of 2 MB (approx. 150 Bytes per
> record).
>
> Omitting the loading, top shows that each httpd instance has 10 MB
> (all shared as it should be). But reading in the file explodes the
> memory to 36 MB (ok, shared as well)!
>
> So, how comes that 2 MB of data need 26 MB of memory, if stored as a
> hash? Reading perldebguts.pod I did not expect such an increase:
>
>   Description                (avg.)    CSV            Perl
>   4 string fields (4 chars)   16 bytes  (32 bytes)   128 bytes
>   9 float fields  (5 chars)   45 bytes  (24 bytes)   216 bytes
>   5 string fields (rest)      89 bytes  (32 bytes)   160 bytes
>   the integer key                       (20 bytes)    20 bytes
>                              ---------              ---------
>                              150 bytes              524 bytes
>
> That will give 14.000 x 524 = approx. 7 MB, but not 26 MB !?
>
> Lost in space...

Use Apache::Status, which can show you exactly where all the bytes go.
See the guide or its manpage for more info. I suggest that you
experiment with a very small data set and look at how much memory each
record takes.

Stas
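For a quick per-structure number outside the server, Devel::Size can
also be used, assuming it is installed (a sketch, not part of the
original post):

    use Devel::Size qw(total_size);

    # after ReadFile() has filled the hash:
    printf "class data: %d bytes\n", total_size($class_data);

    # or one record at a time, to see the per-record overhead:
    my ($key) = keys %$class_data;
    printf "one record: %d bytes\n", total_size( $class_data->{$key} );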
Re: Memory explodes loading CSV into hash
Jeffrey Baker wrote:
> I tried this program in Perl (outside of modperl) and the memory
> consumption is only 4.5MB:
>
>   #!/usr/bin/perl -w
>
>   $foo = {};
>
>   for ( $i = 0; $i < 14000; $i++ ) {
>       $foo->{ sprintf( '%020d', $i ) } = 'A' x 150;
>   }
>
>   <STDIN>;
>   1;
>
> So I suggest something else might be going on causing your memory
> problems.

Hi Jeffrey,

good idea to boil it down. Yes, your prog gave me only:

  SIZE  RSS   SHARE
  4696  4696    964

Running my code snippet outside mod_perl (with real data) still gives:

  SIZE   RSS  SHARE
  14932  14M   1012

A simulation like this:

  #!/usr/bin/perl -w

  $foo = {};

  $line  = "AAAA\tBBBB\tCCCC\tDDDD";  # 4 string fields (4 chars)
  $line .= "\t10.99" x 9;             # 9 float fields (5 chars)
  $line .= "\t" . 'A' x 17;           # 5 string fields (rest)
  $line .= "\t" . 'B' x 17;           #
  $line .= "\t" . 'C' x 17;           #
  $line .= "\t" . 'D' x 17;           #
  $line .= "\t" . 'E' x 17;           #

  @record = split "\t", $line;

  for ( $i = 0; $i < 14000; $i++ ) {
      map { $_++ } @record;
      $foo->{$i} = [ @record ];
      print "$i\t$foo->{$i}->[0],$foo->{$i}->[5]\n" unless $i % 1000;
  }

  <STDIN>;
  1;

prints:

  0      AAAB,11.99
  1000   ABMN,1011.99
  2000   ACYZ,2011.99
  3000   AELL,3011.99
  4000   AFXX,4011.99
  5000   AHKJ,5011.99
  6000   AIWV,6011.99
  7000   AKJH,7011.99
  8000   ALVT,8011.99
  9000   ANIF,9011.99
  10000  AOUR,10011.99
  11000  AQHD,11011.99
  12000  ARTP,12011.99
  13000  ATGB,13011.99

and gives:

  SIZE   RSS  SHARE
  14060  13M   1036

There is no difference between real and random data. But I think there
is an optimization mechanism in perl concerning strings, so your code
needs less memory.

So what is going on? 2 MB -> 14 MB ?

Still lost in space ;-))

Ernest
Re: Memory explodes loading CSV into hash
> $foo->{$i} = [ @record ];

You're creating 14000 arrays, and references to them (refs take up
space too!). That's where the memory is going. See if you can use a
more efficient data structure. For example, it takes less space to make
4 arrays with 14000 entries in each than to make 14000 arrays with 4
entries each. There is some discussion of this in the Advanced Perl
Programming book, and probably some CPAN modules that can help.

- Perrin
Re: Memory explodes loading CSV into hash
At 17:18 28.04.2002, Ernest Lergon wrote:
> Now I'm scared about the memory consumption: The CSV file has 14.000
> records with 18 fields and a size of 2 MB (approx. 150 Bytes per
> record).

Now a question I would like to ask: do you *need* to read the whole CSV
info into memory? There are ways to overcome this. For example, looking
at your data I suppose you might want to look up specific IDs; in that
case it would be much more efficient to read one line at a time and
check if it's the correct one. Otherwise you might want to make the
move to a relational database; this is the kind of thing RDBMSes excel
at.

Just some tips.

--
Per Einar Ellefsen
[EMAIL PROTECTED]
Re: Memory explodes loading CSV into hash
On Sun, 28 Apr 2002, Per Einar Ellefsen wrote:
> At 17:18 28.04.2002, Ernest Lergon wrote:
> > Now I'm scared about the memory consumption: The CSV file has
> > 14.000 records with 18 fields and a size of 2 MB (approx. 150 Bytes
> > per record).
>
> Now a question I would like to ask: do you *need* to read the whole
> CSV info into memory? There are ways to overcome this. For example,
> looking at your data I suppose you might want to look up specific
> IDs; in that case it would be much more efficient to read one line at
> a time and check if it's the correct one. Otherwise you might want to
> make the move to a relational database; this is the kind of thing
> RDBMSes excel at.

You might also want to look at loading your CSV data into an MLDBM
file, and then having your apache processes access it from there. That
way most of your data stays on disk, and you access it in much the same
way as before, through a hash of arrays.

Andrew McNaughton
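In outline it might look like this (a sketch only - the backend choice,
file name and flags below are assumptions, not from the thread):

    use strict;
    use Fcntl qw(O_CREAT O_RDWR);
    use MLDBM qw(DB_File Storable);    # DB_File backend, Storable serializer

    tie my %data, 'MLDBM', 'data.dbm', O_CREAT | O_RDWR, 0640
        or die "tie failed: $!";

    # fill once, e.g. at apache startup or in an offline job:
    $data{42} = [ 42, 'r1v1', 'r1v2', 'r1v3' ];   # serialized to disk

    # in a request, a record is fetched (and deserialized) on demand:
    my @record = @{ $data{42} };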
Re: Re: Memory explodes loading CSV into hash
> At 17:18 28.04.2002, Ernest Lergon wrote:
> > Now I'm scared about the memory consumption: The CSV file has
> > 14.000 records with 18 fields and a size of 2 MB (approx. 150 Bytes
> > per record).
>
> Now a question I would like to ask: do you *need* to read the whole
> CSV info into memory? There are ways to overcome this. For example,
> looking at

When I have a csv to play with and it's not up to being transferred to
a real database, I use the DBD::CSV module, which puts a nice sql
wrapper around it.
Re: Memory explodes loading CSV into hash
On Sun, Apr 28, 2002 at 05:18:24PM +0200, Ernest Lergon wrote:
> Hi,
>
> in a mod_perl package I load a CSV file on apache startup into a
> simple hash as read-only class data to be shared by all children.
>
> A loading routine reads the file line by line and uses one numeric
> field as hash entry (error checks etc. omitted):
>
>   package Data;
>
>   my $class_data = {};
>
>   ReadFile( 'data.txt', $class_data, 4 );
>
>   sub ReadFile {
>       my $filename  = shift;    # path string
>       my $data      = shift;    # ref to hash
>       my $ndx_field = shift;    # field number
>
>       my ( @record, $num_key );
>       local $_;
>
>       open( INFILE, $filename );
>       while ( <INFILE> ) {
>           chomp;
>           @record  = split "\t";
>           $num_key = $record[$ndx_field];
>           $data->{$num_key} = [ @record ];
>       }
>       close( INFILE );
>   }
>
> sub new... creates an object for searching the data, last result,
> errors etc.
>
> sub find... method with something like:
>   if exists $class_data->{$key} return... etc.
>
> Now I'm scared about the memory consumption: The CSV file has 14.000
> records with 18 fields and a size of 2 MB (approx. 150 Bytes per
> record).
>
> Omitting the loading, top shows that each httpd instance has 10 MB
> (all shared as it should be). But reading in the file explodes the
> memory to 36 MB (ok, shared as well)!
>
> So, how comes that 2 MB of data need 26 MB of memory, if stored as a
> hash? Reading perldebguts.pod I did not expect such an increase:
>
>   Description                (avg.)    CSV            Perl
>   4 string fields (4 chars)   16 bytes  (32 bytes)   128 bytes
>   9 float fields  (5 chars)   45 bytes  (24 bytes)   216 bytes
>   5 string fields (rest)      89 bytes  (32 bytes)   160 bytes
>   the integer key                       (20 bytes)    20 bytes
>                              ---------              ---------
>                              150 bytes              524 bytes
>
> That will give 14.000 x 524 = approx. 7 MB, but not 26 MB !?

I tried this program in Perl (outside of modperl) and the memory
consumption is only 4.5MB:

  #!/usr/bin/perl -w

  $foo = {};

  for ( $i = 0; $i < 14000; $i++ ) {
      $foo->{ sprintf( '%020d', $i ) } = 'A' x 150;
  }

  <STDIN>;
  1;

So I suggest something else might be going on causing your memory
problems.

-jwb