Michael et al,
Thanks for your reply. I was wondering: if you can't get any details about
the overall volume of data Perl has in memory, can you find out how much an
individual variable is using? Below is the core loop of my code, which has
to compute the length of every input line just to keep some sort of handle
on memory usage. Counting keys %seen isn't useful, as individual keys could
be huge.

keys %seen = 1e6;
while (<>) {
    next if $seen{$_};
    $seen{$_} = 1;
    #### I'd rather not have to call "length" with every loop
    $hash_keys_size += length;
    ####
    if ($hash_keys_size > $Max_Possible_Hash_Key_Size) {
        push @small_files, open_new_temp_file();
        for my $line (sort keys %seen) {
            print TEMPFILE $line;
        }
        close_temp_file();
        $hash_keys_size = 0;
        %seen = ();  # This is slow in Activestate 5.6 for Win32
    }
}
push @small_files, open_new_temp_file();
for my $line (sort keys %seen) {
    print TEMPFILE $line;
}
close_temp_file();

Once you write out to a temporary file the performance goes down enormously,
so I want to be able to use as much memory as possible. Do you have any
suggestions as to how this could be improved?
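
For anyone who wants to experiment with the loop above, here is a minimal self-contained sketch of it. The helpers open_new_temp_file() and close_temp_file() are not shown in this post, so flush_seen() below is only an assumed stand-in built on the core File::Temp module, and the names flush_seen and dedupe_to_runs are hypothetical. Note that it only removes duplicates within each in-memory batch; duplicates that land in different temp files would still have to be dropped when the sorted files are merged.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the sorted keys of %$seen to a fresh temp file, then empty the hash.
# Returns the temp file's name. This is an assumed replacement for the
# open_new_temp_file()/close_temp_file() pair, which the post does not show.
sub flush_seen {
    my ($seen) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);   # file is removed at exit
    print {$fh} $_ for sort keys %$seen;
    close $fh or die "close $name: $!";
    %$seen = ();
    return $name;
}

# Read lines from the filehandle $in, skipping lines already seen in the
# current batch. Whenever the total length of the stored keys exceeds
# $max_bytes, the batch is written out as one sorted temp file.
# Returns the list of temp file names.
sub dedupe_to_runs {
    my ($in, $max_bytes) = @_;
    my (%seen, @small_files);
    my $hash_keys_size = 0;
    while (my $line = <$in>) {
        next if $seen{$line};
        $seen{$line} = 1;
        $hash_keys_size += length $line;
        if ($hash_keys_size > $max_bytes) {
            push @small_files, flush_seen(\%seen);
            $hash_keys_size = 0;
        }
    }
    # Flush whatever is left, but do not create an empty file.
    push @small_files, flush_seen(\%seen) if %seen;
    return @small_files;
}
```

A call such as dedupe_to_runs(\*STDIN, 64 * 1024 * 1024) would read standard input with a 64 MB key budget; that figure is arbitrary and would need tuning to the memory actually available.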
If you're interested, there are a few oddities I discovered en route too:
1) The code:
    next if $seen{$_}; $seen{$_}=1;
according to my benchmarking is faster than:
    unless ($seen{$_}++) {...}
even though the former looks up $seen{$_} twice, and ++ is a pretty trivial
operator. It's not where I'd have put my money.
2) There is a bug in ActiveState Perl 5.6.0 and onwards for Win32 where
garbage collection of %seen=() or %seen=undef is slow. I reported the bug
back in December, but nothing seems to have happened since.
http://bugs.activestate.com//ActivePerl/show_bug.cgi?id=18559
I'm surprised there's been no action on this one, as I would have thought
garbage collecting hashes was critical to object-oriented programming.
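
For oddity 1, here is a rough sketch of how the two idioms could be compared with the core Benchmark module. The input is synthetic and the sub names are made up for illustration; this is not the benchmark I actually ran, and the outcome is likely to depend on the Perl version and on the proportion of duplicates in the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input: 50_000 lines drawn from 10_000 distinct values.
my @lines = map { "line " . ($_ % 10_000) . "\n" } 1 .. 50_000;

# Idiom A: test, then assign (two hash lookups when a line is new).
sub count_test_then_set {
    my %seen;
    my $unique = 0;
    for (@lines) {
        next if $seen{$_};
        $seen{$_} = 1;
        $unique++;
    }
    return $unique;
}

# Idiom B: post-increment (one lookup, but the stored value is modified
# on every line, duplicates included).
sub count_post_increment {
    my %seen;
    my $unique = 0;
    for (@lines) {
        unless ($seen{$_}++) { $unique++ }
    }
    return $unique;
}

# Run each idiom for at least $cpu_seconds CPU seconds and print a
# comparison table of iterations per second.
sub compare_idioms {
    my ($cpu_seconds) = @_;
    cmpthese(-$cpu_seconds, {
        test_then_set  => \&count_test_then_set,
        post_increment => \&count_post_increment,
    });
}

# Only benchmark when asked to, e.g.:  perl idioms.pl --bench
compare_idioms(2) if @ARGV && $ARGV[0] eq '--bench';
```

Both subs return the number of distinct lines, so it is easy to check that they agree before trusting the timings.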
It was mostly because of the second problem that I never went through with
my challenge. It's very embarrassing to have to use an old version when
showing off how cool open source code is.
Do any of you FWPers out there have any other suggestions as to the fastest
sort-deduplicate algorithm in Perl, optimised for absurdly large input and
ideally using all of the elegant built-in features of Perl like hashes, sort
and regexes, if you can wangle it?
This is still fun, isn't it?
Alistair
> ----------------------------------------------------------------------
> Alistair McGlinchy, [EMAIL PROTECTED]
> Sizing and Performance, Central IT, ext. 5012, ph +44 20 7268-5012
> Marks and Spencer, 3 Longwalk Rd, Stockley Park, Uxbridge UB11 1AW, UK
>
> -----Original Message-----
> From: Michael G Schwern [SMTP:[EMAIL PROTECTED]]
> Sent: Tuesday, June 11, 2002 9:13 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Reading Huge Files into Memory (was RE: reading lines
> backwards )
>
> On Tue, Jun 11, 2002 at 06:11:37PM +0100,
> [EMAIL PROTECTED] wrote:
> > NT OS how much perl had requested but this number doesn't go down when
> > you execute %unique=().
>
> Perl doesn't free the memory used by %unique on the assumption it would
> just have to reallocate it when you use %unique again, which in this case
> is correct. If you want to free up the memory in %unique (back to Perl's
> pool, not the system), "undef %unique".
>
> AFAIK there is no easy way to force Perl to return memory to the system,
> nor is there to figure out how much memory Perl is using and, more
> importantly, how much Perl is holding in its memory pool and how much it's
> actually using to store data.
>
> PS A small hack which might make the above faster is to pre-extend %unique
> using "keys %unique = SOME_BIG_NUMBER"
>
>
> --
> This sig file temporarily out of order.