Re: Reading Huge Files into Memory (was RE: reading lines backwards)

Michael G Schwern Tue, 11 Jun 2002 15:52:10 -0700

On Tue, Jun 11, 2002 at 10:53:44PM +0100, [EMAIL PROTECTED] wrote:
> Thanks for your reply. I was wondering, if you can't get any details about
> the overall volume of data perl has in memory, can you find out information
> about an individual variable's usage?


Not really.  "Individual" variables are actually complex conglomerations of
lots of internal variables.  For example, a Perl hash is really a C struct
containing various integers, pointers and strings and a C array of SV
pointers to Perl scalar values.  Plus some empty slots in the array.

In theory it is possible to calculate how much memory a hash is using by
examining its internal structure using XS, but to give you an idea:

$ perl5.8.0 -MDevel::Peek -wle '%h = (foo => 42, bar => 23, baz => 27);  print Dump \%h'
SV = RV(0x10226420) at 0x101d2098
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x102136b4
  SV = PVHV(0x10211f50) at 0x102136b4
    REFCNT = 2
    FLAGS = (SHAREKEYS)
    IV = 3
    NV = 0
    ARRAY = 0x10217008  (0:5, 1:3)
    hash quality = 150.0%
    KEYS = 3
    FILL = 3
    MAX = 7
    RITER = -1
    EITER = 0x0
    Elt "bar" HASH = 0x80409109
    SV = IV(0x10216690) at 0x101d22a8
      REFCNT = 1
      FLAGS = (IOK,pIOK)
      IV = 23
    Elt "baz" HASH = 0xffb60ff2
    SV = IV(0x10216698) at 0x101d22cc
      REFCNT = 1
      FLAGS = (IOK,pIOK)
      IV = 27
    Elt "foo" HASH = 0x238678dd
    SV = IV(0x10216688) at 0x101d217c
      REFCNT = 1
      FLAGS = (IOK,pIOK)
      IV = 42

but this might get a little ridiculous.  Just looking at the length() of
each individual key/value pair is close enough.
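
A minimal sketch of that approximation, assuming a hypothetical
approx_hash_size() helper (not part of any module).  It only sums the
string lengths of keys and values, so it ignores all the per-SV and
per-bucket overhead visible in the dump above:

```perl
use strict;
use warnings;

# Hypothetical helper: rough payload size of a hash, in characters.
# Counts only the string lengths of keys and values, nothing else.
sub approx_hash_size {
    my ($h) = @_;
    my $bytes = 0;
    for my $k (keys %$h) {
        my $v = $h->{$k};
        $bytes += length($k);
        $bytes += length($v) if defined $v;   # numbers are measured as strings
    }
    return $bytes;
}

my %h = (foo => 42, bar => 23, baz => 27);
print approx_hash_size(\%h), "\n";   # 3 keys * 3 chars + 3 values * 2 chars = 15
```

Treat the result as a lower bound that is useful for comparing data sets,
not as the real memory footprint.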


> This was the core loop within my code
> which has to constantly compute the length of each input line just to have
> some sort of handle on memory usage. 

Fortunately for you, Perl's strings are Pascal style, with the length
pre-calculated.

$ perl5.8.0 -MDevel::Peek -wle '$foo = "wibble";  print Dump \$foo'
SV = RV(0x10226420) at 0x101d217c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x10212e90
  SV = PV(0x101d2468) at 0x10212e90
    REFCNT = 2
    FLAGS = (POK,pPOK)
    PV = 0x101d8bc8 "wibble"\0
    CUR = 6
    LEN = 7

so length() is very cheap.  Unlike strlen() in C, it doesn't have to walk
the string looking for a null byte.
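
Since length() only reads the stored length, keeping a running total per
input line costs very little.  A minimal sketch of such a loop, assuming a
hypothetical batch_lines() helper and a caller-supplied flush callback
(neither comes from the original code), and treating summed string length
as a stand-in for real memory use:

```perl
use strict;
use warnings;

# Hypothetical helper: group lines from $fh into batches whose combined
# string length stays at or under $budget, and hand each batch to $flush.
# A single line longer than $budget still gets a batch of its own.
sub batch_lines {
    my ($fh, $budget, $flush) = @_;
    my ($batch, $used) = ([], 0);
    while (my $line = <$fh>) {
        my $len = length $line;            # reads the stored length, no scan
        if (@$batch && $used + $len > $budget) {
            $flush->($batch);
            ($batch, $used) = ([], 0);     # fresh array, old one stays valid
        }
        push @$batch, $line;
        $used += $len;
    }
    $flush->($batch) if @$batch;
}

# Usage: an in-memory filehandle (perl 5.8 or later) stands in for a real file.
my $data = "aaaa\nbb\ncccccc\nd\n";
open my $fh, '<', \$data or die "open: $!";
my @sizes;
batch_lines($fh, 8, sub { push @sizes, scalar @{ $_[0] } });
print "@sizes\n";                          # 2 1 1
```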


> Once you write out to a temporary file the performance goes down enormously,

*mumble*Use an OS with real disk caching*mumble*  Hmmm, what?  I didn't say
anything. :)

> so I want to be able to use as much memory as possible. Do you have any
> suggestions as to how this could be improved? 

I believe Programming Pearls has a chapter on this, or something similar,
under merge sort.
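
For what it's worth, the general shape of an external merge sort is: sort
as many lines as fit in memory, write each sorted run to a temporary file,
then merge the runs.  This is a generic sketch, not necessarily what the
book presents.  write_run() and merge_runs() are hypothetical names;
File::Temp is a core module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one sorted run to an anonymous temporary file and return a handle
# positioned at its start.  Lines are expected to end in a newline.
sub write_run {
    my ($lines) = @_;
    my $fh = tempfile();                   # scalar context: handle only
    print {$fh} sort @$lines;
    seek $fh, 0, 0 or die "seek: $!";
    return $fh;
}

# Merge already-sorted runs, calling $emit on each line in sorted order.
sub merge_runs {
    my ($runs, $emit) = @_;
    # Each entry is [handle, current head line]; empty runs are dropped.
    my @heads = grep { defined $_->[1] }
                map  { [ $_, scalar readline($_) ] } @$runs;
    while (@heads) {
        # Linear scan for the smallest head; a heap would suit many runs.
        my $min = 0;
        for my $i (1 .. $#heads) {
            $min = $i if $heads[$i][1] lt $heads[$min][1];
        }
        $emit->($heads[$min][1]);
        my $next = readline $heads[$min][0];
        if (defined $next) { $heads[$min][1] = $next }
        else               { splice @heads, $min, 1 }
    }
}

# Usage: two small runs merged into one sorted stream.
my @runs = (
    write_run([ "pear\n", "apple\n" ]),
    write_run([ "fig\n", "zebra\n", "banana\n" ]),
);
my @out;
merge_runs(\@runs, sub { push @out, $_[0] });
print @out;                                # apple, banana, fig, pear, zebra
```

Only one line per run is held in memory during the merge, so the run size
is the knob that decides how much memory gets used.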


> If you're interested, there's a few oddities I discovered en route too:
> 1) The code: 
>       next if $seen{$_}; $seen{$_}=1;
> according to my benchmarking is faster than: 
>       unless ($seen{$_}++)  {...}
> even though the former looks up $seen{$_} twice, and ++ is a pretty trivial
> operator. It's not where I'd have put my money.

Here's what I get on Linux.

$ perl5.6.1 ~/src/bench/next_vs_unless 200000
Benchmark: timing 200000 iterations of control, next, unless...
   control:  1 wallclock secs ( 0.18 usr +  0.00 sys =  0.18 CPU) @ 1111111.11/s (n=200000)
            (warning: too few iterations for a reliable count)
      next:  2 wallclock secs ( 1.74 usr +  0.00 sys =  1.74 CPU) @ 114942.53/s (n=200000)
    unless:  1 wallclock secs ( 1.17 usr +  0.02 sys =  1.19 CPU) @ 168067.23/s (n=200000)
             Rate    next  unless control
next     114943/s      --    -32%    -90%
unless   168067/s     46%      --    -85%
control 1111111/s    867%    561%      --

When things are this small and this close, changes in OS, perl version, how
perl was compiled, etc... will all cause perl's "speed" to vary.  (Benchmark
code attached).
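
For reference, the two idioms do the same job in a real loop.  A small
sketch with hypothetical uniq_next() and uniq_unless() helpers, which says
nothing about which one is faster on a given perl:

```perl
use strict;
use warnings;

# Both subs return their arguments with duplicates dropped, keeping the
# first occurrence of each value.

sub uniq_next {
    my (%seen, @out);
    for (@_) {
        next if $seen{$_};       # first lookup
        $seen{$_} = 1;           # second lookup, plain store
        push @out, $_;
    }
    return @out;
}

sub uniq_unless {
    my (%seen, @out);
    for (@_) {
        unless ($seen{$_}++) {   # one lookup, post-increment
            push @out, $_;
        }
    }
    return @out;
}

print join(",", uniq_next(qw(a b a c b))),   "\n";   # a,b,c
print join(",", uniq_unless(qw(a b a c b))), "\n";   # a,b,c
```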


> 2) There is a bug in ActiveState Perl 5.6.0 and onwards for Win32 where garbage
> collection of %seen=() or %seen=undef is slow.  I reported the bug back in
> December, but nothing seems to have happened since.
> http://bugs.activestate.com//ActivePerl/show_bug.cgi?id=18559
> I'm surprised there's been no action on this one as garbage collecting
> hashes I would have thought was critical to object oriented programming.

According to Bugzilla, Guru was unable to reproduce the problem:

  ------- Comments from Gurusamy Sarathy 2001-12-06 20:02
  I don't see a significant difference among any of the 6xx builds,
  so I think you are mistaken about 5.6.0 being faster.  The 5xx
  builds (5.0050x based) could have been faster for subsequent runs,
  since they used a different memory allocator.
  Thanks for the interesting test case.  We'll continue to
  investigate this.


-- 
This sig file temporarily out of order.

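#!/usr/bin/perl -w

use Benchmark qw(cmpthese);

# %hash is a package global (no "use strict"), shared by all three subs.
# Each sub ends by deleting the key, so every iteration starts from an
# empty hash and the control measures the cost of the delete alone.
# First argument: iteration count; the default of -3 means run each
# sub for at least 3 CPU seconds.
cmpthese(shift || -3, {
    # Two lookups: test, then plain assignment.
    next    => sub { do { next if $hash{foo};  $hash{foo} = 1;
                          delete $hash{foo}
                     }
               },
    # One lookup with post-increment.
    unless  => sub { unless( $hash{foo}++ ) { delete $hash{foo} } },
    control => sub { delete $hash{foo} }
  }
);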
