Re: Primitive benchmark comparison (parsing LDIF)

Norman Gaywood Wed, 27 Oct 2021 23:13:59 -0700

Oh, and I welcome suggestions on how I might do the task more quickly,
elegantly, differently, etc :-)
Critiques of the code are also welcome. I suspect I still have a strong
perl5 accent.


On Thu, 28 Oct 2021 at 13:15, Norman Gaywood <ngayw...@une.edu.au> wrote:

> Executive summary:
>     - comparing raku 2021.10 with raku 2021.9
>     - comparing 3 ways of parsing (although the 2 string-function ways
>       are similar)
>     - raku 2021.10 is more than twice as fast as 2021.9 using the
>       string functions
>     - raku 2021.10 is about the same as 2021.9 using a more general
>       regular expression
>     - regular expressions are still slow in 2021.10
>
> Side note: I also tried parsing with Text::LDIF (not shown here). In
> 2021.9 it was comparable to the regex method. I have not tried it with
> 2021.10.
>
> I need to parse a 40K entry LDIF file.
>
> Below is some code that parses the file in 3 ways.
> There are 3 MAIN subs that differ only in the last few lines of the for
> loop.
> The loop reads the LDIF entries and populates %ldap, keyed on the "uid" of
> the LDIF entry.
> The values of %ldap are User objects.
> A %f hash is used to build up the attributes of a User for each LDIF entry.
>
> The aim is to show the difference in timings between the 3 ways of parsing
> the LDIF.
>
> The 1st MAIN (regex) uses this general regular expression to build %f:
>         next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
>         %f{$0} = "$1";
>
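To make the regex variant concrete, here is a minimal standalone sketch of what that match captures. The helper name parse-by-regex and the sample lines are made up for illustration; the attribute list is written out by hand rather than taken from User.attributes.

```raku
# Attribute names, written out by hand to match the User class.
my @attributes = <uid uidNumber gidNumber homeDirectory mode>;

# Hypothetical helper: returns attribute => value, or Nil when the line
# does not start with one of the wanted attribute names.
sub parse-by-regex(Str $line) {
    # An array inside a regex is treated as an alternation of its elements.
    return Nil unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
    ~$0 => ~$1;
}

# Made-up sample lines, not taken from the real LDIF file.
say parse-by-regex("uidNumber: 1000");
say parse-by-regex("cn: Jane Doe");
```

The first call should give a Pair for uidNumber and the second should give Nil, since cn is not in the attribute list.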
> The "starts" MAIN uses starts-with() to build %f:
>         for @attributes -> $a {
>             if $line.starts-with( $a ~ ": " ) {
>                 %f{$a} = (split( ": ", $line, 2))[1];
>                 last;
>             }
>         }
>
> And finally the "split" MAIN uses split(), and also relies on User.new()
> ignoring named arguments that do not match an attribute:
>         ($k, $v) = split( ": ", $line, 2);
>         %f{$k} = $v;
>
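A possible middle ground between the "starts" and "split" approaches, which I have not benchmarked: split once, then keep the pair only if the key is a wanted attribute. This is only a sketch. The helper name parse-by-split and the $wanted Set are hypothetical, and the attribute list is written out by hand.

```raku
# Wanted attribute names as a Set, so membership is a single lookup.
my $wanted = <uid uidNumber gidNumber homeDirectory mode>.Set;

# Hypothetical helper: returns attribute => value, or Nil when the line
# has no ": " separator or the key is not a wanted attribute.
sub parse-by-split(Str $line) {
    my ($k, $v) = $line.split(": ", 2);
    return Nil unless $v.defined && $wanted{$k};
    $k => $v;
}

# Made-up sample lines, not taken from the real LDIF file.
say parse-by-split("homeDirectory: /home/jdoe");
say parse-by-split("objectClass: posixAccount");
```

With this, %f would only ever hold attributes that User actually has, so the code would no longer depend on User.new() ignoring the extras. Whether it is any faster than the plain split is an open question.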
> That's the difference between the MAIN()'s below. Sorry I couldn't golf it
> down more.
> Running the benchmarks multiple times does vary the times slightly but not
> significantly.
>
> Results for rakudo-pkg-2021.9.0-01:
> $ ./icheck.raku regex
> 41391 entries by regex in 27.859560887 seconds
> $ ./icheck.raku starts
> 41391 entries by starts-with in 5.970667533 seconds
> $ ./icheck.raku split
> 41391 entries by split in 5.12252741 seconds
>
> Results for rakudo-pkg-2021.10.0-01:
> $ ./icheck.raku regex
> 41391 entries by regex in 27.833870158 seconds
> $ ./icheck.raku starts
> 41391 entries by starts-with in 2.560101599 seconds
> $ ./icheck.raku split
> 41391 entries by split in 2.307679407 seconds
>
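For anyone who wants the ratios rather than the raw times, here is a small sketch that computes them from the numbers above. The names %before, %after and speedup are made up for this example.

```raku
# Timings in seconds, copied from the results above.
my %before = "regex" => 27.859560887, "starts" => 5.970667533, "split" => 5.12252741;   # 2021.9
my %after  = "regex" => 27.833870158, "starts" => 2.560101599, "split" => 2.307679407;  # 2021.10

# Speedup factor: 2021.9 time divided by 2021.10 time.
sub speedup(Str $method) { %before{$method} / %after{$method} }

for <regex starts split> -> $m {
    printf "%-6s %.2fx\n", $m, speedup($m);
}
```

By my arithmetic that comes to roughly 1.00x for regex, 2.33x for starts and 2.22x for split.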
> -------------------------------------
> #!/usr/bin/env raku
>
> class User {
>     has $.uid;
>     has $.uidNumber;
>     has $.gidNumber;
>     has $.homeDirectory;
>     has $.mode = 0;
>
>     method attributes {
>        # return <uid uidNumber gidNumber homeDirectory mode>;
>        User.^attributes(:local)>>.name>>.substr(2);  # Is the order guaranteed?
>     }
> }
>
> # Read user info from LDIF file
> my %ldap;
> my @attributes = User.attributes;
>
> multi MAIN ( "regex", $ldif-fn = "db/icheck.ldif" ) {
>     my ( %f );
>     for $ldif-fn.IO.lines -> $line {
>         when not $line {  # blank line is LDIF entry terminator
>             %ldap{%f<uid>} = User.new( |%f );
>         }
>         when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry
>
>         next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
>         %f{$0} = "$1";
>     }
>     say "{%ldap.elems} entries by regex in {now - BEGIN now} seconds";
> }
>
> multi MAIN ( "starts", $ldif-fn = "db/icheck.ldif" ) {
>     my ( %f );
>     for $ldif-fn.IO.lines -> $line {
>         when not $line {  # blank line is LDIF entry terminator
>             %ldap{%f<uid>} = User.new( |%f );
>         }
>         when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry
>
>         for @attributes -> $a {
>             if $line.starts-with( $a ~ ": " ) {
>                 %f{$a} = (split( ": ", $line, 2))[1];
>                 last;
>             }
>         }
>
>     }
>     say "{%ldap.elems} entries by starts-with in {now - BEGIN now} seconds";
> }
>
> multi MAIN ( "split", $ldif-fn = "db/icheck.ldif" ) {
>     my ( %f, $k, $v );
>     for $ldif-fn.IO.lines -> $line {
>         when not $line {  # blank line is LDIF entry terminator
>             %ldap{%f<uid>} = User.new( |%f );   # attributes not used are ignored
>         }
>         when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry
>
>         ($k, $v) = split( ": ", $line, 2);
>         %f{$k} = $v;
>     }
>     say "{%ldap.elems} entries by split in {now - BEGIN now} seconds";
> }
>
> --
> Norman Gaywood, Computer Systems Officer
> School of Science and Technology
> University of New England
> Armidale NSW 2351, Australia
>
> ngayw...@une.edu.au  http://turing.une.edu.au/~ngaywood
> Phone: +61 (0)2 6773 2412  Mobile: +61 (0)4 7862 0062
>
> Please avoid sending me Word or Power Point attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>


