Re: Primitive benchmark comparison (parsing LDIF)
On Fri, 29 Oct 2021 at 09:14, Norman Gaywood wrote:
> Might have to test this example again on 2021.10 (not easy for me).

I mean, test again on 2021.9 to see if there was a regex speed up in 2021.10

--
Norman Gaywood, Computer Systems Officer
School of Science and Technology
University of New England
Armidale NSW 2351, Australia

ngayw...@une.edu.au  http://turing.une.edu.au/~ngaywood
Phone: +61 (0)2 6773 2412  Mobile: +61 (0)4 7862 0062

Please avoid sending me Word or Power Point attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
Re: Primitive benchmark comparison (parsing LDIF)
On Fri, 29 Oct 2021 at 00:46, yary wrote:
> A small thing to begin with in the regex
> m/ ^ (@attributes) ':' \s (.+) $ /;
> m/ ^ (@attributes) ': ' (.*) $ /;

Yes, nice cleanup. Thanks.

> Next, how about adding a 2nd regex test similar to the "split" that also
> relies on User ignoring unknown fields? This accepts an empty-string key,
> which the "split" string handler does too.
>
> m/ ^ (<-[:]>*) ': ' (.*) /;

$ ./icheck.raku regex2
41391 entries by regex2 in 4.615332639 seconds

Whoa! That was surprising. The new regex is only about 2x slower than the "split" method.

I did read on SO that someone claimed "longest-match alternation of the list's elements" is slow. But I thought the conclusion in the answers was that, in general, regexes are slow.

Might have to test this example again on 2021.10 (not easy for me).

>>> Results for rakudo-pkg-2021.9.0-01:
>>> $ ./icheck.raku regex
>>> 41391 entries by regex in 27.859560887 seconds
>>> $ ./icheck.raku starts
>>> 41391 entries by starts-with in 5.970667533 seconds
>>> $ ./icheck.raku split
>>> 41391 entries by split in 5.12252741 seconds
>>>
>>> Results for rakudo-pkg-2021.10.0-01
>>> $ ./icheck.raku regex
>>> 41391 entries by regex in 27.833870158 seconds
>>> $ ./icheck.raku starts
>>> 41391 entries by starts-with in 2.560101599 seconds
>>> $ ./icheck.raku split
>>> 41391 entries by split in 2.307679407 seconds

#!/usr/bin/env raku

class User {
    has $.uid;
    has $.uidNumber;
    has $.gidNumber;
    has $.homeDirectory;
    has $.mode = 0;

    method attributes {
        # return ;
        User.^attributes(:local)>>.name>>.substr(2); # Is the order guaranteed?
    }
}

# Read user info from LDIF file
my %ldap;
my @attributes = User.attributes;

multi MAIN ( "regex", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {    # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry

        next unless $line ~~ m/ ^ (@attributes) ': ' (.*) $ /;
        %f{$0} = "$1";
    }
    say "{%ldap.elems} entries by regex in {now - BEGIN now} seconds";
}

multi MAIN ( "regex2", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {    # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry

        next unless $line ~~ m/ ^ (<-[:]>*) ': ' (.*) $ /;
        %f{$0} = "$1";
    }
    say "{%ldap.elems} entries by regex2 in {now - BEGIN now} seconds";
}

multi MAIN ( "starts", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {    # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry

        for @attributes -> $a {
            if $line.starts-with( $a ~ ": " ) {
                %f{$a} = (split( ": ", $line, 2))[1];
                last;
            }
        }
    }
    say "{%ldap.elems} entries by starts-with in {now - BEGIN now} seconds";
}

multi MAIN ( "split", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f, $k, $v );
    for $ldif-fn.IO.lines -> $line {
        when not $line {    # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );    # attributes not used are ignored
        }
        when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry

        ($k, $v) = split( ": ", $line, 2);
        %f{$k} = $v;
    }
    say "{%ldap.elems} entries by split in {now - BEGIN now} seconds";
}

--
Norman Gaywood
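The message above contrasts the slow alternation over @attributes with the much faster split() approach, which accepts any key. One possible middle ground, shown here only as a sketch, is to split first and then check the key against a Set of known names. The helper name parse-attr-line and the hard-coded attribute list are assumptions for illustration, not part of the thread's script, and no timing is claimed for it.

```raku
# Attribute names are hard-coded here for illustration; the thread's script
# derives them from the User class instead.
my @attributes = <uid uidNumber gidNumber homeDirectory mode>;
my $known = @attributes.Set;

# Hypothetical helper (not from the thread): returns a Pair for a line that
# starts with a known attribute name, or Nil for anything else.
sub parse-attr-line(Str $line) {
    my ($k, $v) = $line.split(': ', 2);
    return Nil unless $v.defined && $known{$k};
    $k => $v;
}

say parse-attr-line('uid: norman');
say parse-attr-line('cn: Norman Gaywood');
```

This keeps the filtering behaviour of the "regex" and "starts" variants without relying on User.new() to discard unknown fields.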
Re: Primitive benchmark comparison (parsing LDIF)
A small thing to begin with in the regex

    m/ ^ (@attributes) ':' \s (.+) $ /;

All the string examples use the literal ': ' colon+space, so how about making the regex more consistent? And also allowing the empty string as a value, which the string examples allow.

    m/ ^ (@attributes) ': ' (.*) $ /;

Next, how about adding a 2nd regex test similar to the "split" that also relies on User ignoring unknown fields? This accepts an empty-string key, which the "split" string handler does too.

    m/ ^ (<-[:]>*) ': ' (.*) /;

-y

On Thu, Oct 28, 2021 at 2:14 AM Norman Gaywood wrote:
> Oh, and I welcome suggestions on how I might do the task more quickly,
> elegantly, differently, etc :-)
> And critiques of the code also welcome. I still have a strong perl5 accent
> I suspect.
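The three patterns discussed in this message differ in which lines they accept. The following sketch runs each of them against a few made-up sample lines; the helper name accepted-by, the labels, the hard-coded attribute list, and the sample lines are all assumptions for illustration.

```raku
# Attribute names hard-coded for illustration only.
my @attributes = <uid uidNumber gidNumber homeDirectory mode>;

# Hypothetical helper: lists which of the three patterns match a line.
sub accepted-by(Str $line) {
    my @hits;
    # Original pattern: known key, and at least one character of value.
    @hits.push('original') if $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
    # Cleaned-up pattern: known key, literal ': ', value may be empty.
    @hits.push('cleanup')  if $line ~~ m/ ^ (@attributes) ': ' (.*) $ /;
    # Second regex: any key without a colon (even an empty one).
    @hits.push('regex2')   if $line ~~ m/ ^ (<-[:]>*) ': ' (.*) /;
    @hits;
}

for 'uid: norman', 'mode: ', 'cn: Norman', ': orphan' -> $line {
    say "{$line.raku} accepted by: {accepted-by($line).join(', ') || 'none'}";
}
```

A line such as 'mode: ' (empty value) is where the original and cleaned-up patterns part ways, while 'cn: Norman' and ': orphan' are only accepted by the pattern that does not consult @attributes.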
Re: Primitive benchmark comparison (parsing LDIF)
Oh, and I welcome suggestions on how I might do the task more quickly, elegantly, differently, etc :-)
And critiques of the code also welcome. I still have a strong perl5 accent I suspect.

On Thu, 28 Oct 2021 at 13:15, Norman Gaywood wrote:
> Executive summary:
>  - comparing raku 2021.10 with raku 2021.9
>  - comparing 3 ways of parsing (although the 2 string function ways are
>    similar)
>  - raku 2021.10 is more than twice as fast as 2021.9 using the string
>    functions
>  - raku 2021.10 is about the same as 2021.9 using a more general regular
>    expression
>  - regular expressions are still slow in 2021.10
>
> Side note: not shown here is also parsing with Text::LDIF. In 2021.9 it
> was comparable to the regex method. Not tried with 2021.10.
>
> I need to parse a 40K entry LDIF file.
>
> Below is some code that uses 3 ways to parse.
> There are 3 MAIN subs that differ in the last few lines of the for loop.
> The loop reads the LDIF entries and populates %ldap keyed on the "uid" of
> the LDIF entry.
> The values of %ldap are User objects.
> A %f hash is used to build the values of User on each LDIF entry.
>
> The aim is to show the difference in timings between 3 ways of parsing the
> LDIF.
>
> The 1st MAIN (regex) uses this general regular expression to build %f
>     next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
>     %f{$0} = "$1";
>
> The "starts" MAIN uses starts-with() to build %f
>     for @attributes -> $a {
>         if $line.starts-with( $a ~ ": " ) {
>             %f{$a} = (split( ": ", $line, 2))[1];
>             last;
>         }
>     }
>
> And finally the "split" MAIN uses split() but also uses the feature that
> User.new() will ignore attributes that are not used.
>     ($k, $v) = split( ": ", $line, 2);
>     %f{$k} = $v;
>
> That's the difference between the MAIN()'s below. Sorry I couldn't golf it
> down more.
> Running the benchmarks multiple times does vary the times slightly but not
> significantly.
>
> Results for rakudo-pkg-2021.9.0-01:
> $ ./icheck.raku regex
> 41391 entries by regex in 27.859560887 seconds
> $ ./icheck.raku starts
> 41391 entries by starts-with in 5.970667533 seconds
> $ ./icheck.raku split
> 41391 entries by split in 5.12252741 seconds
>
> Results for rakudo-pkg-2021.10.0-01
> $ ./icheck.raku regex
> 41391 entries by regex in 27.833870158 seconds
> $ ./icheck.raku starts
> 41391 entries by starts-with in 2.560101599 seconds
> $ ./icheck.raku split
> 41391 entries by split in 2.307679407 seconds
>
> #!/usr/bin/env raku
>
> class User {
>     has $.uid;
>     has $.uidNumber;
>     has $.gidNumber;
>     has $.homeDirectory;
>     has $.mode = 0;
>
>     method attributes {
>         # return ;
>         User.^attributes(:local)>>.name>>.substr(2); # Is the order guaranteed?
>     }
> }
>
> # Read user info from LDIF file
> my %ldap;
> my @attributes = User.attributes;
>
> multi MAIN ( "regex", $ldif-fn = "db/icheck.ldif" ) {
>     my ( %f );
>     for $ldif-fn.IO.lines -> $line {
>         when not $line {    # blank line is LDIF entry terminator
>             %ldap{%f<uid>} = User.new( |%f );
>         }
>         when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry
>
>         next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
>         %f{$0} = "$1";
>     }
>     say "{%ldap.elems} entries by regex in {now - BEGIN now} seconds";
> }
>
> multi MAIN ( "starts", $ldif-fn = "db/icheck.ldif" ) {
>     my ( %f );
>     for $ldif-fn.IO.lines -> $line {
>         when not $line {    # blank line is LDIF entry terminator
>             %ldap{%f<uid>} = User.new( |%f );
>         }
>         when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry
>
>         for @attributes -> $a {
>             if $line.starts-with( $a ~ ": " ) {
>                 %f{$a} = (split( ": ", $line, 2))[1];
>                 last;
>             }
>         }
>     }
>     say "{%ldap.elems} entries by starts-with in {now - BEGIN now} seconds";
> }
>
> multi MAIN ( "split", $ldif-fn = "db/icheck.ldif" ) {
>     my ( %f, $k, $v );
>     for $ldif-fn.IO.lines -> $line {
>         when not $line {    # blank line is LDIF entry terminator
>             %ldap{%f<uid>} = User.new( |%f );    # attributes not used are ignored
>         }
>         when $line.starts-with( 'dn: ' ) { %f = () }    # dn: starts a new entry
>
>         ($k, $v) = split( ": ", $line, 2);
>         %f{$k} = $v;
>     }
>     say "{%ldap.elems} entries by split in {now - BEGIN now} seconds";
> }

-- Norman Gaywood, Computer