Re: Primitive benchmark comparison (parsing LDIF)

2021-10-28 Thread Norman Gaywood
On Fri, 29 Oct 2021 at 09:14, Norman Gaywood  wrote:

>
> Might have to test this example again on 2021.10 (not easy for me).
>

I mean, test again on 2021.9 to see if there was a regex speedup in
2021.10.

-- 
Norman Gaywood, Computer Systems Officer
School of Science and Technology
University of New England
Armidale NSW 2351, Australia

ngayw...@une.edu.au  http://turing.une.edu.au/~ngaywood
Phone: +61 (0)2 6773 2412  Mobile: +61 (0)4 7862 0062

Please avoid sending me Word or Power Point attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


Re: Primitive benchmark comparison (parsing LDIF)

2021-10-28 Thread Norman Gaywood
On Fri, 29 Oct 2021 at 00:46, yary  wrote:

> A small thing to begin with in the regex  m/ ^ (@attributes) ':' \s (.+)
> $ /;
> m/ ^ (@attributes) ': ' (.*) $ /;
>

Yes, nice cleanup. Thanks.


> Next, how about adding a 2nd regex test similar to the "split" that also
> relies on User ignoring unknown fields? This accepts an empty-string key,
> which the "split" string handler does too.
>
> m/ ^ (<-[:]>*) ': ' (.*) /;
>

$ ./icheck.raku regex2
41391 entries by regex2 in 4.615332639 seconds

Whoa! That was surprising. The new regex is only about 2x slower than the
"split" method.

I did read on SO that someone claimed "longest-match alternation of the
list's elements" is slow.
But I thought the conclusion in the answers was that, in general, regexes
are slow.
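
For anyone wanting to repeat the regex2 versus split comparison without the LDIF file, here is a minimal sketch that times both line parsers on synthetic input. The sample lines, the line count, and the time-it helper are assumptions for illustration and are not part of icheck.raku; absolute timings will differ by machine and Rakudo version.

```raku
# Synthetic input: the real LDIF file is not reproduced here.
my @lines = (^20_000).map({ "uid: user$_" });

# Hypothetical helper: run &parser over every line, report the elapsed
# time, and return how many lines produced a defined value.
sub time-it( Str $label, &parser ) {
    my $start = now;
    my $count = 0;
    for @lines -> $line {
        $count++ if parser($line).defined;
    }
    say "{$label}: {$count} values in {now - $start} seconds";
    return $count;
}

# Same pattern as the regex2 MAIN.
my $by-regex = time-it 'regex2', -> $l {
    $l ~~ m/ ^ (<-[:]>*) ': ' (.*) $ / ?? ~$1 !! Nil
};

# Same call as the split MAIN.
my $by-split = time-it 'split', -> $l { (split( ': ', $l, 2 ))[1] };
```

Both parsers should report the same number of values, so any difference in the printed times comes from the parsing method rather than from the input.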

Might have to test this example again on 2021.10 (not easy for me).


>>> Results for rakudo-pkg-2021.9.0-01:
>>> $ ./icheck.raku regex
>>> 41391 entries by regex in 27.859560887 seconds
>>> $ ./icheck.raku starts
>>> 41391 entries by starts-with in 5.970667533 seconds
>>> $ ./icheck.raku split
>>> 41391 entries by split in 5.12252741 seconds
>>>
>>> Results for rakudo-pkg-2021.10.0-01
>>> $ ./icheck.raku regex
>>> 41391 entries by regex in 27.833870158 seconds
>>> $ ./icheck.raku starts
>>> 41391 entries by starts-with in 2.560101599 seconds
>>> $ ./icheck.raku split
>>> 41391 entries by split in 2.307679407 seconds
>>>
>>>
--
#!/usr/bin/env raku

class User {
    has $.uid;
    has $.uidNumber;
    has $.gidNumber;
    has $.homeDirectory;
    has $.mode = 0;

    method attributes {
        # return ;
        User.^attributes(:local)>>.name>>.substr(2);  # Is the order guaranteed?
    }
}

# Read user info from LDIF file
my %ldap;
my @attributes = User.attributes;

multi MAIN ( "regex", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        next unless $line ~~ m/ ^ (@attributes) ': ' (.*) $ /;
        %f{$0} = "$1";
    }
    say "{%ldap.elems} entries by regex in {now - BEGIN now} seconds";
}

multi MAIN ( "regex2", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        next unless $line ~~ m/ ^ (<-[:]>*) ': ' (.*) $ /;
        %f{$0} = "$1";
    }
    say "{%ldap.elems} entries by regex2 in {now - BEGIN now} seconds";
}

multi MAIN ( "starts", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        for @attributes -> $a {
            if $line.starts-with( $a ~ ": " ) {
                %f{$a} = (split( ": ", $line, 2))[1];
                last;
            }
        }
    }
    say "{%ldap.elems} entries by starts-with in {now - BEGIN now} seconds";
}

multi MAIN ( "split", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f, $k, $v );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );  # attributes not used are ignored
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        ($k, $v) = split( ": ", $line, 2);
        %f{$k} = $v;
    }
    say "{%ldap.elems} entries by split in {now - BEGIN now} seconds";
}





Re: Primitive benchmark comparison (parsing LDIF)

2021-10-28 Thread yary
A small thing to begin with in the regex

m/ ^ (@attributes) ':' \s (.+) $ /;

All the string examples use the literal ': ' colon+space, so how about
making the regex more consistent? And also allowing the empty string as a
value, which the string examples allow.

m/ ^ (@attributes) ': ' (.*) $ /;

Next, how about adding a 2nd regex test similar to the "split" that also
relies on User ignoring unknown fields? This accepts an empty-string key,
which the "split" string handler does too.

m/ ^ (<-[:]>*) ': ' (.*) /;
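
To make the difference between the two patterns concrete, a small sketch follows. The attribute list and the sample lines are assumptions for illustration (the list mirrors the accessors of the User class in the original post), and both patterns are anchored with $ here.

```raku
# Assumed attribute list, mirroring the User class in the original post.
my @attributes = <uid uidNumber gidNumber homeDirectory mode>;

# Made-up sample lines: known key, unknown key, empty value, empty key.
my %result;
for 'uid: ng', 'cn: Norman', 'mode: ', ': orphan' -> $line {
    # Attribute-list pattern: the key must be one of @attributes.
    my $strict = ?( $line ~~ m/ ^ (@attributes) ': ' (.*) $ / );
    # Character-class pattern: the key is any run of non-colon characters.
    my $loose  = ?( $line ~~ m/ ^ (<-[:]>*) ': ' (.*) $ / );
    %result{$line} = ( $strict, $loose );
    say "{$line.raku}  strict={$strict}  loose={$loose}";
}
```

Under these assumptions the attribute-list pattern should accept only the first and third lines, while the character-class pattern accepts all four, including the one with an empty key.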


-y



Re: Primitive benchmark comparison (parsing LDIF)

2021-10-28 Thread Norman Gaywood
Oh, and I welcome suggestions on how I might do the task more quickly,
elegantly, differently, etc :-)
And critiques of the code also welcome. I still have a strong perl5 accent
I suspect.


Primitive benchmark comparison (parsing LDIF)

2021-10-27 Thread Norman Gaywood
Executive summary:
 - comparing raku 2021.10 with raku 2021.9
 - comparing 3 ways of parsing (although the 2 string function ways are
   similar)
 - raku 2021.10 is more than twice as fast as 2021.9 using the string
   functions
 - raku 2021.10 is about the same as 2021.9 using a more general regular
   expression
 - regular expressions are still slow in 2021.10

Side note: not shown here is also parsing with Text::LDIF. In 2021.9 it was
comparable to the regex method. Not tried with 2021.10.

I need to parse a 40K entry LDIF file.

Below is some code that uses 3 ways to parse.
There are 3 MAIN subs that differ only in the last few lines of the for loop.
The loop reads the LDIF entries and populates %ldap keyed on the "uid" of
the LDIF entry.
The values of %ldap are User objects.
A %f hash is used to build the values of User on each LDIF entry.

The aim is to show the difference in timings between 3 ways of parsing the
LDIF.

The 1st MAIN (regex) uses this general regular expression to build %f
    next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
    %f{$0} = "$1";

The "starts" MAIN uses starts-with() to build %f
    for @attributes -> $a {
        if $line.starts-with( $a ~ ": " ) {
            %f{$a} = (split( ": ", $line, 2))[1];
            last;
        }
    }

And finally the "split" MAIN uses split() but also uses the feature that
User.new() will ignore attributes that are not used.
    ($k, $v) = split( ": ", $line, 2);
    %f{$k} = $v;
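
The split variant depends on the default constructor silently ignoring named arguments that do not correspond to an attribute. A minimal sketch of that behaviour, using a cut-down User class and a made-up entry (both are assumptions for illustration, not taken from the real data):

```raku
# Cut-down stand-in for the User class in the script below.
class User {
    has $.uid;
    has $.mode = 0;
}

my %f;
# Made-up LDIF lines; cn and homeDirectory are not attributes of this class.
for 'uid: ng', 'cn: Norman Gaywood', 'homeDirectory: /home/ng' -> $line {
    my ($k, $v) = split( ': ', $line, 2 );   # limit 2 keeps any later ': ' in the value
    %f{$k} = $v;
}

my $user = User.new( |%f );   # unknown keys are dropped without complaint
say $user.uid;                # ng
say $user.mode;               # 0 (the default, since no mode line was seen)
```

The trade-off is that %f collects every key in the entry, so a misspelled attribute name in the data goes unnoticed.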

That's the difference between the MAIN()'s below. Sorry I couldn't golf it
down more.
Running the benchmarks multiple times does vary the times slightly but not
significantly.

Results for rakudo-pkg-2021.9.0-01:
$ ./icheck.raku regex
41391 entries by regex in 27.859560887 seconds
$ ./icheck.raku starts
41391 entries by starts-with in 5.970667533 seconds
$ ./icheck.raku split
41391 entries by split in 5.12252741 seconds

Results for rakudo-pkg-2021.10.0-01
$ ./icheck.raku regex
41391 entries by regex in 27.833870158 seconds
$ ./icheck.raku starts
41391 entries by starts-with in 2.560101599 seconds
$ ./icheck.raku split
41391 entries by split in 2.307679407 seconds

-
#!/usr/bin/env raku

class User {
    has $.uid;
    has $.uidNumber;
    has $.gidNumber;
    has $.homeDirectory;
    has $.mode = 0;

    method attributes {
        # return ;
        User.^attributes(:local)>>.name>>.substr(2);  # Is the order guaranteed?
    }
}

# Read user info from LDIF file
my %ldap;
my @attributes = User.attributes;

multi MAIN ( "regex", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        next unless $line ~~ m/ ^ (@attributes) ':' \s (.+) $ /;
        %f{$0} = "$1";
    }
    say "{%ldap.elems} entries by regex in {now - BEGIN now} seconds";
}

multi MAIN ( "starts", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        for @attributes -> $a {
            if $line.starts-with( $a ~ ": " ) {
                %f{$a} = (split( ": ", $line, 2))[1];
                last;
            }
        }
    }
    say "{%ldap.elems} entries by starts-with in {now - BEGIN now} seconds";
}

multi MAIN ( "split", $ldif-fn = "db/icheck.ldif" ) {
    my ( %f, $k, $v );
    for $ldif-fn.IO.lines -> $line {
        when not $line {  # blank line is LDIF entry terminator
            %ldap{%f<uid>} = User.new( |%f );  # attributes not used are ignored
        }
        when $line.starts-with( 'dn: ' ) { %f = () }   # dn: starts a new entry

        ($k, $v) = split( ": ", $line, 2);
        %f{$k} = $v;
    }
    say "{%ldap.elems} entries by split in {now - BEGIN now} seconds";
}
