Thanks, Paul. I am surprised that mmap has such a large I/O advantage compared to the classic way. So Ruby benefits even more from the mmap call. I learned something from your case.

Regards


On 15.01.2022 17:57, Paul Procacci wrote:
Hey John,

One more follow-up and then it's bedtime for me.  I wanted to take
this discussion just a little further by porting the mmap solution
that I applied to perl over to ruby.  Now, all of a sudden, ruby is
much, much faster.  My ruby source code follows below.

Goodnight!

# ruby -W0 ./doit.rb | md5
786be54356a5832dcd1148c18de71fc8
# perl ./doit2.pl | md5
786be54356a5832dcd1148c18de71fc8

# truss -c ruby -W0 ./doit.rb

<!-- snip -->
                      ------------- ------- -------
                        0.014111502    1855     260

# truss -c perl ./doit2.pl

<!-- snip -->
                      ------------- ------- -------
                        0.049820267     777      52

-------------------------------------
require 'mmap'

# Load the stop words into a hash for fast membership checks.
stopwords = {}
mmap_s = Mmap.new('stopwords.txt')
mmap_s.advise(Mmap::MADV_SEQUENTIAL)
mmap_s.each_line do |s|
  s.strip!
  stopwords[s] = 1
end

# Count every word that is not a stop word.
count = Hash.new(0)
mmap_c = Mmap.new('words.txt')
mmap_c.advise(Mmap::MADV_SEQUENTIAL)
mmap_c.each_line do |s|
  s.strip!
  count[s] += 1 unless stopwords.has_key?(s)
end

# Print the 20 most frequent words, highest count first.
z = count.sort { |a1, a2| a2[1] <=> a1[1] }
z.take(20).each { |s| puts "#{s[0]} -> #{s[1]}" }
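
For comparison, here is a minimal sketch of the same count done the classic way, with ordinary buffered reads from the standard library instead of mmap. The helper name top_words and its limit parameter are made up for illustration and are not part of the script above. Ties are broken alphabetically here, which the script above does not do, so the order of equally frequent words may differ.

```ruby
require 'set'

# Hypothetical helper (not from the original script): count the words in
# words_path that do not appear in stopwords_path, using plain buffered reads.
def top_words(words_path, stopwords_path, limit = 20)
  stopwords = Set.new
  File.foreach(stopwords_path) { |line| stopwords << line.strip }

  count = Hash.new(0)
  File.foreach(words_path) do |line|
    word = line.strip
    count[word] += 1 unless stopwords.include?(word)
  end

  # Highest count first; ties broken alphabetically (an assumption made
  # here so the output order is deterministic).
  count.sort_by { |word, n| [-n, word] }.first(limit)
end

# Only runs when the two input files from the thread are present.
if File.exist?('words.txt') && File.exist?('stopwords.txt')
  top_words('words.txt', 'stopwords.txt').each do |word, n|
    puts "#{word} -> #{n}"
  end
end
```

Running both versions under truss or strace should show the difference in syscall counts discussed in this thread.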

On Sat, Jan 15, 2022 at 3:48 AM Paul Procacci <pproca...@gmail.com>
wrote:

Hey John,

On Sat, Jan 15, 2022 at 3:04 AM Jon Smart <j...@smartown.nl> wrote:

Hello Paul

Do you mean that by undefining $/ and reading with <$fh> we can read
the file into memory all at once?

In most cases the short answer is yes.

I have a quibble with your wording, however, given the 'geek' that I
am.  'At one time' .... not quite.  In your example the process made
over 4000 read(2) syscalls to the operating system, for instance.
That wouldn't have been 'at one time'.  ;)

Yes, that would be faster because we don't need to read the file
line by line, which increases the disk I/O.

It actually doesn't make it faster.
Perl buffers its reads, as do all modern programming languages.
If you ask perl to give you 10 bytes it certainly will, but what you
may not know is that perl has really read up to 8192 bytes.  It only
gave you what you asked for and the rest is sitting in perl's buffers.

To put this another way, you can put 8192 newline characters in a
file and read this file line by line.  This doesn't equate to 8192
separate read(2) syscalls ... it's just 1 read syscall for the data.
Reading line by line will be neither faster nor slower.
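
The same point can be sketched in Ruby, the language of the script earlier in this thread. This is only an illustration under stated assumptions: it takes the 8192-byte figure from the description above as the chunk size, and the helper names count_lines and count_chunk_reads are made up. It writes 8192 newlines to a temporary file, counts the lines through the normal buffered interface, and then counts how many unbuffered sysread calls of 8192 bytes are needed to fetch the same data.

```ruby
require 'tempfile'

# Hypothetical helper: count lines through Ruby's buffered IO layer.
def count_lines(path)
  File.foreach(path).count
end

# Hypothetical helper: count how many unbuffered reads of `chunk` bytes
# return data before end of file (the final EOF probe is not counted).
def count_chunk_reads(path, chunk = 8192)
  reads = 0
  File.open(path, 'rb') do |f|
    begin
      loop do
        f.sysread(chunk) # bypasses Ruby's buffer; raises EOFError at EOF
        reads += 1
      end
    rescue EOFError
      # reached end of file
    end
  end
  reads
end

file = Tempfile.new('newlines')
file.write("\n" * 8192)
file.close # flushes to disk; the file stays until it is unlinked

puts "lines seen by the program: #{count_lines(file.path)}"
puts "8192-byte reads with data: #{count_chunk_reads(file.path)}"
```

The number of lines a program iterates over and the number of reads that actually fetch data from the kernel are two separate quantities, which is why line-by-line reading is not inherently slower.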

A couple more questions:
1. what's the "truss" command?

truss is akin to strace.  If you're on Linux, you can install strace
and get much the same kind of utility.

It allows you to trace system calls and see how much of a given
program's time is spent waiting on the kernel and/or how often it's
asking the kernel to do something.
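
As a rough, in-process approximation of that idea (not a substitute for truss or strace, which see the individual syscalls), Ruby's standard Benchmark module reports user and system CPU time separately. The sketch below is an assumption-laden illustration: the helper name read_cpu_times and the sample file are made up, and system CPU time only reflects time the kernel spent working on the process's behalf, not time spent blocked.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical helper: read `path` line by line and return the user and
# system CPU seconds that the OS attributes to this process for the read.
def read_cpu_times(path)
  t = Benchmark.measure do
    File.foreach(path) { |_line| }
  end
  { user: t.utime, system: t.stime }
end

sample = Tempfile.new('sample')
sample.write("word\n" * 100_000)
sample.close

times = read_cpu_times(sample.path)
puts format('user %.6fs  system %.6fs', times[:user], times[:system])
```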

2. what's the syntax "<:mmap"?

mmap is a method of mapping a file (among other things) into memory
on an on-demand basis.
Given the example you provided, this is actually where the speedup
comes from.  This is because my version replaces the 4000+ read(2)
syscalls with just 2 mmap(2) syscalls, one per input file.

Thank you.

~Paul

--
__________________

:(){ :|:& };:

