Hey John,

One more follow-up and then it's bedtime for me. I wanted to take this discussion a little further by applying the mmap solution I used for perl to ruby as well. With that change, ruby is suddenly much faster. My ruby source code follows below.
Goodnight!

# ruby -W0 ./doit.rb | md5
786be54356a5832dcd1148c18de71fc8
# perl ./doit2.pl | md5
786be54356a5832dcd1148c18de71fc8

# truss -c ruby -W0 ./doit.rb
<!-- snip -->
                              ------------- ------- -------
                                0.014111502    1855     260
# truss -c perl ./doit2.pl
<!-- snip -->
                              ------------- ------- -------
                                0.049820267     777      52

-------------------------------------

require 'mmap';

# Load the stopwords into a hash for fast membership checks.
stopwords = {}
mmap_s = Mmap.new('stopwords.txt')
mmap_s.advise(Mmap::MADV_SEQUENTIAL)
mmap_s.each_line do |s|
  s.strip!
  stopwords[s] = 1
end

# Count every word in words.txt that is not a stopword.
count = {}
mmap_c = Mmap.new('words.txt')
mmap_c.advise(Mmap::MADV_SEQUENTIAL)
mmap_c.each_line do |s|
  s.strip!
  if ! stopwords.has_key?(s)
    if count.has_key?(s)
      count[s] += 1
    else
      count[s] = 1
    end
  end
end

# Sort by count, highest first, and print the top 20.
z = count.sort {|a1,a2| a2[1]<=>a1[1]}
z.take(20).each do |s|
  puts "#{s[0]} -> #{s[1]}"
end

On Sat, Jan 15, 2022 at 3:48 AM Paul Procacci <pproca...@gmail.com> wrote:

> Hey John,
>
> On Sat, Jan 15, 2022 at 3:04 AM Jon Smart <j...@smartown.nl> wrote:
>
>>
>> Hello Paul
>>
>> Do you mean by undef $/ and with <$fh> we can read the file into memory
>> at one time?
>>
>
> In most cases the short answer is yes.
> I have a problem with your wording, however, given the 'geek' that I am.
> 'At one time' .... not quite. In your example the program made over 4000
> read(2) syscalls to the operating system, for instance. That wouldn't
> have been 'at one time'. ;)
>
>
>> Yes that would be faster b/c we don't need to read file by each line,
>> which increases the disk IO.
>>
>>
> It actually doesn't make it faster.
> Perl buffers its reads, as do all modern programming languages. If you
> ask perl to give you 10 bytes it certainly will, but what you don't see is
> that perl has really read up to 8192 bytes. It only gave you what you
> asked for, and the rest is sitting in perl's buffers.
> To put this another way, you can put 8192 newline characters in a file and
> read this file line by line. This doesn't equate to 8192 separate read(2)
> syscalls ... it's just 1 read syscall.
> It will be neither faster nor slower.
>
>
>
>> Other questions:
>> 1. what's the "truss" command?
>>
>
> truss is akin to strace. If you're on linux, you can install strace and
> get much the same kind of utility.
> It allows you to trace system calls and see how much of a given program's
> time is spent waiting on the kernel and/or how often it's asking the
> kernel to do something.
>
>> 2. what's the syntax "<:mmap"?
>>
>
> mmap is a method of mapping a file (among other things) into memory on an
> on-demand basis.
> Given the example you provided, this is actually where the speedup comes
> from. This is because my version removes the 4000+ read(2) syscalls in
> favor of just 2 mmap(2) syscalls.
>
> Thank you.
>
>
> ~Paul
>
--
__________________

:(){ :|:& };:
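
P.S. For anyone who wants to see the buffering point from the quoted mail without truss or strace, here is a rough ruby sketch. It is not how perl or ruby implement their buffers internally; it just imitates a buffered reader by hand. CountingLineReader and count_lines_and_reads are names made up for this illustration, and the 8192-byte chunk size is simply the figure mentioned above (real runtimes may use a different size). It uses IO#sysread, which bypasses ruby's own buffering, so each chunk that comes back corresponds to one underlying read.

```ruby
require 'tempfile'

# Hypothetical helper for illustration only: pulls data from the file in
# fixed-size chunks via IO#sysread (unbuffered) and hands out one line at a
# time, counting how many chunk reads actually returned data.
class CountingLineReader
  attr_reader :reads

  def initialize(io, chunk_size = 8192)
    @io = io
    @chunk_size = chunk_size
    @buffer = String.new # binary-safe empty buffer
    @reads = 0
  end

  def each_line
    loop do
      # Serve every complete line already sitting in the buffer.
      while (idx = @buffer.index("\n"))
        yield @buffer.slice!(0..idx)
      end
      begin
        # Only go back to the kernel when the buffer has no full line left.
        @buffer << @io.sysread(@chunk_size)
        @reads += 1
      rescue EOFError
        break
      end
    end
    # A final line without a trailing newline is still a line.
    yield @buffer unless @buffer.empty?
  end
end

# Returns [number of lines, number of chunk reads that returned data].
def count_lines_and_reads(path, chunk_size = 8192)
  File.open(path, 'rb') do |f|
    reader = CountingLineReader.new(f, chunk_size)
    lines = 0
    reader.each_line { |_line| lines += 1 }
    [lines, reader.reads]
  end
end

# The experiment from the mail: 8192 newline characters, read line by line.
Tempfile.create('newlines') do |tmp|
  tmp.write("\n" * 8192)
  tmp.flush
  lines, reads = count_lines_and_reads(tmp.path)
  puts "#{lines} lines, #{reads} chunk read(s) that returned data"
end
```

Passing a smaller chunk_size (say 2) to count_lines_and_reads makes the read count climb, which is the same trade-off a language runtime's buffer size controls.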