> One more follow-up and then it's bedtime for me.  I wanted to take this
> discussion a little further by applying the mmap solution I used for perl
> to ruby as well.  Now, all of a sudden, ruby is much, much faster.  My
> ruby source code follows:
> # ruby -W0 ./doit.rb | md5
> 786be54356a5832dcd1148c18de71fc8
> # perl ./ | md5
> 786be54356a5832dcd1148c18de71fc8
> # truss -c ruby -W0 ./doit.rb
>                       ------------- ------- -------
>                         0.014111502    1855     260
> # truss -c perl ./
>                       ------------- ------- -------
>                         0.049820267     777      52
> -------------------------------------
> require 'mmap';
> stopwords = {}
> mmap_s = Mmap.new('stopwords.txt')
> mmap_s.advise(Mmap::MADV_SEQUENTIAL)
> mmap_s.each_line do |s|
>   s.strip!
>   stopwords[s] =1
> end
> count = {}
> mmap_c = Mmap.new('words.txt')
> mmap_c.advise(Mmap::MADV_SEQUENTIAL)
> mmap_c.each_line do |s|
>   s.strip!
>   if ! stopwords.has_key?(s)
>     if count.has_key?(s)
>        count[s] += 1
>     else
>        count[s] = 1
>     end
>   end
> end
> z = count.sort {|a1,a2| a2[1]<=>a1[1]}
> z.take(20).each do |s| puts "#{s[0]} -> #{s[1]}" end
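For anyone without the mmap extension installed, the same counting logic can be sketched with only the standard library. The method name `top_words` and its parameters are made up for illustration. This is not the code that was timed above, and it goes through ordinary buffered I/O rather than mmap, so it says nothing about the speed difference.

```ruby
require 'set'

# Hypothetical helper: count every line of words_path that is not listed in
# stopwords_path, and return the `limit` most frequent [word, count] pairs.
def top_words(words_path, stopwords_path, limit = 20)
  stopwords = Set.new
  File.foreach(stopwords_path) { |line| stopwords << line.strip }

  counts = Hash.new(0)                 # missing keys start at 0
  File.foreach(words_path) do |line|
    word = line.strip
    counts[word] += 1 unless stopwords.include?(word)
  end

  # Highest count first.
  counts.sort_by { |_word, n| -n }.first(limit)
end
```

Calling `top_words('words.txt', 'stopwords.txt').each { |w, n| puts "#{w} -> #{n}" }` should print in the same format as the script above.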
>>> Do you mean by undef $/ and with <$fh> we can read the file into memory
>>> at one time?
>> In most cases the short answer is yes.
>> Geek that I am, though, I have a problem with your wording.
>> 'At one time' .... not quite.  In your example the program made over 4000
>> read(2) syscalls to the operating system, for instance.  That is hardly
>> 'at one time'.  ;)
>>> Yes that would be faster b/c we don't need to read file by each line,
>>> which increases the disk IO.
>> It actually doesn't make it faster.
>> Perl buffers its reads, as do all modern programming languages.  If you
>> ask perl to give you 10 bytes it certainly will, but what you don't see is
>> that perl has really read up to 8192 bytes.  It only gave you what you
>> asked for, and the rest is sitting in perl's buffers.
>> To put this another way, you can put 8192 newline characters in a file
>> and read this file line by line.  This doesn't equate to 8192 separate
>> read(2) syscalls ... it's just 1 read syscall.  It will be neither faster
>> nor slower.
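The 8192-newline example can be tried out with a rough sketch like the one below. It is written in Ruby rather than perl, the helper name `lines_vs_reads` is invented here, and the 8192-byte request size is simply the figure quoted above, not a claim about the buffer size any particular interpreter uses. `IO#sysread` bypasses Ruby's own buffering, so each call stands in for one read(2)-style request.

```ruby
require 'tempfile'

# Hypothetical helper: returns [lines seen by buffered line-by-line reading,
# number of raw sysread requests that returned data].
def lines_vs_reads(path, buf_size = 8192)
  lines = 0
  File.foreach(path) { lines += 1 }    # buffered, line by line

  reads = 0
  File.open(path, 'rb') do |f|
    begin
      loop do
        f.sysread(buf_size)            # unbuffered request for buf_size bytes
        reads += 1
      end
    rescue EOFError
      # sysread raises EOFError at end of file; nothing left to count
    end
  end

  [lines, reads]
end

Tempfile.create('newlines') do |f|
  f.write("\n" * 8192)                 # 8192 one-byte lines
  f.flush
  p lines_vs_reads(f.path)             # expected on a local file: [8192, 1]
end
```

The point is only that the number of lines and the number of kernel requests are unrelated; running the real script under truss or strace remains the way to see what actually happens.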
>>> Other questions:
>>> 1. what's the "truss" command?
>> truss is akin to strace.  If you're on linux, you can install strace and
>> get much the same kind of utility.
>> It allows you to trace system calls and see how much of a given program's
>> time is spent waiting on the kernel and how often it asks the kernel to do
>> something.
>>> 2. what's the syntax "<:mmap"?
>> mmap is a method of mapping a file (among other things) into memory on
>> an on-demand basis.
>> Given the example you provided, this is actually where the speed up comes
>> from.  This is because my version removes the 4000+ read(2) syscalls in
>> favor of just 2 mmap(2) syscalls.
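As a back-of-the-envelope check on those numbers, the count of buffered reads grows with file size while the mmap count stays at one per file. The sketch below assumes every read fills a full 8192-byte buffer (the figure mentioned earlier in the thread) and ignores the final read that reports end of file. `estimated_reads` is a made-up name, and real counts from truss or strace will differ somewhat.

```ruby
# Hypothetical helper: how many full-buffer reads a file of `size` bytes needs,
# assuming each read(2) returns buf_size bytes until the data runs out.
def estimated_reads(size, buf_size = 8192)
  (size + buf_size - 1) / buf_size     # integer ceiling division
end

# 4000 full buffers is a little over 32 million bytes of input.
puts estimated_reads(4000 * 8192)      # prints 4000
puts estimated_reads(8193)             # prints 2 (one full buffer plus one byte)
```

By contrast, mapping the same file costs one mmap(2) call however large it is, which matches the "just 2" figure for the two input files.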
