I have a script (below) which attempts to build an index of all the
man pages on my system. It takes a while, mostly because it runs man
over and over... but anyway, as time goes on the memory usage climbs
steadily and never comes back down. Eventually it runs out of RAM and
starts thrashing the swap space, pretty much grinding to a halt.

The workaround would seem to be to index documents in batches in
background processes, shutting down the indexing process every so
often so its memory gets reclaimed (a rough sketch of what I mean
follows). I'm about to try that, because I'm really hunting a
different bug... however, the memory growth concerns me.
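
Roughly what I have in mind, as an untested sketch: each batch is
indexed inside a forked child, so whatever memory the writer
accumulates is released to the OS when the child exits. The batch size
is arbitrary, only :name and :data are indexed, and the field setup
from the script below is left out for brevity.

require 'rubygems'
require 'ferret'

dir        = "temp_index"
files      = Dir["/usr/share/man/*/*.gz"]
batch_size = 500   # arbitrary; tune to taste

(0...files.size).step(batch_size) do |start|
   # Fork a child per batch so the writer's memory goes away with the child.
   pid = fork do
      # Relying on the writer's default :create_if_missing behaviour so
      # later batches append to the existing index.
      writer = Ferret::Index::IndexWriter.new(:path => dir)
      files[start, batch_size].each do |manfile|
         all, name, section = */\A(.*)\.([^.]+)\Z/.match(File.basename(manfile, ".gz"))
         text = `man #{section} #{name}`.gsub(/.[\b]/m, '')   # strip overstriking
         writer << { :data => text, :name => name }
      end
      writer.close
   end
   Process.wait(pid)
end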


require 'rubygems'
require 'ferret'
require 'set'

dir = "temp_index"

# Optional: -p PREFIX restricts the run to man pages whose names start with PREFIX.
if ARGV.first == "-p"
   ARGV.shift
   prefix = ARGV.shift
end

# Field setup: :name is stored; the content fields are indexed with
# positional term vectors but not stored.
fi = Ferret::Index::FieldInfos.new
fi.add_field :name,
     :index => :yes, :store => :yes, :term_vector => :with_positions

%w[data field1 field2 field3].each{|fieldname|
   fi.add_field fieldname.to_sym,
        :index => :yes, :store => :no, :term_vector => :with_positions
}

i = Ferret::Index::IndexWriter.new(:path => dir, :create => true,
                                   :field_infos => fi)

list = Dir["/usr/share/man/*/#{prefix}*.gz"]
numpages = (ARGV.last || list.size).to_i   # optional trailing arg limits the count

list[0...numpages].each{|manfile|
   all, name, section = */\A(.*)\.([^.]+)\Z/.match(File.basename(manfile, ".gz"))
   # Strip nroff overstriking (character followed by a backspace) from the
   # man output; [\b] is a literal backspace, not a word boundary.
   tttt = `man #{section} #{name}`.gsub(/.[\b]/m, '')

   i << {
      :data   => tttt.to_s,
      :name   => name,
      :field1 => name,
      :field2 => name,
      :field3 => name,
   }
}

i.close


# Reopen the index with a reader to inspect what was written.
i = Ferret::Index::IndexReader.new dir

i.max_doc.times{|n|
   # Sum the position counts across all terms in the :data term vector.
   i.term_vector(n, :data).terms \
    .inject(0){|sum, tvt| sum + tvt.positions.size } > 1_000_000 and
      puts "heinous term count for #{i[n][:name]}"
}

# Walk every term in :data and fetch its postings once.
seenterms = Set[]
begin
   i.terms(:data).each{|term, df|
      seenterms.include? term and next
      i.term_docs_for(:data, term)
      seenterms << term
   }
rescue Exception
   raise
end



