Hi all,
I'm looking at using Ferret for categorizing documents.
Essentially what I have are thousands of query rules; if a document
matches a rule, then it belongs to the category associated with that
rule. Normally we index documents and then run a query against the
index to get back the documents that match the query.
What I want to do is the inverse. I have thousands of queries and I
want to run all of them against one document at a time. The queries
that match the document essentially categorize the document into the
associated category.
Yes, I am aware that this may not be the best way to approach a
categorization problem, but it is a portion of how our current system
works, and I want to investigate ways to replace it and move on to a
better mechanism for categorization.
I'm considering using our current query language as a DSL that
generates Ruby code.
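For instance, a compiled rule might end up as a Ruby predicate, something
like this toy sketch (the "AND"-only rule syntax and the compile_rule
helper are hypothetical, not our actual FQL):

```ruby
# Toy sketch: compile a boolean keyword rule into a Ruby lambda.
# The rule syntax here ("foo AND bar") is made up for illustration.
def compile_rule(rule)
  terms = rule.split(/\s+AND\s+/)
  lambda do |text|
    # document matches only if every term appears as a whole word
    terms.all? { |t| text =~ /\b#{Regexp.escape(t)}\b/i }
  end
end

matcher = compile_rule("ruby AND ferret")
matcher.call("Using Ferret from Ruby")  # => true
```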
My first whack at using Ferret for this was essentially the
following:
doc = File.read(OPTIONS.input_file)

Ferret::I.new do |index|
  index << doc
  # run every category rule's boolean query against the one-doc index
  FasterCSV.foreach(OPTIONS.category_csv, { :headers => headers }) do |row|
    next unless row[:boolean]
    top_docs = index.search(row[:boolean])
    puts "Matches : #{row[:name]}" if top_docs.hits.size > 0
  end
end
Short and sweet, eh? Basically I'm looking for suggestions on better
ways to run thousands of Ferret queries (as FQL) against a single
document. Are there other approaches that would work better? API
calls that would do this more efficiently? A way to serialize FQL so
that it doesn't have to be re-parsed every time?
Thoughts, comments, rants, raves, brainstorms?
enjoy,
-jeremy
--
========================================================================
Jeremy Hinegardner [EMAIL PROTECTED]
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk