On 28.08.2008, at 20:03, Erik Hatcher wrote:
On Aug 28, 2008, at 1:02 PM, Jens Kraemer wrote:
What advantage does Ferret have in terms of ActiveRecord
integration that Solr wouldn't have?
If you're talking about custom analyzers being in Ruby, more on
that below.
It's not only custom analyzers, but the fact that acts_as_ferret's
DRb server runs with the full Rails application loaded, so, for
example, to bulk index a number of records aaf just hands the server
the ids and class name of the records to index, and the server does
the rest.
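To make that hand-off concrete, here is a minimal self-contained sketch of the pattern (not aaf's actual code; FakeRecord, IndexServer and bulk_index are made-up names) using Ruby's stdlib DRb: the client sends only a class name and ids, and the server looks the records up itself.

```ruby
require 'drb/drb'

# FakeRecord stands in for an ActiveRecord model (hypothetical).
class FakeRecord
  RECORDS = { 1 => 'first post', 2 => 'second post' }
  def self.find(ids)
    ids.map { |id| RECORDS[id] }
  end
end

# The server side: given only a class name and ids, it resolves the
# class and fetches the records itself before indexing them.
class IndexServer
  attr_reader :indexed

  def initialize
    @indexed = []
  end

  def bulk_index(class_name, ids)
    klass = Object.const_get(class_name)
    @indexed.concat(klass.find(ids))
    ids.size
  end
end

server = IndexServer.new
DRb.start_service('druby://localhost:0', server)  # port 0: pick a free port
client = DRbObject.new_with_uri(DRb.uri)
puts client.bulk_index('FakeRecord', [1, 2])      # => 2
DRb.stop_service
```

The payload over the wire stays tiny (a string and an array of integers), while all the model logic runs server-side.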
Gotcha. Meaning the search server is pulling from the DB directly.
That's what the DataImportHandler in Solr does as well. It'd be a
simple single HTTP request to Solr (once the DB stuff is configured,
of course) to have it do full or incremental DB indexing.
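For illustration, that single request could be built and fired from Ruby like this (a sketch assuming the standard example config, where DataImportHandler is registered at /dataimport and driven by its `command` parameter):

```ruby
require 'uri'
require 'net/http'

# Build the DataImportHandler request URL; command is 'full-import'
# or 'delta-import' for incremental indexing.
def dih_url(solr_base, command)
  URI("#{solr_base}/dataimport?command=#{command}")
end

url = dih_url('http://localhost:8983/solr', 'full-import')
# Net::HTTP.get_response(url)  # one GET against a running Solr starts the import
puts url
```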
With the slight difference that custom model logic defined in the
Rails model class is still involved: to preprocess data, to compute
index values at indexing time, or even to have certain records refuse
being indexed based on their current state. Per-document boosts
depending on some value from the database (e.g. record popularity)
are also a classic... Aaf never just pulls raw data from the db; it
always goes through Rails model objects. Doesn't make indexing faster,
of course...
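A plain-Ruby sketch of the kind of logic that lives in such a model (not the actual aaf API; Post, index_boost and indexable? are invented names): a per-document boost derived from popularity, and a predicate that lets a record refuse indexing based on its state.

```ruby
class Post
  attr_reader :views, :draft

  def initialize(views:, draft: false)
    @views = views
    @draft = draft
  end

  # Per-document boost that grows slowly with popularity.
  def index_boost
    1.0 + Math.log10(1 + views)
  end

  # Drafts refuse to be indexed.
  def indexable?
    !draft
  end
end

puts Post.new(views: 99).index_boost              # 3.0
puts Post.new(views: 0, draft: true).indexable?   # false
```

The indexer would consult these methods per record instead of reading raw rows, which is exactly why the DRb server needs the full Rails app loaded.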
[..]
XML makes me ill, generally speaking (it has its uses, but for
configuration it is just plain wrong).
FULL ACK :)
For using the built-in tokenizer/filters, a smarter acts_as_solr
could generate the right config based on a model specifying
parameters for analysis.
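A hypothetical sketch of what such a smarter plugin could do (field_type_xml is an invented helper; the solr.* class names are Solr's standard built-ins): turn per-model analysis parameters into the matching schema.xml fragment, so the developer never touches the XML by hand.

```ruby
# Generate a Solr <fieldType> definition from Ruby-level parameters.
def field_type_xml(name, tokenizer:, filters: [])
  lines = ["<fieldType name=\"#{name}\" class=\"solr.TextField\">",
           '  <analyzer>',
           "    <tokenizer class=\"#{tokenizer}\"/>"]
  filters.each { |f| lines << "    <filter class=\"#{f}\"/>" }
  lines << '  </analyzer>'
  lines << '</fieldType>'
  lines.join("\n")
end

puts field_type_xml('text_lc',
                    tokenizer: 'solr.StandardTokenizerFactory',
                    filters: ['solr.LowerCaseFilterFactory'])
```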
But even if you do that, you still have
a) half a Java project (I don't want that)
That's totally fair, and really the primary compelling reason for
choosing Ferret over Solr for pure Ruby/Rails projects. I dig that.
But isn't Ferret like 60k lines of C code too?!
true, but I don't have to compile that every time I deploy my app...
My point was that Ferret isn't just Ruby, just a counter point to
your "half a java project". No one has to recompile Solr either.
but the custom analyzer would still be implemented in Java... By
saying 'half a Java project' I didn't mean Solr, but the parts of my
application logic that have to be implemented in Java in order to be
plugged into Solr. But the JRuby route looks promising here, of course.
and b) no way to use your existing rails classes in that custom
analyzer (I *have* analyzers using rails models to retrieve
synonyms and narrower terms for thesaurus based query expansion)
You could leverage client-side query expansion with Solr... just
take the users query, massage it, and send whatever query you like
to Solr. Solr also has synonym and stop word capability too.
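Such client-side massaging could look like this minimal sketch (the synonym table and expand_query are made up for illustration): the user's query is rewritten before it ever reaches Solr.

```ruby
# Hypothetical synonym table; in the thread's scenario this data would
# come from Rails models backing a thesaurus.
SYNONYMS = {
  'car' => %w[automobile vehicle]
}.freeze

# Replace each term that has synonyms with an OR group.
def expand_query(query)
  query.split.map do |term|
    syns = SYNONYMS[term]
    syns ? "(#{([term] + syns).join(' OR ')})" : term
  end.join(' ')
end

puts expand_query('red car')  # red (car OR automobile OR vehicle)
```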
yeah, I could do that. But that moves analysis logic into my
application, which is quite contrary to the purpose of analyzers -
to encapsulate this logic and make it pluggable into the search
engine library. So fewer style points for this solution...
I was just saying :) It's debatable exactly where in the
client-server spectrum synonym expansion belongs... and it really
depends on the needs of the project. Nothing wrong with a client
doing some user input massaging before a query hits the search server.
[..]
Here's what I would do *if* I experienced severe problems with
Ferret in any of my projects:
Take aaf, replace Ferret with Lucene or even make it modular to
decide at run time which one to use, run the DRb server (or the
whole app, that depends) under JRuby and call it acts_as_lucene :-)
Et voila - great Rails integration plus Lucene's maturity. But as
long as Ferret's working fine for me that's really unlikely to
happen... Unless somebody wants to sponsor that project, of
course ;)
Just using Solr and fixing up acts_as_solr to meet your needs (if
it doesn't) would be even easier than all that :) Solr really is
a better starting point than Lucene directly, for caching,
scalability, replication, faceting, etc.
Depends on whether you need those features or not. In my experience,
lots of projects don't need them anyway, because they're running on a
single host and nearly every other part of the application is slower
than search... Maybe it's because I'm quite involved with the topic
and familiar with Lucene's API, but to me Solr looks like an
additional layer of abstraction and complexity that I only want when
it really gives me a feature I need. Plus, the last time I checked,
Lucene didn't need XML configuration files ;)
I hear ya about the XML config files. And, to be fair to Solr here,
you really only need to adjust a basic example configuration that
already covers most scenarios - so it usually isn't necessary to
touch the XML config except to tweak little things.
But I still have to read it in order to see if it fits my needs.
Okay, I'll stop whining about the XML now ;)
[..]
In development environments, and especially when it comes to
automated tests / CI, it's also quite convenient not to have to run
a separate server but to take the shortcut and access the index
directly - which isn't possible with Solr.
Not true. Solr can work embedded. There is a base SolrServer
abstraction, with an implementation that runs embedded (inside the
same JVM) versus over HTTP. Exactly the same interface for both
operations, using a very simple API (SolrJ, much like Lucene's basic
API actually).
cool, but that won't work for Rails projects running on MRI and
accessing Solr via solr-ruby.
I'd be curious to see scalability comparisons between Ferret and
Solr - or perhaps more properly between Stellr and Solr - as it
boils down to number of documents, queries per second, and
faceting and highlighting speed. I'm betting on Solr myself (by
being so into it and basing my professional life on it).
This would be interesting, but I wouldn't be too disappointed if
Stellr ended up second, given how little time I've spent building it
so far. Just out of curiosity, do you have some kind of performance
testing suite for Solr that I could throw at Stellr?
No, I don't have those kinds of tests myself. While I can speak to
Solr's performance based on what I hear from our clients and the
reports on the mailing lists, I don't consider myself a
performance-savvy person.
I'm curious - what are the numbers of documents being put into
Ferret indexes out there? millions? hundreds of millions?
billions? And are folks doing faceting? Does Ferret have faceting
support?
Not sure about the billions, but AFAIR an earlier message in this
thread stated an index size of 90 million documents with aaf.
Altlaw.org reported an index size of > 4GB with around 700k
documents last fall. The selfhtml.org index has approximately 1
million forum entries indexed, with an index size of around 2GB.
Stellr never uses more than around 50MB of RAM while indexing and
searching this index. I know RAM is cheap and all, but RAM size still
has quite a large influence on the price of the server you rent for
your app, at least here in Germany.
Without doubt Solr has many more references in the area of such large
installations than ferret/aaf. I myself never saw aaf as a drop-in
solution for indexes of this size, but more as an easy-to-use,
out-of-the-box solution for the average Rails app with maybe several
thousand or tens of thousands of records - but I'm happy to see it
still works in larger-scale setups.
Heck, it all began with a simple full text search for my blog ;)
Regarding the faceting - it's not built into Ferret, and aaf doesn't
support it either, since I haven't needed it yet and nobody else has
requested the feature so far. All in all, I think the average usage
scenarios of Solr and aaf are quite different atm...
I'll try to find the time to benchmark the selfhtml.org data set with
Solr and Stellr, and I'll report my findings here.
Cheers,
Jens
--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk