On 28.08.2008, at 20:03, Erik Hatcher wrote:


On Aug 28, 2008, at 1:02 PM, Jens Kraemer wrote:
What advantage does Ferret have in terms of ActiveRecord integration that Solr wouldn't have?

If you're talking about custom analyzers being in Ruby, more on that below.

It's not only custom analyzers, but the fact that acts_as_ferret's DRb server runs with the full Rails application loaded. So, to bulk index a number of records, aaf just hands the server the ids and class name of the records to index, and the server does the rest.

Gotcha. Meaning the search server is pulling from the DB directly. That's what the DataImportHandler in Solr does as well. It'd be a single HTTP request to Solr (once the DB stuff is configured, of course) to have it do full or incremental DB indexing.

With the slight difference that custom model logic defined in the Rails model class is still involved: to preprocess data, to compute index values at indexing time, or even to have certain records refuse to be indexed based on their current state. Per-document boosts depending on some value from the database (e.g. record popularity) are also a classic... aaf never just pulls data from the db, it always goes through Rails model objects. Doesn't make indexing faster, of course...
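
A hypothetical sketch of this pattern (not actual aaf code; `Article` stands in for an ActiveRecord model, and `bulk_index` for what the DRb server does with the class name and ids it receives):

```ruby
class Article
  RECORDS = {}  # simulating the database table

  attr_reader :id, :title, :draft, :popularity

  def initialize(id, title, draft: false, popularity: 1.0)
    @id, @title, @draft, @popularity = id, title, draft, popularity
    RECORDS[id] = self
  end

  def self.find(ids)
    ids.map { |i| RECORDS[i] }
  end

  # custom model logic: drafts refuse to be indexed
  def indexable?
    !draft
  end

  # field preprocessing plus a per-document boost taken from a DB value
  def to_doc
    { id: id, title: title.downcase, boost: popularity }
  end
end

# What the server does with the (class_name, ids) message it receives:
def bulk_index(class_name, ids)
  Object.const_get(class_name).find(ids).select(&:indexable?).map(&:to_doc)
end

Article.new(1, "Ferret vs Solr", popularity: 2.5)
Article.new(2, "Unfinished Draft", draft: true)
docs = bulk_index("Article", [1, 2])
# docs contains only the non-draft record, title preprocessed, boost 2.5
```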

[..]
XML makes me ill, generally speaking (it has its uses, but for configuration it is just plain wrong).

FULL ACK :)

For the built-in tokenizers and filters, a smarter acts_as_solr could generate the right config from analysis parameters specified on the model.
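
A sketch of what such config generation might look like - the Ruby method here is made up, but the tokenizer and filter factory class names are real Solr ones:

```ruby
# Hypothetical helper: turn analysis parameters declared in Ruby into the
# matching <fieldType> snippet for Solr's schema.xml.
def field_type_xml(name, tokenizer:, filters: [])
  lines = []
  lines << %(<fieldType name="#{name}" class="solr.TextField">)
  lines << %(  <analyzer>)
  lines << %(    <tokenizer class="#{tokenizer}"/>)
  filters.each { |f| lines << %(    <filter class="#{f}"/>) }
  lines << %(  </analyzer>)
  lines << %(</fieldType>)
  lines.join("\n")
end

snippet = field_type_xml("text",
  tokenizer: "solr.StandardTokenizerFactory",
  filters: ["solr.LowerCaseFilterFactory", "solr.StopFilterFactory"])
```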

But even if you do that, you still have
a) half a Java project (I don't want that)

That's totally fair, and really the primary compelling reason for a Ferret over Solr for pure Ruby/Rails projects. I dig that.

But isn't Ferret like 60k lines of C code, too?!

true, but I don't have to compile that every time I deploy my app...

My point was that Ferret isn't just Ruby - just a counterpoint to your "half a Java project". No one has to recompile Solr either.

but the custom analyzer has to be implemented in Java... By saying 'half a Java project' I didn't mean Solr itself, but the parts of my application logic that would have to be implemented in Java in order to be plugged into Solr. The JRuby route looks promising here, of course.

and b) no way to use your existing Rails classes in that custom analyzer (I *have* analyzers that use Rails models to retrieve synonyms and narrower terms for thesaurus-based query expansion)

You could leverage client-side query expansion with Solr... just take the user's query, massage it, and send whatever query you like to Solr. Solr also has synonym and stop word capabilities.
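
A minimal sketch of that client-side expansion, with an in-memory hash standing in for a Rails model or thesaurus table:

```ruby
# Massage the user's query with synonyms before it reaches the search server.
SYNONYMS = {
  "car"  => ["automobile", "auto"],
  "film" => ["movie"]
}

def expand_query(query)
  query.split.map { |term|
    syns = SYNONYMS.fetch(term.downcase, [])
    syns.empty? ? term : "(#{([term] + syns).join(' OR ')})"
  }.join(" ")
end

expand_query("classic car")  # => "classic (car OR automobile OR auto)"
```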

Yeah, I could do that. But that moves analysis logic into my application, which is quite contrary to the purpose of analyzers - to encapsulate this logic and make it pluggable into the search engine library. So fewer style points for this solution...

I was just saying :) It's debatable exactly where in the client-server spectrum synonym expansion belongs... and it really depends on the needs of the project. There's nothing wrong with a client doing some massaging of user input before a query hits the search server.

[..]

Here's what I would do *if* I experienced severe problems with Ferret in any of my projects:

Take aaf, replace Ferret with Lucene (or even make it modular so you can decide at run time which one to use), run the DRb server (or the whole app, depending) under JRuby, and call it acts_as_lucene :-) Et voilà - great Rails integration plus Lucene's maturity. But as long as Ferret's working fine for me, that's really unlikely to happen... unless somebody wants to sponsor that project, of course ;)

Just using Solr and fixing up acts_as_solr to meet your needs (if it doesn't) would be even easier than all that :) Solr really is a better starting point than Lucene directly, for caching, scalability, replication, faceting, etc.

Depends on whether you need these features or not. From my experience, lots of projects don't need them anyway, because they run on a single host and nearly every other part of the application is slower than search... Maybe it's because I'm quite involved with the topic and familiar with Lucene's API, but to me Solr looks like an additional layer of abstraction and complexity that I only want when it really gives me a feature I need. Plus, the last time I checked, Lucene didn't need XML configuration files ;)

I hear ya about the XML config files. And to be fair to Solr here: you really only need to start from a basic example configuration that already covers most scenarios - so it isn't necessary to even touch the XML config except to tweak little things.

But I still have to read it in order to see whether it fits my needs. Okay, I'll stop whining about the XML now ;)

[..]
In development environments, and especially when it comes to automated tests / CI, it's also quite convenient not to have to run a separate server but to take the shortcut of accessing the index directly, which isn't possible with Solr.

Not true. Solr can work embedded. There is a base SolrServer abstraction with one implementation that runs embedded (inside the same JVM) and one that talks over HTTP. Both expose exactly the same interface, via a very simple API (SolrJ - much like Lucene's basic API, actually).

Cool, but that won't work for Rails projects running on MRI and accessing Solr via solr-ruby.

I'd be curious to see scalability comparisons between Ferret and Solr - or perhaps more properly between Stellr and Solr - as it boils down to number of documents, queries per second, and faceting and highlighting speed. I'm betting on Solr myself (by being so into it and basing my professional life on it).

This would be interesting, but I wouldn't be that disappointed if Stellr ended up second, given the little time I've spent building it so far. Just out of curiosity, do you have some kind of performance testing suite for Solr that I could throw at Stellr?
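
Absent a ready-made suite, the core number such a comparison boils down to - wall-clock queries per second - can be measured with a harness as small as this (the `search` stub stands in for a real call into Stellr or Solr):

```ruby
def search(query)
  query.length  # stand-in for the real search call
end

# Fire queries round-robin for a fixed wall-clock duration and count them.
def queries_per_second(queries, seconds: 1.0)
  count = 0
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    search(queries[count % queries.size])
    count += 1
  end
  (count / seconds).round
end

qps = queries_per_second(["ferret", "solr", "lucene"], seconds: 0.2)
```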

No, I don't have those kinds of tests. While I can speak to Solr's performance based on what I hear from our clients and from reports on the mailing lists, I don't consider myself a performance-savvy person.

I'm curious - what are the numbers of documents being put into Ferret indexes out there? millions? hundreds of millions? billions? And are folks doing faceting? Does Ferret have faceting support?

Not sure about the billions, but as far as I remember an earlier message in this thread stated an index of 90 million documents with aaf. AltLaw.org reported an index size of more than 4 GB with around 700k documents last fall. The selfhtml.org index has approximately 1 million forum entries indexed, with an index size of around 2 GB. Stellr never uses more than about 50 MB of RAM while indexing and searching this index. I know RAM is cheap and all, but RAM size still has quite a large influence on the price of the server you rent for your app, at least here in Germany.

Without doubt Solr has many more references in the area of such large installations than Ferret/aaf. I myself never saw aaf as a drop-in solution for indexes of this size, but rather as an easy-to-use, out-of-the-box solution for the average Rails app with maybe several thousand or tens of thousands of records - but I'm happy to see it still works in larger-scale setups.

Heck, it all began with a simple full-text search for my blog ;)

Regarding faceting - it's not built into Ferret, and aaf doesn't support it either, since I didn't need it yet and nobody else has requested the feature so far. All in all, I think the average usage scenarios of Solr and aaf are quite different at the moment...
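
For reference, faceting is essentially counting distinct field values over a result set. A naive client-side sketch of the idea (a real engine would count against the index, not against loaded documents):

```ruby
# Count how often each value of `field` occurs in the search results.
def facet_counts(results, field)
  results.each_with_object(Hash.new(0)) { |doc, counts|
    counts[doc[field]] += 1
  }
end

results = [
  { title: "A", category: "forum" },
  { title: "B", category: "forum" },
  { title: "C", category: "wiki"  }
]
facet_counts(results, :category)  # => {"forum"=>2, "wiki"=>1}
```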

I'll try to find the time to benchmark the selfhtml.org data set with both Solr and Stellr, and I'll report my findings here.

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
