On Aug 28, 2008, at 3:02 PM, Jens Kraemer wrote:
Gotcha. Meaning the search server is pulling from the DB directly. That's what the DataImportHandler in Solr does as well. It'd be a simple single HTTP request to Solr (once the DB stuff is configured, of course) to have it do full or incremental DB indexing.
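
For anyone following along, here is a rough sketch of what that single request could look like. It only builds the URLs. The host, port, and the /solr/dataimport path are assumptions about a local setup (the handler path is whatever you register in solrconfig.xml), and dataimport_url is a made-up helper name.

```ruby
require "uri"

# Build the URL for a DataImportHandler command.
# Assumed defaults: Solr on localhost:8983 with the handler
# registered at /solr/dataimport; adjust to your own configuration.
def dataimport_url(command, host: "localhost", port: 8983, path: "/solr/dataimport")
  query = URI.encode_www_form(command: command)
  URI::HTTP.build(host: host, port: port, path: path, query: query)
end

full  = dataimport_url("full-import")   # reindex everything
delta = dataimport_url("delta-import")  # pick up changed rows only

puts full
puts delta
```

Fetching either URL (for example with Net::HTTP.get) is what would kick off the import on a running Solr.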

With the slight difference that custom model logic defined in the Rails model class is still involved: to preprocess data, to index values calculated at indexing time, or even to have certain records refuse to be indexed based on their current state. Having per-document boosts depending on some value from the database (e.g. record popularity) is also a classic... aaf never just pulls data from the DB, it always uses Rails model objects. Doesn't make indexing faster, of course...
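
To make that concrete, a minimal plain-Ruby sketch of the idea follows. It is not acts_as_ferret's actual API: the Article struct, indexable?, to_index_doc, docs_for_index, and the boost formula are all hypothetical names and choices, made up only to show model logic deciding what goes into the index.

```ruby
# Hypothetical stand-in for a Rails model (not the real aaf API).
Article = Struct.new(:title, :body, :published, :views) do
  # A record can refuse indexing based on its current state.
  def indexable?
    published
  end

  # A value calculated at indexing time rather than stored in the DB.
  def word_count
    body.split.size
  end

  # Per-document boost derived from popularity (assumed formula).
  def boost
    1.0 + Math.log10(1 + views)
  end

  # The document handed to the indexer.
  def to_index_doc
    { title: title, body: body, word_count: word_count, boost: boost }
  end
end

# Only records that agree to be indexed produce documents.
def docs_for_index(records)
  records.select(&:indexable?).map(&:to_index_doc)
end

articles = [
  Article.new("Ferret tips", "Index all the things", true, 99),
  Article.new("Draft", "Not ready yet", false, 0)
]
docs = docs_for_index(articles)
```

Here the unpublished draft never reaches the index, and the published article carries a boost that grows with its view count.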

All great points. ActiveRecord is much more pleasant than any other database access that I've ever worked with. I don't generally work with databases personally, though. The bulk of my full-text searching experiences don't involve databases at all.

I suppose the Java counterpart would be Hibernate Search - surely involving a lot more hideous XML and @annotations - ewww.


In development environments, and especially when it comes to automated tests / CI, it's also quite comfortable not having to run a separate server but taking the shortcut directly to the index, which isn't possible with Solr.

Not true. Solr can work embedded. There is a base SolrServer abstraction, with an implementation that runs embedded (inside the same JVM) versus over HTTP. Exactly the same interface for both operations, using a very simple API (SolrJ, much like Lucene's basic API actually).

Cool, but that won't work for Rails projects running on MRI and accessing Solr via solr-ruby.

Fair point.

Again, the answer comes back to JRuby ;) Forget MRI. Good point about solr-ruby - it is specifically designed for Solr over HTTP. It wouldn't take much to refactor it to work with embedded Solr via JRuby though. But if JRuby is a given, it'd be just as easy to work with SolrJ's API directly.

Though for testing purposes, solr-ruby is easily mocked. solr-ruby touts great (98% or something like that) code coverage with unit tests; many of those tests are against solr-ruby's API with Solr itself mocked. There are also tests that fire up Solr in the background for full functional testing. So for unit testing purposes, having Solr running isn't needed, but it launches plenty fast enough for testing end-to-end if desired.
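
As a rough illustration of the mocking approach, here is a hand-rolled fake in plain Ruby. Everything in it is hypothetical: FakeSolrConnection, PostIndexer, and the add/query method names are stand-ins chosen for the sketch, not a statement of solr-ruby's real interface.

```ruby
# Hypothetical in-memory stand-in for a search connection.
# The method names are assumptions made for this sketch.
class FakeSolrConnection
  attr_reader :added

  def initialize
    @added = []
  end

  # Record the document instead of sending it over HTTP.
  def add(doc)
    @added << doc
    true
  end

  # Naive case-insensitive substring match over all field values.
  def query(text)
    needle = text.downcase
    @added.select do |doc|
      doc.values.any? { |value| value.to_s.downcase.include?(needle) }
    end
  end
end

# Application code under test; it only needs something that responds to add.
class PostIndexer
  def initialize(connection)
    @connection = connection
  end

  def index(post)
    @connection.add({ id: post[:id], title: post[:title] })
  end
end

fake = FakeSolrConnection.new
PostIndexer.new(fake).index({ id: 1, title: "Hello Ferret", body: "ignored" })
```

Because the indexer takes its connection as a constructor argument, the unit test never needs a running server.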

I'm curious - what are the numbers of documents being put into Ferret indexes out there? millions? hundreds of millions? billions? And are folks doing faceting? Does Ferret have faceting support?

Not sure about the billions, but afair an earlier message in this thread stated an index size of 90 million documents with aaf. Altlaw.org reported an index size of > 4GB with around 700k documents last fall. The selfhtml.org index has approximately 1 million forum entries indexed, index size around 2GB. Stellr never uses more than around 50MB of RAM while indexing and searching this index. I know RAM is cheap and all, but RAM size still has quite a large influence on the price of the server you rent for your app, at least here in Germany.

90 million is impressive for sure.

RAM - well, when Ferret/Stellr does faceting we'll revisit that discussion :) Solr loves RAM! It can still run in modest environments, but the more RAM you can give it for caches (depending on your needs), the better.

Without doubt Solr has many more references for such large installations than ferret/aaf. I myself never saw aaf as a drop-in solution for indexes of this size, but more as an easy-to-use, out-of-the-box solution for the average Rails app with maybe several thousand or tens of thousands of records. I'm happy to see it still works in larger-scale setups, though.

Indeed!  ferret: +1 - no question!

Heck, it all began with a simple full text search for my blog ;)

Same for me (though I abandoned it when I realized that regular blogging and server maintenance weren't for me).

Regarding the faceting - it's not built into ferret, and aaf doesn't support it either, since I haven't needed it yet and nobody else has requested the feature so far. All in all I think the average usage scenarios of Solr and aaf are quite different atm...

I'm really surprised by that. Faceting is the major feature that attracts folks to Solr. It's critical for all of our customers.

But yeah, no question that Lucene/Solr and Ferret/Stellr can happily coexist and aren't necessarily competition for every project. But there definitely are those areas of overlap where a project could go with either solution. And I would definitely not try to shoehorn Solr into a project where it didn't fit and Ferret worked fine. I'm pragmatic like that.

I'll try to find the time to benchmark the selfhtml.org data set with solr and stellr. I'll report my findings here.

Awesome. If you have the data in some easily digestible format, I'd be happy to toss it into Solr and report back numbers from my development machine. Drop me a line offline if you'd like.

        Erik

_______________________________________________
Ferret-talk mailing list
http://rubyforge.org/mailman/listinfo/ferret-talk