On 28.08.2008, at 20:03, Erik Hatcher wrote:


On Aug 28, 2008, at 1:02 PM, Jens Kraemer wrote:
What advantage does Ferret have in terms of ActiveRecord integration that Solr wouldn't have?

If you're talking about custom analyzers being in Ruby, more on that below.

It's not only custom analyzers, but the fact that acts_as_ferret's DRb server runs with the full Rails application loaded. So, to bulk index a number of records, aaf just hands the server the ids and class name of the records to index, and the server does the rest.

Gotcha. Meaning the search server is pulling from the DB directly. That's what the DataImportHandler in Solr does as well. It'd be a single HTTP request to Solr (once the DB stuff is configured, of course) to have it do full or incremental DB indexing.

With the slight difference that custom model logic defined in the Rails model class is still involved: to preprocess data, to compute index values at indexing time, or even to have certain records refuse to be indexed based on their current state. Per-document boosts depending on some value from the database (e.g. record popularity) are also a classic... aaf never just pulls data from the db, it always goes through Rails model objects. Doesn't make indexing faster, of course...
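
A hypothetical sketch of this pattern (not actual aaf code; `Article` stands in for an ActiveRecord model, and `bulk_index` for what the DRb server does with the class name and ids it receives):

```ruby
class Article
  RECORDS = {}  # simulating the database table

  attr_reader :id, :title, :draft, :popularity

  def initialize(id, title, draft: false, popularity: 1.0)
    @id, @title, @draft, @popularity = id, title, draft, popularity
    RECORDS[id] = self
  end

  def self.find(ids)
    ids.map { |i| RECORDS[i] }
  end

  # custom model logic: drafts refuse to be indexed
  def indexable?
    !draft
  end

  # field preprocessing plus a per-document boost taken from a DB value
  def to_doc
    { id: id, title: title.downcase, boost: popularity }
  end
end

# What the server does with the (class_name, ids) message it receives:
def bulk_index(class_name, ids)
  Object.const_get(class_name).find(ids).select(&:indexable?).map(&:to_doc)
end

Article.new(1, "Ferret vs Solr", popularity: 2.5)
Article.new(2, "Unfinished Draft", draft: true)
docs = bulk_index("Article", [1, 2])
# docs contains only the non-draft record, title preprocessed, boost 2.5
```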

[..]
XML makes me ill, generally speaking (it has its uses, but for configuration it is just plain wrong).

FULL ACK :)

For the built-in tokenizers and filters, a smarter acts_as_solr could generate the right config from analysis parameters specified on the model.
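
A sketch of what such config generation might look like - the Ruby method here is made up, but the tokenizer and filter factory class names are real Solr ones:

```ruby
# Hypothetical helper: turn analysis parameters declared in Ruby into the
# matching <fieldType> snippet for Solr's schema.xml.
def field_type_xml(name, tokenizer:, filters: [])
  lines = []
  lines << %(<fieldType name="#{name}" class="solr.TextField">)
  lines << %(  <analyzer>)
  lines << %(    <tokenizer class="#{tokenizer}"/>)
  filters.each { |f| lines << %(    <filter class="#{f}"/>) }
  lines << %(  </analyzer>)
  lines << %(</fieldType>)
  lines.join("\n")
end

snippet = field_type_xml("text",
  tokenizer: "solr.StandardTokenizerFactory",
  filters: ["solr.LowerCaseFilterFactory", "solr.StopFilterFactory"])
```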

But even if you do that, you still have
a) half a Java project (I don't want that)

That's totally fair, and really the primary compelling reason for a Ferret over Solr for pure Ruby/Rails projects. I dig that.

But isn't Ferret like 60k lines of C code, too?!

true, but I don't have to compile that every time I deploy my app...

My point was that Ferret isn't just Ruby - just a counterpoint to your "half a Java project". No one has to recompile Solr either.

but the custom analyzer has to be implemented in Java... By saying 'half a Java project' I didn't mean Solr itself, but the parts of my application logic that would have to be implemented in Java in order to be plugged into Solr. The JRuby route looks promising here, of course.

and b) no way to use your existing Rails classes in that custom analyzer (I *have* analyzers that use Rails models to retrieve synonyms and narrower terms for thesaurus-based query expansion)

You could leverage client-side query expansion with Solr... just take the user's query, massage it, and send whatever query you like to Solr. Solr also has synonym and stop word capabilities.
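
A minimal sketch of that client-side expansion, with an in-memory hash standing in for a Rails model or thesaurus table:

```ruby
# Massage the user's query with synonyms before it reaches the search server.
SYNONYMS = {
  "car"  => ["automobile", "auto"],
  "film" => ["movie"]
}

def expand_query(query)
  query.split.map { |term|
    syns = SYNONYMS.fetch(term.downcase, [])
    syns.empty? ? term : "(#{([term] + syns).join(' OR ')})"
  }.join(" ")
end

expand_query("classic car")  # => "classic (car OR automobile OR auto)"
```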

Yeah, I could do that. But that moves analysis logic into my application, which is quite contrary to the purpose of analyzers - to encapsulate this logic and make it pluggable into the search engine library. So fewer style points for this solution...

I was just saying :) It's debatable exactly where in the client-server spectrum synonym expansion belongs... and it really depends on the needs of the project. There's nothing wrong with a client doing some massaging of user input before a query hits the search server.

[..]

Here's what I would do *if* I experienced severe problems with Ferret in any of my projects:

Take aaf, replace Ferret with Lucene (or even make it modular so you can decide at run time which one to use), run the DRb server (or the whole app, depending) under JRuby, and call it acts_as_lucene :-) Et voilà - great Rails integration plus Lucene's maturity. But as long as Ferret's working fine for me, that's really unlikely to happen... unless somebody wants to sponsor that project, of course ;)

Just using Solr and fixing up acts_as_solr to meet your needs (if it doesn't) would be even easier than all that :) Solr really is a better starting point than Lucene directly, for caching, scalability, replication, faceting, etc.

Depends on whether you need these features or not. From my experience, lots of projects don't need them anyway, because they run on a single host and nearly every other part of the application is slower than search... Maybe it's because I'm quite involved with the topic and familiar with Lucene's API, but to me Solr looks like an additional layer of abstraction and complexity that I only want when it really gives me a feature I need. Plus, the last time I checked, Lucene didn't need XML configuration files ;)

I hear ya about the XML config files. And to be fair to Solr here: you really only need to start from a basic example configuration that already covers most scenarios - so it isn't necessary to even touch the XML config except to tweak little things.

But I still have to read it in order to see whether it fits my needs. Okay, I'll stop whining about the XML now ;)

[..]
In development environments, and especially when it comes to automated tests / CI, it's also quite convenient not to have to run a separate server but to take the shortcut of accessing the index directly, which isn't possible with Solr.

Not true. Solr can work embedded. There is a base SolrServer abstraction with one implementation that runs embedded (inside the same JVM) and one that talks over HTTP. Both expose exactly the same interface, via a very simple API (SolrJ - much like Lucene's basic API, actually).

Cool, but that won't work for Rails projects running on MRI and accessing Solr via solr-ruby.

I'd be curious to see scalability comparisons between Ferret and Solr - or perhaps more properly between Stellr and Solr - as it boils down to number of documents, queries per second, and faceting and highlighting speed. I'm betting on Solr myself (by being so into it and basing my professional life on it).

This would be interesting, but I wouldn't be that disappointed if Stellr ended up second, given the little time I've spent building it so far. Just out of curiosity, do you have some kind of performance testing suite for Solr that I could throw at Stellr?
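
Absent a ready-made suite, the core number such a comparison boils down to - wall-clock queries per second - can be measured with a harness as small as this (the `search` stub stands in for a real call into Stellr or Solr):

```ruby
def search(query)
  query.length  # stand-in for the real search call
end

# Fire queries round-robin for a fixed wall-clock duration and count them.
def queries_per_second(queries, seconds: 1.0)
  count = 0
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    search(queries[count % queries.size])
    count += 1
  end
  (count / seconds).round
end

qps = queries_per_second(["ferret", "solr", "lucene"], seconds: 0.2)
```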

No, I don't have those kinds of tests. While I can speak to Solr's performance based on what I hear from our clients and from reports on the mailing lists, I don't consider myself a performance-savvy person.

I'm curious - what are the numbers of documents being put into Ferret indexes out there? millions? hundreds of millions? billions? And are folks doing faceting? Does Ferret have faceting support?

Not sure about the billions, but as far as I remember an earlier message in this thread stated an index of 90 million documents with aaf. AltLaw.org reported an index size of more than 4 GB with around 700k documents last fall. The selfhtml.org index has approximately 1 million forum entries indexed, with an index size of around 2 GB. Stellr never uses more than about 50 MB of RAM while indexing and searching this index. I know RAM is cheap and all, but RAM size still has quite a large influence on the price of the server you rent for your app, at least here in Germany.

Without doubt Solr has many more references in the area of such large installations than Ferret/aaf. I myself never saw aaf as a drop-in solution for indexes of this size, but rather as an easy-to-use, out-of-the-box solution for the average Rails app with maybe several thousand or tens of thousands of records - but I'm happy to see it still works in larger-scale setups.

Heck, it all began with a simple full-text search for my blog ;)

Regarding faceting - it's not built into Ferret, and aaf doesn't support it either, since I didn't need it yet and nobody else has requested the feature so far. All in all, I think the average usage scenarios of Solr and aaf are quite different at the moment...
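
For reference, faceting is essentially counting distinct field values over a result set. A naive client-side sketch of the idea (a real engine would count against the index, not against loaded documents):

```ruby
# Count how often each value of `field` occurs in the search results.
def facet_counts(results, field)
  results.each_with_object(Hash.new(0)) { |doc, counts|
    counts[doc[field]] += 1
  }
end

results = [
  { title: "A", category: "forum" },
  { title: "B", category: "forum" },
  { title: "C", category: "wiki"  }
]
facet_counts(results, :category)  # => {"forum"=>2, "wiki"=>1}
```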

I'll try to find the time to benchmark the selfhtml.org data set with both Solr and Stellr, and I'll report my findings here.

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
