We have several 3GB indexes with approximately 1 million documents in
each. Here are some quick notes; feel free to reach out with other
questions:

* no corruption problems that weren't our fault.
* there was an issue with large index files (> ~2GB) that was patched,
but I'm honestly not sure whether the patch made it into trunk, as the
Ferret trac/svn is frequently MIA (which is a concern, of course)
* the code is clear and fairly easy to follow. AAF (acts_as_ferret) is
very easy to follow.
* I've been very happy with the performance of the actual
indexing/searching; however, you need to watch out for the processes
that actually synchronize writes. DRb is a bottleneck for us right
now, though our volume isn't high enough that I'd call it a real
problem yet.
* for moderately high-volume sites you'll want to consider batching
index updates "offline", though for large indexes make sure that you
have enough IO capacity to optimize the index. We host on EC2, and the
$0.10/hour instances simply do not have anywhere near the IO capacity
to optimize a large index without leaving _every other process_
waiting for IO. I haven't tested the larger instance types yet.
* we love how easy and efficient it is to combine many indexes into  
one. We index tens of thousands of websites in parallel and then  
combine 100 or so indexes into one index very quickly.
* the mailing list is great. Jens is on top of things, very receptive  
to new ideas and takes *very* good care of AAF. Haven't seen Dave  
Balmain in a while.
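
To make the batching point above a bit more concrete, here is a rough sketch in plain Ruby of funnelling writes from many producers through one writer thread that flushes in batches. This is not our production code and not part of Ferret or AAF: the BatchingWriter class, its method names and the batch_size parameter are all made up for illustration. The only assumption about the index is that it responds to `<<`, so an Array stands in for it here.

```ruby
# Hypothetical helper (not a Ferret or acts_as_ferret API): producers enqueue
# documents, and a single worker thread is the only code that touches the index.
class BatchingWriter
  STOP = Object.new # sentinel that tells the worker to finish up

  attr_reader :batches

  # index: any object that responds to `<<` (assumption for this sketch)
  def initialize(index, batch_size: 100)
    @index = index
    @batch_size = batch_size
    @queue = Queue.new
    @batches = 0
    @worker = Thread.new { drain }
  end

  # Safe to call from any thread; Queue handles the locking.
  def enqueue(doc)
    @queue << doc
  end

  # Flush whatever is left and wait for the worker to exit.
  def close
    @queue << STOP
    @worker.join
  end

  private

  def drain
    batch = []
    loop do
      item = @queue.pop
      break if item.equal?(STOP)
      batch << item
      flush(batch) if batch.size >= @batch_size
    end
    flush(batch)
  end

  def flush(batch)
    return if batch.empty?
    batch.each { |doc| @index << doc }
    @batches += 1
    batch.clear
  end
end

index = [] # stand-in for a real index object
writer = BatchingWriter.new(index, batch_size: 3)
7.times { |i| writer.enqueue({ id: i, title: "doc #{i}" }) }
writer.close
puts "#{index.size} docs written in #{writer.batches} batches"
```

With a real index, the flush step is where you would decide whether and when to optimize, which is exactly the IO-heavy part mentioned above.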

Overall we are happy. Search accuracy questions come up at times, and
frequently the problem is that we are not parsing queries effectively
or not using the right analyzer for the problem at hand, so RTFM
(http://www.oreilly.com/catalog/9780596527853/).
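
One common way the "wrong analyzer" problem shows up, as far as I can tell, is analyzing text one way at index time and another way at query time. The toy below is plain Ruby, not Ferret: ToyIndex and the two analyzer lambdas are invented for illustration, and real analyzers do much more (stop words, stemming and so on).

```ruby
# Toy inverted index (hypothetical, not Ferret) mapping each term to doc ids.
class ToyIndex
  def initialize(analyzer)
    @analyzer = analyzer
    @postings = Hash.new { |hash, term| hash[term] = [] }
  end

  # Terms are stored exactly as the index-time analyzer emits them.
  def add(id, text)
    @analyzer.call(text).each { |term| @postings[term] |= [id] }
  end

  # AND query: a document must contain every query term.
  # Passing a different analyzer simulates a query-side mismatch.
  def search(query, analyzer = @analyzer)
    analyzer.call(query).map { |term| @postings.fetch(term, []) }.reduce(:&) || []
  end
end

# Splits on whitespace only: keeps case and punctuation.
WHITESPACE = ->(text) { text.split }
# Lowercases and keeps only runs of ASCII letters and digits.
LOWERCASE = ->(text) { text.downcase.scan(/[a-z0-9]+/) }

index = ToyIndex.new(LOWERCASE)
index.add(1, "Ferret Indexing, Explained")

p index.search("Ferret indexing")             # same analyzer on both sides
p index.search("Ferret indexing", WHITESPACE) # mismatched query analyzer
```

The second search comes back empty because the index only holds the lowercased term "ferret", while the whitespace analyzer looks up "Ferret" as typed.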

That's all I can think of now...

Erik

On Nov 15, 2007, at 9:37 AM, Sam Smoot wrote:

> Hello. I'm the author of DataMapper (http://datamapper.org), and am
> trying to choose what Full-Text-Indexing engine/plugin I want to
> include by default. I was hoping you guys could help. :-)
>
> Sphinx comes highly recommended, but without live index updates, it
> just doesn't seem practical for most of my work.
>
> I'm most experienced with Solr, but the whole HTTP::Request and
> general complexity of it is off-putting.
>
> I haven't used Ferret in an application yet, but I love what I see so
> far. The ability to have an in-process server in development, and the
> clean Ruby API are big wins for me. But I've heard a lot of scary
> things about corrupted indexes, even when using the DRb server. Is
> this just FUD? Are there any unresolved issues revolving around
> corrupted indexes? Can I afford to use Ferret in big applications for
> Fortune-500 clients? (I know that sounds... pompous really, but it's a
> genuine concern.)
>
> Any advice you could offer would be greatly appreciated.
>
> I've also read a few messages about serializing index requests/updates
> to Ferret through message-queues. Are there any decent
> guides/blog-posts on this topic?
>
> Thanks, -Sam
> _______________________________________________
> Ferret-talk mailing list
> [email protected]
> http://rubyforge.org/mailman/listinfo/ferret-talk

