On Mon, Apr 23, 2007 at 03:52:51AM +0200, Danny Burkes wrote:
[..]
>
> Sorry, I should have been more clear- what I was referring to was not
> storage, but rather tokenization.  My understanding is that many
> people use a simple Regex-based one-token-per-character tokenizer for
> non-Latin languages, but, since our languages are mixed, I wasn't sure
> what type of approach to tokenization would be best.  Clearly we can't
> use that one-token-per-character analyzer on latin text, right?

Right :-)

Some heuristics to detect which language you're currently dealing with
would be a good way to select a proper analysis algorithm. The Nutch
search engine (Java, Lucene-based) seems to have something like that;
possibly we could port it:

http://wiki.apache.org/nutch/MultiLingualSupport

> > However, if your search system isn't online (ie, the feature isn't
> > enabled in the front end), why would you need anything special?  The
> > AAF DRb server can serve requests while you're running a rebuild (as
> > long as you don't use the current rebuild_index method).
> >
>
> Perhaps I'm remembering incorrectly, but my recollection was that, the
> first time I created a new record for a model that uses aaf, the whole
> instance blocked while aaf was creating the index.  Did I remember that
> wrong?

No, that's correct. You can force a rebuild by calling
Model.rebuild_index from the console.

> If that is the way that it works, then, clearly, I need to start the
> rebuild from outside of the application, before any users can create new
> model objects.
>
> Further, are you saying that model creations during the rebuild won't
> block (I guess they realize that a rebuild is already happening and just
> return immediately)?

Unfortunately the DRb server doesn't realize this yet. As Ryan wrote, I
plan to rework the re-indexing stuff in the near future; most likely
there will then be some kind of index rotation and a queue remembering
model updates that occurred while a rebuild is going on.

> >> 5. I suspect we will have to disable_ferret(:always) on our utterance
> >> model, then update the index manually on some periodic basis (cron job,
> >> backgroundrb worker, etc.).  The reason for this is that we don't want
> >> to introduce any delay into the process of storing a new utterance,
> >> which occurs in realtime during a chat session.  Anyone have experience
> >> doing this?
> >
> > It's pretty fast.
> > The only time you'd see a slowdown is when you
> > encounter a lock in the DRb server.
> >
>
> And what would cause that?  Do normal model creates cause a lock?

Index updates are synchronized, as only one thread may write to the
index at a time.

If immediate indexing of new or updated records is not needed, I see no
problem in doing this later from cron or backgroundrb, based on some flag
or timestamp. Ferret *is* fast, but you also have to take the DRb round
trip time into account, so this really could make sense for a chat
application.

cheers,
Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[EMAIL PROTECTED] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
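
To make the "index later, based on some flag or timestamp" idea concrete,
here is a rough sketch in plain Ruby. It is only an illustration under
loud assumptions: Utterance, FakeIndex and index_pending are hypothetical
names made up for this example, the index is an in-memory hash rather than
a real Ferret index, and nothing here is acts_as_ferret API. The Mutex
mimics the "only one thread writing at a time" rule mentioned above.

```ruby
# Hypothetical stand-in for a model row; not an ActiveRecord class.
Utterance = Struct.new(:id, :text, :updated_at)

# Hypothetical in-memory stand-in for the index behind the DRb server.
class FakeIndex
  def initialize
    @docs = {}
    @lock = Mutex.new # serializes writers, one thread at a time
  end

  # Add or overwrite the document for this record.
  def add(record)
    @lock.synchronize { @docs[record.id] = record.text }
  end

  def size
    @lock.synchronize { @docs.size }
  end
end

# Index every record changed after last_run and return the new
# high-water mark, which the periodic job would store for its next run.
def index_pending(records, index, last_run)
  pending = records.select { |r| r.updated_at > last_run }
  pending.each { |r| index.add(r) }
  pending.map(&:updated_at).max || last_run
end
```

A cron job or backgroundrb worker would call something like
index_pending periodically and persist the returned timestamp, so that
saving an utterance during a chat session never has to wait for the
index.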

