ANNOUNCE: Stump The Chump @ Lucene Revolution EU - Tomorrow

2013-11-05 Thread Chris Hostetter
(Note: cross-posted announcement; please confine any replies to solr-user.) Hey folks, on Wednesday I'll be doing a Stump The Chump session at Lucene Revolution EU in Dublin, Ireland. http://lucenerevolution.org/stump-the-chump If you aren't familiar with Stump The Chump, it is a Q&A-style …

Re: Twitter analyser

2013-11-05 Thread Erick Erickson
If your universe of items you want to match this way is small, consider something akin to synonyms: your indexing process emits two tokens, with and without the @ or #, which should cover your situation. FWIW, Erick On Tue, Nov 5, 2013 at 2:40 AM, Stéphane Nicoll stephane.nic...@gmail.com wrote: …
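The two-token trick described above can be sketched in plain Python (an illustrative sketch only, not actual Lucene analysis code; the function name `index_tokens` is made up for this example):

```python
def index_tokens(text):
    """Whitespace-split the text; for tokens starting with @ or #,
    emit both the original form and the bare form, synonym-style,
    so a search for either variant matches."""
    tokens = []
    for tok in text.split():
        tok = tok.lower()
        tokens.append(tok)
        if tok.startswith(("@", "#")):
            tokens.append(tok[1:])  # bare form, without the marker
    return tokens

print(index_tokens("Great talk by @lucene about #solr"))
# → ['great', 'talk', 'by', '@lucene', 'lucene', 'about', '#solr', 'solr']
```

In Solr terms this would correspond to an analysis chain that keeps the original token and also emits a stripped variant at the same position, so both `@lucene` and `lucene` queries hit the document.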

Re: Twitter analyser

2013-11-05 Thread Stephane Nicoll
Hi, Thanks for the reply. It's an index of tweets, so any word really is a target for this. This would mean a significant increase in the index size. My volumes are really small, so that shouldn't be a problem (but performance/scalability is a concern). I have control over the query. Another …

Re: Twitter analyser

2013-11-05 Thread Erick Erickson
You have to get the values _into_ the index with the special characters; that's where the issue is. Depending on your analysis chain, special characters may or may not even be in your index to search in the first place. So it's not how many different words are after the special characters as much …

Re: Twitter analyser

2013-11-05 Thread Jack Krupansky
You can specify custom character types with the word delimiter filter, so you could define @ and # as digit and set SPLIT_ON_NUMERICS. This would cause @foo to tokenize as two adjacent terms, ditto for #foo. Unfortunately, a user name or tag that starts with a digit would not tokenize as …
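The splitting behavior described above, including the caveat about names that start with a digit, can be simulated in a short Python sketch (this is not the actual WordDelimiterFilter implementation; `char_type` and `word_delimiter_split` are invented names that mimic mapping @ and # to the DIGIT class and splitting on class changes):

```python
def char_type(c):
    # Per the suggestion: treat @ and # as DIGIT-class characters.
    if c in "@#" or c.isdigit():
        return "digit"
    return "alpha"

def word_delimiter_split(token):
    """Split a token wherever the character class changes,
    roughly mimicking SPLIT_ON_NUMERICS with @/# mapped to DIGIT."""
    if not token:
        return []
    parts, current = [], token[0]
    for c in token[1:]:
        if char_type(c) == char_type(current[-1]):
            current += c
        else:
            parts.append(current)
            current = c
    parts.append(current)
    return parts

print(word_delimiter_split("@foo"))    # → ['@', 'foo']
print(word_delimiter_split("#2cool"))  # → ['#2', 'cool']
```

The second example shows the caveat from the message: because the leading digit is in the same class as the marker, `#2cool` splits as `#2` + `cool` rather than `#` + `2cool`.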

Corrupt Index with IndexWriter.addIndexes(IndexReader readers[])

2013-11-05 Thread Gili Nachum
Hello, I got an index corruption in production and was wondering if it might be a known bug (still with Lucene 3.1), or whether my code is doing something wrong. It's a local-disk index. No known machine power loss. Not supposed to even happen, right? This index that got corrupted is updated every 30 sec; …

Modify the StandardTokenizerFactory to concatenate all words

2013-11-05 Thread Kevin
Currently I'm using StandardTokenizerFactory, which tokenizes the words based on spaces. For Toy Story it will create tokens toy and story. Ideally, I would want to extend the functionality of StandardTokenizerFactory to create tokens toy, story, and toy story. How do I do that?
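The desired token stream (individual words plus the whole phrase) can be sketched in Python; this is only an illustration of the target output, not Lucene code, and `tokenize_with_phrase` is a made-up name. In Solr, a similar effect is commonly achieved with a shingle filter or by copying the field to a keyword-tokenized variant:

```python
def tokenize_with_phrase(text):
    """Emit each lowercase word token plus the whole phrase
    as one additional token."""
    words = text.lower().split()
    tokens = list(words)
    if len(words) > 1:
        tokens.append(" ".join(words))  # the full phrase as one token
    return tokens

print(tokenize_with_phrase("Toy Story"))
# → ['toy', 'story', 'toy story']
```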

Re: Modify the StandardTokenizerFactory to concatenate all words

2013-11-05 Thread Benson Margulies
How would you expect to recognize that 'Toy Story' is a thing? On Tue, Nov 5, 2013 at 6:32 PM, Kevin glidekensing...@gmail.com wrote: Currently I'm using StandardTokenizerFactory, which tokenizes the words based on spaces. For Toy Story it will create tokens toy and story. Ideally, I would …