Hi Mark,

Thanks for the explanation - makes sense!
Re ES - yes. But I pasted your Q in http://blog.sematext.com/2012/09/04/solr-vs-elasticsearch-part-2-data-handling/comments, too, so you should get a more thorough answer there soon.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Thu, Nov 1, 2012 at 3:32 PM, Mark Bennett <mbenn...@ideaeng.com> wrote:

> Hi Otis,
>
> Forgive my vagueness, it's an NDA thing.
>
> Generally speaking you might want to do record matching based on a number
> of fields. But since text fields are input by humans, they can be a bit
> inconsistent about how values are entered.
>
> One answer is to remove things like stop words, abbreviations,
> punctuation, etc. to normalize the fields a bit. But then you might want
> to do some fuzzy matching with things like Levenshtein or double metaphone,
> etc., and treat the entire field as one "unit".
>
> I realize you could still get much of this by then using traditional
> search, but in the app we're porting the business rules are quite specific
> and need to support legacy accounts. And clearly the combination of
> normalization and fuzzy matching is potentially quite "lossy", but here
> again the business logic has other mitigators for that.
>
> Let me ask you a question back. We really appreciate your ongoing series
> on Solr vs. ElasticSearch (I haven't dived into ES yet). Looking at your
> section on indexing (
> http://blog.sematext.com/2012/09/04/solr-vs-elasticsearch-part-2-data-handling/),
> can ES be as precise and flexible about creating highly customized tokens?
> When I initially heard about schema-less and read their use cases, I had
> the impression that ES was more for mainstream use-cases, but your review
> got me thinking maybe there's a lot more there?
>
> Mark
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Thu, Nov 1, 2012 at 12:01 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi Mark,
>>
>> Out of curiosity, what was your use case?
>>
>> Thanks,
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>> Performance Monitoring - http://sematext.com/spm/index.html
>>
>>
>>
>> On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbenn...@ideaeng.com> wrote:
>>
>>> This filter lets you "glue" tokens back together. This has been
>>> discussed and posted on the list before, but this updated version uses all
>>> the preferred 4.x classes.
>>>
>>> Normally you wouldn't want to stick tokens back together, but if you've
>>> found this post, you probably have some atypical need for it (as I did).
>>> As an example you could:
>>> * Let the tokenizer break up text on whitespace
>>> * Then lowercase
>>> * Then remove stop words
>>> * ***Then concatenate all the words back together into one string***
>>>
>>> You'll need:
>>> * ConcatFilter.java (for Lucene, below)
>>> * ConcatFilterFactory.java (for Solr, below)
>>> * entry in your schema
>>>
>>> schema.xml entry
>>> ----------
>>> ...
>>> <fieldType ...>
>>>   <analyzer>
>>>     ...
>>>     <filter class="solr.ConcatFilterFactory" />
>>>     ...
>>>   </analyzer>
>>> </fieldType>
>>> ...
>>>
>>> ConcatFilter.java
>>> -----------------
>>> package org.apache.lucene.analysis;
>>>
>>> import java.io.IOException;
>>> import org.apache.lucene.analysis.TokenFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>
>>> // Collapses every token from the upstream stream into a single token.
>>> public class ConcatFilter extends TokenFilter {
>>>
>>>   protected CharTermAttribute charTermAttr;
>>>
>>>   public ConcatFilter(TokenStream input) {
>>>     super(input);
>>>     charTermAttr = addAttribute( CharTermAttribute.class );
>>>   }
>>>
>>>   @Override
>>>   public boolean incrementToken() throws IOException {
>>>     // Drain the upstream tokens, appending each term with no separator
>>>     StringBuilder buffer = new StringBuilder();
>>>     while( input.incrementToken() ) {
>>>       buffer.append( charTermAttr );
>>>     }
>>>     // We need to clear it either way
>>>     charTermAttr.setEmpty();
>>>     if ( buffer.length() > 0 ) {
>>>       // Emit the concatenated text as the one and only token
>>>       charTermAttr.append( buffer );
>>>       return true;
>>>     }
>>>     else {
>>>       return false;
>>>     }
>>>   }
>>> }
>>>
>>> ConcatFilterFactory.java
>>> ------------------------
>>> package org.apache.solr.analysis;
>>>
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.util.TokenFilterFactory;
>>>
>>> // Solr factory that wraps the incoming stream in a ConcatFilter.
>>> public class ConcatFilterFactory extends TokenFilterFactory {
>>>   @Override
>>>   public TokenStream create(TokenStream stream) {
>>>     return new ConcatFilter(stream);
>>>   }
>>> }
>>>
>>>
>>> --
>>> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
>>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>>>
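
For anyone who wants to see what the example chain in the original post (split on whitespace, lowercase, drop stop words, glue back together) does to a value without setting up Solr, here is a rough plain-Java sketch. It does not use the Lucene or Solr classes at all; the class name ConcatSketch, the method normalize, and the stop-word list are all made up for illustration, and a real analyzer chain may tokenize differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ConcatSketch {

    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Splits on whitespace, lowercases, drops stop words,
    // then appends the surviving words with no separator.
    public static String normalize(String text) {
        StringBuilder buffer = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                buffer.append(lower);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // prints "banknewyork"
        System.out.println(normalize("The Bank of  New York"));
    }
}
```

The single string that comes out is the kind of whole-field "unit" that could then be handed to a fuzzy comparison, as discussed earlier in the thread.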