Re: Multiple Words in String
Short form: I think you're going down a rabbit-hole and should just use synonyms and forget about it. I'm particularly thinking that a general-purpose solution that somehow breaks up or combines adjacent tokens will have consequences that pop out in other places where you don't want them, and you'll have to fix *that*. I can't think of a way to do this that wouldn't run that danger.

Long form, think of it as a sermon, it's Sunday after all.

This is the point, in my experience, where you have to ask your business people "what's it worth to you?" You can handle any cases that come up similar to the examples you've shown by adding them to your synonyms file, compressing each pair into its joined form (as a synonym), and be done with it. This is a very straightforward approach that has predictable consequences.

Or you can mess around, possibly for quite some time, trying to find a general-purpose solution that will almost inevitably lead to unanticipated behavior that you'll then spend lots of time trying to chase down, time you could have spent putting in features that your users will actually notice.

Here's a test. Ask your business people to create a list of all the pairs they want to see treated like this. If their response is any variant of "we don't have time to do that", then even *they* must not think it's very important. And if they do produce the list, put it in your synonyms file and be a hero.

Evil thoughts aside, I'm dead serious. This is the kind of rabbit-hole that development efforts go down that, in all probability, adds almost zero *value* to the product. There's a way to handle 95% of the cases that's very easy to implement. It's already there in Solr.

Historically, we in the programming field have done a very poor job of making it clear to the business folks that every such request has not only an implementation cost (and we all too often don't include debugging/maintenance in that cost) but also an opportunity cost. We owe it to the business folks *and ourselves* to clearly explain the cost to them and let them make the decision whether it's worth it. A decision based on information.

And understand that I'm not knocking the business folks here. We haven't given them the consequences to weigh, so how can we fault their decisions?

OK, sermon over. I've just too often said "yes, we can do that" without thinking to add "and it'll cost 3 weeks of development effort". Eventually I figured out that adding the estimate, and letting the business folks know what I wouldn't be able to get to because of that time spent, led to "Oh, never mind".

Best
Erick

P.S. OK, it's late Sunday night and I feel like writing long, involved responses that aren't entirely on-topic.
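For readers following along, the synonym approach Erick describes might look roughly like the fragment below. This is only a sketch under assumptions: the attributes come from the schema quoted in the original post, but the class name solr.SynonymFilterFactory is inferred (the tags were lost in the archive), and the synonym entries shown in the comment are hypothetical examples built from the words mentioned in this thread.

```xml
<!-- Index-time analyzer fragment. The class name is an assumption; the
     attributes are the ones quoted in the original post. -->
<filter class="solr.SynonymFilterFactory"
        synonyms="syn/index_synonyms.txt"
        ignoreCase="true"
        expand="true"/>

<!-- Hypothetical entries for syn/index_synonyms.txt (one group per line).
     With expand="true", each form in a group is indexed alongside the others:

     microsoft, micro soft
     carwash, car wash
     bookkeeper, book keeper
-->
```

Because the expansion happens at index time, documents would need to be reindexed after the file changes. How multi-word entries behave also depends on the tokenizer and on the filters that run before the synonym filter, so it is worth checking the result in the admin analysis tool.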
Re: Multiple Words in String
It's not only a specific case (e.g. microsoft.com); it's really a multi-word issue: carwash, bookkeeper, etc.

I'm ultimately looking for a schema for search and retrieval that's heavily focused on names: people's names, business names, etc. This isn't content like large text fields or web sites, but business data that I'm very successfully receiving using dataimport handlers. It's these special cases that are really tripping me up. My business folks keep coming up with them!

Chris Fauerbach
chrisfauerb...@gmail.com
Re: Multiple Words in String
Is this a general question or a specific one? You can handle specific cases by using synonyms.

But the general case, that is, treating any pair of adjacent tokens as a single token, seems fraught with unintended consequences. Then again, you know your problem space better than I do.

Best
Erick

On Sat, Apr 2, 2011 at 2:21 PM, Chris Fauerbach wrote:
> Good afternoon everyone!
> I am stumped, and I would love some help. I'm new to solr/lucene,
> but I have thrown myself into it, so I think I have a solid
> understanding. Using the analysis tool in the admin interface, I see
> these words stemmed and processed as I assume they would be, so I'm
> stuck.
>
> In my index, I have two documents, each with a text field, and here
> are example values
>
> 1) microsoft.com
> 2) micro soft
>
> I want to do a search using microsoft or "micro soft" and find both.
> I'm using the dismax interface, the fields are properly listed in the
> config, and I can find both records, but never at the same time.
> Here's my schema.xml for my text field, any thoughts on what I can do
> to find these together?
>
> positionIncrementGap="100">
>
> words="stopwords.txt" enablePositionIncrements="true"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
> synonyms="syn/index_synonyms.txt" ignoreCase="true" expand="true"/>
> minGramSize="2"
> maxGramSize="15" side="front"/>
> minGramSize="2"
> maxGramSize="15" side="back"/>
> language="English" protected="protwords.txt"/>
>
> minGramSize="2"
> maxGramSize="15" side="front"/>
> minGramSize="2"
> maxGramSize="15" side="back"/>
> words="stopwords.txt" enablePositionIncrements="true"/>
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
> language="English" protected="protwords.txt"/>
Re: Multiple Words in String
I managed to find both documents with your two input queries. Add a PositionFilterFactory filter to the query part of your analyzer.

The main problem is that your query "microsoft" is transformed into a single PhraseQuery, which cannot match the document containing "micro soft". The PositionFilterFactory will transform the query into multiple queries. You can activate debug mode to see the differences.

You can find more information here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory

Ludovic.

-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Words-in-String-tp2767964p2770713.html
Sent from the Solr - User mailing list archive at Nabble.com.
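The filter declaration itself did not survive in the archived message, so the fragment below is a reconstruction rather than Ludovic's exact text. It sketches where solr.PositionFilterFactory could sit in the query analyzer. The tokenizer shown is an assumption (the thread never names one), and the existing query-time filters are left elided.

```xml
<!-- Sketch of the query-time analyzer only. The tokenizer is an assumption,
     not something stated in the thread. -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- ... the existing query-time filters stay here, unchanged ... -->
  <!-- Placed last: every token after the first gets a position increment of 0,
       so the tokens share one position and the query parser builds optional
       term clauses instead of a single PhraseQuery. -->
  <filter class="solr.PositionFilterFactory"/>
</analyzer>
```

Running the same query with debugQuery=on before and after the change should show the parsed query switching from a phrase query to a set of term clauses. This applies to Solr releases from the era of this thread (2011). Later releases deprecated and then removed this filter, so check the documentation for your version.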