Re: preserve special characters
Hi Jack,

That seems like the solution I am looking for. Thanks so much!

(I couldn't find this "types" attribute for WDF anywhere.)

Ming-

On Tue, Jun 18, 2013 at 4:52 PM, Jack Krupansky wrote:
> The WDF has a "types" attribute which can specify one or more character
> type mapping files. You could create a file like:
>
>   @ => ALPHA
>   _ => ALPHA
>
> For example (from the book!):
>
> Example - Treat at-sign and underscores as text
>
>   [...] positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   [...] types="at-under-alpha.txt"/>
>
> The file at-under-alpha.txt would contain:
>
>   @ => ALPHA
>   _ => ALPHA
>
> The analysis results:
>
>   Source: Hello @World_bar, r@end.
>   Tokens: 1: Hello 2: @World_bar 3: r@end
>
> -- Jack Krupansky
>
> -----Original Message----- From: Mingfeng Yang
> Sent: Tuesday, June 18, 2013 6:58 PM
> To: solr-user@lucene.apache.org
> Subject: preserve special characters
>
> [...]
Re: preserve special characters
The WDF has a "types" attribute which can specify one or more character type mapping files. You could create a file like:

  @ => ALPHA
  _ => ALPHA

For example (from the book!):

Example - Treat at-sign and underscores as text

The file at-under-alpha.txt would contain:

  @ => ALPHA
  _ => ALPHA

The analysis results:

  Source: Hello @World_bar, r@end.
  Tokens: 1: Hello 2: @World_bar 3: r@end

-- Jack Krupansky

-----Original Message----- From: Mingfeng Yang
Sent: Tuesday, June 18, 2013 6:58 PM
To: solr-user@lucene.apache.org
Subject: preserve special characters

We need to index and search lots of tweets, which can look like "@solr: solr is great" or "@solr_lucene, good combination".

And we want to search with "@solr" or "@solr_lucene". How can we preserve "@" and "_" in the index?

If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene will be broken down into "solr" and "lucene", which makes the search results contain lots of non-relevant docs.

If using StandardTokenizer, the "@" symbol is stripped.

Thanks,
Ming-
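The XML for the field type did not survive in this message. As a rough sketch only, a field type along the lines described might look like the following. The positionIncrementGap, autoGeneratePhraseQueries and types values are the ones that appear in this thread; the field type name text_at_under and the choice of whitespace tokenizer are assumptions made for illustration, not the original example from the book.

```xml
<!-- Hypothetical reconstruction: the name "text_at_under" and the tokenizer
     are assumptions; only the three attribute values noted in the lead-in
     come from the thread. -->
<fieldType name="text_at_under" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- Split on whitespace only, so "@" and "_" reach the filter intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- at-under-alpha.txt maps "@" and "_" to ALPHA so the filter
         no longer treats them as delimiters -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="at-under-alpha.txt"/>
  </analyzer>
</fieldType>
```

The mapping file would sit in the same conf directory as the schema. Re-indexing is needed after changing the analysis chain, since existing documents keep their old tokens.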
Re: preserve special characters
You can use the keyword tokenizer.

Creates org.apache.lucene.analysis.core.KeywordTokenizer. Treats the entire field as a single token, regardless of its content.

Example:

  "http://example.com/I-am+example?Text=-Hello" ==> "http://example.com/I-am+example?Text=-Hello"
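For reference, a minimal sketch of a field type that uses this tokenizer through its Solr factory. The field type name text_keyword is hypothetical.

```xml
<!-- Hypothetical field type name; the whole field value becomes one token -->
<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this analyzer a tweet such as "@solr: solr is great" is indexed as one token containing the full text, so it suits exact-value fields more than searching for individual words inside a tweet.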
preserve special characters
We need to index and search lots of tweets, which can look like "@solr: solr is great" or "@solr_lucene, good combination".

And we want to search with "@solr" or "@solr_lucene". How can we preserve "@" and "_" in the index?

If using WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene will be broken down into "solr" and "lucene", which makes the search results contain lots of non-relevant docs.

If using StandardTokenizer, the "@" symbol is stripped.

Thanks,
Ming-