Thanks for the tips, guys, got it working now.

Cheers,
Chris
On Sun, Nov 8, 2009 at 6:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> <<<I am using the StandardAnalyzer as most of the other fields being
> indexed are free form text.>>>
>
> If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend.
> The snippet above makes me wonder if you've seen this class...
>
> Best
> Erick
>
> On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iori...@yahoo.com> wrote:
>
> > > Hi,
> > >
> > > How do I go about indexing domain names? I currently index the
> > > domain, but it only works if I put the exact full domain in. For
> > > example:
> > >
> > > site:www.youtube.com (this works)
> > > site:youtube.com (this doesn't work)
> > >
> > > I am using the StandardAnalyzer as most of the other fields being
> > > indexed are free form text. Currently the "site" field is stored
> > > and tokenized.
> >
> > StandardTokenizer recognizes www.youtube.com and youtube.com each as a
> > single token, therefore they do not match. You can use SimpleAnalyzer,
> > which uses LetterTokenizer. Then:
> >
> > www.youtube.com will be broken into three tokens: www youtube com
> > youtube.com will be broken into two tokens: youtube com
> >
> > By doing so, the query site:youtube.com will bring back www.youtube.com.
> >
> > But the query site:youtube.com will also match a document like
> > www.foo.com/youtube.com.
> >
> > Note that LetterTokenizer uses the Character.isLetter() method to break
> > the text. If your input contains digits, like www.645cafe.com, that
> > will cause you problems.
> >
> > In your case it is better to extend CharTokenizer and override its
> > protected boolean isTokenChar(char c) method according to your needs.
> >
> > > As an additional improvement it would be even better if something
> > > like this worked:
> > >
> > > site:youtube.com/foo
> >
> > To accomplish this, you can pre-process your queries to strip everything
> > from the first '/' character to the end, i.e. convert
> > youtube.com/foo/bla/bla to youtube.com. You can do that with custom code
> > in a TokenFilter combined with KeywordTokenizer.
> >
> > Hope this helps.
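
For the archives, here is a rough sketch of the custom CharTokenizer idea
Ahmet describes, written against the Lucene 2.9-era API this thread assumes
(in 3.1+ isTokenChar takes an int). The class names DomainTokenizer and
DomainAnalyzer are just placeholders, not code from the thread:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Splits on anything that is not a letter or a digit, so
// "www.645cafe.com" becomes the tokens [www, 645cafe, com]
// instead of tripping LetterTokenizer up on the digits.
public class DomainTokenizer extends CharTokenizer {
    public DomainTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c);
    }
}

// Analyzer to use for the "site" field only.
class DomainAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new DomainTokenizer(reader));
    }
}

If your domains can contain hyphens (e.g. my-site.com) you may want
isTokenChar to treat '-' as a token character too, otherwise the hyphenated
part is split into separate tokens.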
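
And the PerFieldAnalyzerWrapper wiring Erick mentions, so the free-form text
fields keep StandardAnalyzer while only the "site" field gets the
domain-aware analyzer sketched above (again a sketch only; AnalyzerSetup is
a made-up class name):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerSetup {
    public static PerFieldAnalyzerWrapper buildAnalyzer() {
        // StandardAnalyzer stays the default for every field...
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
        // ...except "site", which gets the custom domain analyzer.
        analyzer.addAnalyzer("site", new DomainAnalyzer());
        return analyzer;
    }
}

The important part is to pass the same wrapper to both IndexWriter and
QueryParser so that indexing and query parsing tokenize the "site" field the
same way.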
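
For Ahmet's last point, stripping everything from the first '/' onward, one
way to do it with custom code is a small TokenFilter, which can then sit
behind KeywordTokenizer as he suggests. This is a sketch using the 2.9
attribute API, and StripPathFilter is a made-up name:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Truncates each token at the first '/', so
// "youtube.com/foo/bla/bla" comes out as "youtube.com".
public final class StripPathFilter extends TokenFilter {
    private final TermAttribute termAtt =
        (TermAttribute) addAttribute(TermAttribute.class);

    public StripPathFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        int slash = term.indexOf('/');
        if (slash >= 0) {
            termAtt.setTermBuffer(term.substring(0, slash));
        }
        return true;
    }
}

A chain along the lines of
new StripPathFilter(new LowerCaseFilter(new KeywordTokenizer(reader)))
would keep the whole domain as one keyword token minus the path; if you use
the DomainTokenizer approach instead, '/' is already a split character and
the path segments simply become extra tokens.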