If you don't know which tokens you'll face, it's a much harder problem. If you know where the token is, e.g. it's always in http://some.example.site/a/b/<here will be the token to break>/index.html, that eases the task a bit. Otherwise you'll need to examine every single token produced. I can think of a couple of ways to break "aboutus" into "about us", or any other sequence for that matter:

1) Break it into "a boutus", "ab outus" ... "about us", "aboutu s", and index all of them at the same position. Expensive though. I'd recommend this only if you know where the token is located (otherwise it will explode your term dictionary).

2) Use a dictionary (a real dictionary), and look up every leading substring in it, e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there. This needs some fine tuning, like checking whether the rest is also a word and whether the full string is itself a word, so that you don't break up meaningful words. You'll need to get a dictionary for that.

The key though: do you know exactly where this token is? Otherwise, every solution will be a killer to performance.

Shai

On Tue, Aug 4, 2009 at 12:59 PM, m.harig <m.ha...@gmail.com> wrote:

> Thanks,
>
> I've noticed that, but the code is for known tokens. How do I do it for
> dynamic tokens? Meaning, I don't know the URLs; someone picked up the URLs
> and I'll index them. Is there any technique to use while indexing? I am
> using Lucene 2.4.0. Please suggest.
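
A rough sketch of the two options above, in plain Java, in case it helps to see them side by side. It deliberately uses no Lucene API: the class and method names (TokenSplitter, allSplits, dictionarySplit) and the four-word dictionary are made up for illustration, and hooking this into your analysis chain is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
public class TokenSplitter {

    // Option 1: every two-way split of the token ("a boutus", "ab outus", ...).
    // All of these would be indexed at the same position, which is why it is expensive.
    public static List<String> allSplits(String token) {
        List<String> splits = new ArrayList<String>();
        for (int i = 1; i < token.length(); i++) {
            splits.add(token.substring(0, i) + " " + token.substring(i));
        }
        return splits;
    }

    // Option 2: split at the first leading substring that is a dictionary word,
    // but only if the rest is also a word and the full token is not itself a word.
    public static List<String> dictionarySplit(String token, Set<String> dictionary) {
        if (dictionary.contains(token)) {
            // The full string is a meaningful word: leave it alone.
            return Collections.singletonList(token);
        }
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return Arrays.asList(head, tail);
            }
        }
        // No clean split found: keep the token as it is.
        return Collections.singletonList(token);
    }

    public static void main(String[] args) {
        // Toy dictionary (an assumption); a real one would be loaded from a word list.
        Set<String> dictionary = new HashSet<String>(Arrays.asList("a", "about", "us", "index"));
        System.out.println(allSplits("aboutus"));
        System.out.println(dictionarySplit("aboutus", dictionary));
        System.out.println(dictionarySplit("index", dictionary));
    }
}
```

Note that dictionarySplit stops at the first position where both halves are words and only tries a single split point; tokens made of three or more words, or with several plausible splits, would need more of the fine tuning mentioned in option 2.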