If you don't know in advance which tokens you'll face, this is a much harder
problem. If you know where the token is, e.g. it's always in
http://some.example.site/a/b/<here will be the token to break>/index.html,
that eases the task a bit. Otherwise you'll need to examine every single
token produced. I can think of several ways to break "aboutus" into "about
us", or any other sequence for that matter:

1) Break it into every possible two-part split: "a boutus", "ab outus" ...
"about us", "aboutu s", and index all of them at the same position. This is
expensive, so I'd recommend it only if you know where the token is located
(otherwise it will explode your term dictionary).

2) Use a dictionary (a real word list) and look up every prefix of the
token, e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it
there. This needs some fine tuning, like checking that the rest is also a
word and whether the full string is itself a word, so that you don't break
up meaningful words. You'll need to obtain a word list for that.
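
To make the two options concrete, here is a rough sketch in plain Java. It
is not Lucene code: the class and method names (TokenSplitter, allSplits,
dictionarySplit) are made up for illustration, and the dictionary is assumed
to be a Set<String> of lowercase words that you load yourself. In Lucene you
would wrap something like this in your own TokenFilter and, for option 1,
emit the extra terms with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class TokenSplitter {

    // Option 1: every two-part split of the token.
    // A token of length n yields n - 1 pairs, hence the index blow-up.
    static List<String[]> allSplits(String token) {
        List<String[]> splits = new ArrayList<String[]>();
        for (int i = 1; i < token.length(); i++) {
            splits.add(new String[] { token.substring(0, i), token.substring(i) });
        }
        return splits;
    }

    // Option 2: split only where both halves are dictionary words.
    // If the whole token is already a word, it is left intact.
    // Returns the first (shortest-prefix) split found, or the token itself.
    static List<String> dictionarySplit(String token, Set<String> dictionary) {
        List<String> parts = new ArrayList<String>();
        if (dictionary.contains(token)) {
            parts.add(token);
            return parts;
        }
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                parts.add(head);
                parts.add(tail);
                return parts;
            }
        }
        parts.add(token); // no acceptable split found
        return parts;
    }
}
```

With a dictionary containing "a", "about" and "us", dictionarySplit("aboutus",
dictionary) gives "about" and "us": the prefix "a" matches first, but its
remainder "boutus" is not a word, so that split is rejected. A real version
would probably want to handle more than two parts and pick between competing
splits.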

The key question though: do you know exactly where this token is? If not,
every solution will be a performance killer.
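
If the token does sit at a fixed place in the URL, pulling it out before any
splitting could look roughly like this. Again plain Java, with a made-up
class name (UrlTokenLocator) and the assumption that "position" means the
0-based index of a path segment:

```java
import java.net.URI;

// Hypothetical helper, not part of Lucene.
final class UrlTokenLocator {

    // Returns the path segment at the given 0-based index,
    // or null if the URL has no such segment.
    static String pathSegment(String url, int index) {
        if (index < 0) {
            return null;
        }
        String path = URI.create(url).getPath();
        if (path == null) {
            return null; // opaque URI such as mailto:, no path to inspect
        }
        String[] segments = path.split("/");
        // The path starts with "/", so segments[0] is empty; shift by one.
        int i = index + 1;
        if (i < segments.length && segments[i].length() > 0) {
            return segments[i];
        }
        return null;
    }
}
```

For http://some.example.site/a/b/aboutus/index.html and index 2 this returns
"aboutus", so only that one token has to go through the expensive splitting.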

Shai

On Tue, Aug 4, 2009 at 12:59 PM, m.harig <m.ha...@gmail.com> wrote:

>
> Thanks,
>
> I've noticed that, but the code is for known tokens. How do I do it for
> dynamic tokens? I mean, I don't know the URLs in advance; someone picks
> the URLs and I index them. Is there any technique to use while indexing?
> I am using Lucene 2.4.0. Please advise.
> --
> View this message in context:
> http://www.nabble.com/Searching-doubt-tp24802552p24805609.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
