I think you will have to write a custom analyzer and tokenizer to produce the 
tokens you need and you will have to arrange for whatever code you are using to 
create your query to use that analyzer in the correct circumstances. I don't 
think I've seen anyone post about this particular use case before, so I'd be 
surprised if there is much other information about it on the lists.

I haven't read Lucene in Action, so I don't know if it is covered there or not. 
If Erik has any information on indexing Asian languages, there might be some 
background there on using overlapping n-grams. Searching the list archive may 
give you some background if Lucene in Action doesn't have enough info on this 
topic.

-----Original Message-----
From: Rob Young [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 11, 2006 11:39 AM
To: java-user@lucene.apache.org
Subject: Re: Searching across spaces

That sounds like just what I'm looking for. Do you know if this is covered in 
Lucene in Action or where I can find more information about it.

Eric Isakson wrote:

>You might consider using overlapping bi-gram tokenization with stripped out 
>whitespace and a PhraseQuery.
>
>So your tokenized content, "spongebob squarepants", would look like:
>
>sp po on ng ge eb bo ob bs sq qu ua ar re ep pa an nt ts
>
>and your tokens for your query, "sponge bob", would look like
>
>sp po on ng ge eb bo ob
>
>Add each token to the PhraseQuery and you should match.
>
>This is very similar to the techniques used for searching in Asian languages 
>which do not seperate words with spaces. There are probably some side effects 
>for compound words that you didn't mean to do this too, but without knowing 
>the exact domain of compound words that you wish to support, this is probably 
>the best you will be able to do.
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to