Hi,

the tokens are matched as-is. It is only a match if the tokens are exactly the
same bytes. There are never any substring matches, just a simple comparison of
bytes.
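
For illustration, here is a minimal sketch (the field name and term values are
made up for the example), showing that a TermQuery only hits if the query term
has exactly the same bytes as an indexed token:

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ExactBytesMatch {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory(); // in-memory index, just for the demo
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
      Document doc = new Document();
      // WhitespaceAnalyzer does no lowercasing, so the indexed token is "Abcd"
      doc.add(new TextField("name", "Abcd", Field.Store.YES));
      w.addDocument(doc);
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // different bytes -> no match
      System.out.println(searcher.count(new TermQuery(new Term("name", "abcd")))); // 0
      // identical bytes -> match
      System.out.println(searcher.count(new TermQuery(new Term("name", "Abcd")))); // 1
    }
  }
}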

To get fuzzier matches, you have to do the text analysis right. This includes
splitting the text into tokens (Tokenizer), but also term "normalization"
(TokenFilters). One example is lowercasing (to allow case-insensitive
matching), but stemming might also be done, or conversion to phonetic codes
(to allow phonetic matches). The output tokens do not necessarily need to be
"human readable" anymore. How does this work with matching, since the user
won't enter phonetic codes? Tokenization and normalization are done on the
indexing side as well as on the query side. If both sides produce the same
tokens, it's a match, very simple.

With that information you should be able to think about good ways to analyze
the text for your use case. If you use Solr, the schema.xml is your friend. In
Lucene, look at the analysis module, which has examples for common languages.
If you want to build your own, use CustomAnalyzer to create your own
combination of tokenization and normalization (filtering of tokens).
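
Here is a minimal sketch of such a chain (the tokenizer/filter combination is
only an example, not a recommendation for your data), showing that the same
analyzer produces the same tokens no matter whether the text comes from a
document or from a query:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomAnalyzerSketch {
  public static void main(String[] args) throws Exception {
    // tokenize with StandardTokenizer, then normalize: lowercase + Porter stemming
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(StandardTokenizerFactory.class)
        .addTokenFilter(LowerCaseFilterFactory.class)
        .addTokenFilter(PorterStemFilterFactory.class)
        .build();

    // pass the same analyzer to IndexWriterConfig (index side) and to your
    // QueryParser (query side); both sides then produce identical tokens:
    printTokens(analyzer, "Running DOGS");  // [run] [dog]
    printTokens(analyzer, "running dog");   // [run] [dog]  -> a match
  }

  static void printTokens(Analyzer analyzer, String text) throws Exception {
    try (TokenStream ts = analyzer.tokenStream("name", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.print("[" + term + "] ");
      }
      ts.end();
      System.out.println();
    }
  }
}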

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Jacek Grzebyta [mailto:grzebyta....@gmail.com]
> Sent: Friday, June 9, 2017 1:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Penalize fact the searched term is within a word
> 
> Hi Ahmed,
> 
> That works! Still, I do not understand how that stuff works. I just know
> that the analyzer cuts the indexed text into tokens, but I do not know how
> the matching is done.
> 
> Can you recommend a good book to read? I prefer something with less maths
> and more examples. The only one I found is the free "An Introduction to
> Information Retrieval", but it has a lot of maths I do not understand.
> 
> Best regards,
> Jacek
> 
> 
> 
> On 8 June 2017 at 19:36, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:
> 
> > Hi,
> > You can completely ban within-a-word search by simply using
> > WhitespaceTokenizer, for example. By the way, it is all about how you
> > tokenize/analyze your text. Once you have decided, you can create two
> > versions of a single field using different analyzers. This allows you to
> > assign different weights to those fields at query time.
> > Ahmet
> >
> >
> > On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta <
> > grzebyta....@gmail.com> wrote:
> >
> >
> > Hi,
> >
> > Apologies for repeating a question from the IRC room, but I am not sure
> > if it is alive.
> >
> > I have no idea how Lucene works, but I need to modify a part of the rdf4j
> > project which depends on it.
> >
> > I need to use Lucene to create a mapping file based on text searching,
> > and I found the following problem. Take a term 'abcd' which is mapped to
> > node 'abcd-2', whereas node 'abcd' exists. The issue is that Lucene
> > searches for the term, finds it in both nodes 'abcd' and 'abcd-2', and
> > gives the same score. My question is: how can I modify the scoring to
> > penalise the fact that the searched term is part of a longer word, and
> > give a higher score if it is a word by itself?
> >
> > Visually it looks like this:
> >
> > node 'abcd':
> >   - name: abcd
> >
> > total score = LS /lucene score/ * 2.0 /name weight/
> >
> >
> >
> > node 'abcd-2':
> >   - name: abcd-2
> >   - alias1: abcd-h
> >   - alias2: abcd-k9
> >
> > total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1 /alias2 score/
> >
> > I gave the properties different weights. "Name" has the highest weight,
> > but "alias" has some small weight as well. In total, the score for a node
> > is the sum of all partial scores * weights. As a result, 'abcd-2' gets a
> > higher score than 'abcd'.
> >
> > thanks,
> > Jacek
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
