Re: Which token filter can combine 2 terms into 1?

Jack Krupansky Wed, 26 Dec 2012 16:08:23 -0800

Ah! You're quoting full phrases. You weren't clear about that originally. Thanks for the clarification.


-- Jack Krupansky

-----Original Message----- From: Tom

Sent: Wednesday, December 26, 2012 5:54 PM
To: [email protected]
Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky <[email protected]> wrote:

You still have the query parser's parsing before analysis to deal with, no
matter what magic you code in your analyzer.


Not quite.
"query parser's parsing" comes first, you are correct on that. But it is
irrelevant for splitting field values into search terms, because this part
of the whole process is done by an analyzer. Therefore, if you make sure
the correct analyzer is used, then the parsing and splitting into
individual search terms will be done by this analyzer, not by the query
parser.

Try it: Implement an analyzer with the SnippetFilter below. Start Luke and
make sure this analyzer is selected in "Analyzer to use for query parsing".
In the search expression, type in any length of text for example:

body:"word1 word2 word3"

and you will get the possibly combined Terms.

For example, let's say one snippet in your SnippetFilter is "word2 word3".
Then you will get

Term 0: field=body text=word1
Term 1: field=body text=word2 word3

In this case, word2 and word3 will NOT be split.

-- Jack Krupansky

-----Original Message----- From: Tom
Sent: Friday, December 21, 2012 2:24 PM
To: [email protected]

Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky <[email protected]> wrote:

 And to be more specific, most query parsers will have already separated
the terms and will call the analyzer with only one term at a time, so no
term recombination is possible for those parsed terms, at query time.

Most analyzers will do that, yes. But if Xi writes his own analyzer with
his own combiner filter, then he should also use this for query generation
and thus get the desired combinations / snippets there as well.

Xi, here is the recipe:
- SnippetFilter extends TokenFilter
- SnippetFilter needs access to your lexicon: a data structure to store
your snippets. In the general case this is a tree, and going along a branch
will tell you whenever a valid snippet has been built or if the snippet
could be longer. (Example: "internal revenue" can be one snippet but,
depending on the next token, a larger snippet of "internal revenue service"
could be built.)
- Logic of the SnippetFilter.incrementToken() goes something like this: You
need a loop which retrieves tokens from the input variable until the input
is empty. You store each retrieved token in a variable(s) x in
SnippetFilter. As long as you have a potential match against your lexicon,
you can continue in this loop. Once you realize that there is something
within x which cannot possibly become a (longer) snippet, break out of the
loop and allow the consumer to retrieve it.
- make sure your analyzer inserts SnippetFilter at the correct spot in the
filter chain.
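
The recipe above can be sketched in plain Java, independent of Lucene. This
is only an illustration under assumptions: the class name SnippetCombiner and
its methods addSnippet and combine are hypothetical, the lexicon is a simple
token tree, and the input is a ready-made list rather than a TokenStream. A
real TokenFilter would instead pull tokens one at a time inside
incrementToken() and buffer the ones it has read ahead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the lexicon tree and the combining loop (no Lucene types). */
public class SnippetCombiner {

    /** One node of the lexicon tree; each edge is labeled with a token. */
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean endsSnippet; // true if the path from the root to here is a complete snippet
    }

    private final Node root = new Node();

    /** Registers a snippet given as whitespace-separated tokens, e.g. "internal revenue". */
    public void addSnippet(String snippet) {
        Node node = root;
        for (String token : snippet.trim().split("\\s+")) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        node.endsSnippet = true;
    }

    /** Emits the longest snippet starting at each position, or the single token if none matches. */
    public List<String> combine(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int matchEnd = -1; // exclusive end index of the longest complete snippet so far
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j));
                if (node == null) {
                    break; // no snippet can continue along this branch
                }
                if (node.endsSnippet) {
                    matchEnd = j + 1;
                }
            }
            if (matchEnd < 0) {
                out.add(tokens.get(i)); // pass the token through unchanged
                i++;
            } else {
                out.add(String.join(" ", tokens.subList(i, matchEnd)));
                i = matchEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SnippetCombiner combiner = new SnippetCombiner();
        combiner.addSnippet("internal revenue");
        combiner.addSnippet("internal revenue service");
        // prints [the, internal revenue service, building]
        System.out.println(combiner.combine(
                Arrays.asList("the", "internal", "revenue", "service", "building")));
    }
}
```

The sketch prefers the longest match, so "internal revenue service" wins over
"internal revenue" when both are in the lexicon. Whether that is the right
policy depends on the application.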

Cheers
FiveMileTom

-- Jack Krupansky
-----Original Message----- From: Erick Erickson
Sent: Friday, December 21, 2012 8:27 AM
To: java-user
Subject: Re: Which token filter can combine 2 terms into 1?

If it's a fixed list and not excessively long, would synonyms work?

But if there's some kind of logic you need to apply, I don't think you're
going to find anything OOB.
The problem is that by the time a token filter gets called, the terms are
already split up, so you'll probably have to write a custom filter that
manages that logic.

Best
Erick

On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen <[email protected]> wrote:

Unfortunately, no... I am not combining every two terms into one. I am
combining a specific pair.

E.g. the Token Stream: t1 t2 t2a t3
should be rewritten into t1 t2t2a t3

But the TS: t1 t2 t3 t2a
should not be rewritten, and it is already correct
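
The rule described here (join one specific pair only when its two tokens are
directly adjacent) could look roughly like the following plain-Java sketch.
The class PairMerger and the method mergePair are hypothetical names used for
illustration. It works on a list of strings rather than a Lucene TokenStream,
so it shows only the matching logic, not a working TokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PairMerger {

    /** Joins first and second into one token only where second directly follows first. */
    public static List<String> mergePair(List<String> tokens, String first, String second) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean pairHere = i + 1 < tokens.size()
                    && tokens.get(i).equals(first)
                    && tokens.get(i + 1).equals(second);
            if (pairHere) {
                out.add(first + second); // emit the combined token
                i += 2;                  // skip both halves of the pair
            } else {
                out.add(tokens.get(i));  // pass the token through unchanged
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [t1, t2t2a, t3]
        System.out.println(mergePair(Arrays.asList("t1", "t2", "t2a", "t3"), "t2", "t2a"));
        // prints [t1, t2, t3, t2a]
        System.out.println(mergePair(Arrays.asList("t1", "t2", "t3", "t2a"), "t2", "t2a"));
    }
}
```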


On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <alan.woodward@romseysoftware.co.uk> wrote:

> Have a look at ShingleFilter:
>
> http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
>

>
> On 21 Dec 2012, at 08:42, Xi Shen wrote:
>

> > I have to use the white space and word delimiter to process the input
> > first. I tried many combinations, and it seems to me that it is inevitable
> > the term will be split into two :(
> >
> > I think developing my own filter is the only solution... but I just cannot
> > find a guide to help me understand what I need to do to implement a
> > TokenFilter.
> >
> >
> > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN <[email protected]> wrote:
> >
> >> Easiest way would be to pre-process your input and join those 2 tokens
> >> before splitting them by white space.
> >>
> >> But from given context I might miss some details... still worth a shot.
> >>
> >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am looking for a token filter that can combine 2 terms into 1?
> >>> E.g.
> >>>
> >>> the input has been tokenized by white space:
> >>>
> >>> t1 t2 t2a t3
> >>>
> >>> I want a filter that output:
> >>>
> >>> t1 t2t2a t3
> >>>

> >>> I know it is a very special case, and I am thinking about developing a
> >>> filter of my own. But I cannot figure out which API I should use to look
> >>> for terms in a Token Stream.
> >>>
> >>> --
> >>> Regards，
> >>> David Shen
> >>>
> >>> http://about.me/davidshen
> >>> https://twitter.com/#!/davidshen84
> >>>
> >>
> >
> >
> >
> > --
> > Regards，
> > David Shen
> >
> > http://about.me/davidshen
> > https://twitter.com/#!/davidshen84

>
>


--
Regards，
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


