Re: Implement a relaxed PhraseQuery?

climbingrose Mon, 24 Mar 2008 04:42:47 -0700

Hi Uwe,

Thanks a lot for the code. I'm digging into it now!


Cheers,

Cuong

On Mon, Mar 24, 2008 at 7:41 PM, Uwe Goetzke <[EMAIL PROTECTED]>
wrote:

> Hi Cuong ,
>
> I have written a TolerantPhraseScorer starting with the code from
> PhraseScorer but I think I have modified it to much to be generally useful.
> We use it with bigramm clusters and therefore does not need the slop factor
> for scoring but have a tolerance factor (depending on the length of the
> phrase). Here are the most relevant code fragments to start with...
> So the idea is to keep queue ordered (calling firstToLast2 and moveLast).
> I have not yet checked the code for optimisations. If you find one, I would
> be glad to hear about it... ;-)
>
>
> protected TolerantPhrasePositions first, last, reallast; // last point to
> the last tpp for the doc varying from tolerance to phrase size (reallast)
>
> protected int tolerance;
>
> /**
>         * similar to PhraseScorer but with a tolerance factor
>         *
>         * @see PhraseScorer
>         */
>        TolerantPhraseScorer(Weight weight, TermPositions[] tps, int[]
> positions, Similarity similarity,
>                             byte[] norms, int tolerance)
>        {
>                super(similarity);
>                this.norms = norms;
>                this.weight = weight;
>                this.value = weight.getValue();
>                this.tolerance = tolerance;
>                termsize = 0;
>                // convert tps to a list
>                for (int i = 0; i < tps.length; i++) {
>                        if (tps[i] != null) {
>                                TolerantPhrasePositions pp = new
> TolerantPhrasePositions(tps[i], positions[i]);
>                                termsize++;
>                                if (reallast != null) {
> // add next to end of list
>                                        reallast.next = pp;
>                                        pp.previous = reallast;
>                                }
>                                else
>                                        first = pp;
>                                reallast = pp;
>                                if ((termsize >= tolerance) && (last ==
> null))
>                                        last = pp;
>                        }
>                }
>                pq = new TolerantPhraseQueue(termsize);             //
> construct empty pq
>        }
>
>
> public boolean next() throws IOException
>        {
>                if (firstTime) {
>                        init();
>                        firstTime = false;
>                }
>                else if (more) {
>                        int doc = last.doc;
>                        while (doc == last.doc) {
>                                more = last.next();                     //
> trigger further scanning
>                                moveLast();
>                        }
>                }
>                return doNext();
>        }
>
>        // next without initial increment
>        private boolean doNext() throws IOException
>        {
>                while (more) {
>                        while (more && first.doc < last.doc) {      // find
> doc w/ all the terms
>                                more = first.skipTo(last.doc);
>  // skip first upto last
>                                firstToLast2();
>  // and move it to the end
>                        }
>                        if (more) {
>                                // found a doc with all of the terms
>                                freq = phraseFreq();
>  // check for phrase
>                                if (freq == 0.0f) {
>  // no match
>                                        int doc = last.doc;
>                                        while (doc == last.doc) {
>                                                more = last.next();
>             // trigger further scanning
>                                                moveLast();
>                                        }
>                                }
>                                else
>                                        return true;
>      // found a match
>                        }
>                }
>                return false;                                 // no more
> matches
>        }
>
>
> private void firstToLast2()
>        {
>                TolerantPhrasePositions newfirst = first.next;
>                TolerantPhrasePositions test = last;
>                TolerantPhrasePositions insertp = test;
>                while ((test != null) && (first.doc >= test.doc)) {
>                        insertp = test;
>                        test = test.next;
>                }
>                if (insertp == null) { // last elem should not happen
>                        System.out.println("firstToLast2->insertp==null");
>                }
>                else {
>                        first.previous = insertp;  // einkoppeln
>                        first.next = insertp.next;
>                        if (first.next != null)
>                                first.next.previous = first;
>                        insertp.next = first;
>                        if (test == null) {
>                                reallast = first;
>                                reallast.next = null;
>                        }
>                }
>                last = last.next;
>                first = newfirst;
>                first.previous = null;
>        }
>
> private void moveLast()
>        {
>                TolerantPhrasePositions test = last;
>                TolerantPhrasePositions insertp = null;
>                while ((test != null) && (last.doc >= test.doc)) {
>                        insertp = test;
>                        test = test.next;
>                }
>                if (insertp == null) { // last elem should not happen
>                        System.out.println("insertp==null");
>                }
>                else {
>                        if (insertp != last) {
>                                TolerantPhrasePositions prev =
> last.previous;   // dequeue
>                                if (prev != null) { // if only 1 character!
>                                        prev.next = last.next;
>                                        prev.next.previous = prev;
>                                }
>                                last.previous = insertp;  // enqueue
>                                last.next = insertp.next;
>                                if (last.next != null)
>                                        last.next.previous = last;
>                                insertp.next = last;
>
>                                if (test == null) {
>                                        reallast = last;
>                                        reallast.next = null;
>                                }
>                                if (prev != null) { // if only 1 character!
>                                        last = prev.next;
>                                }
>                        }
>                }
>        }
>
> Best Regards
>
> Uwe
>
>
> -----Ursprüngliche Nachricht-----
> Von: climbingrose [mailto:[EMAIL PROTECTED]
> Gesendet: Montag, 24. März 2008 00:37
> An: java-user
> Betreff: Implement a relaxed PhraseQuery?
>
> Hi all,
>
> I posted this in Solr mailing but then I thought it would be more
> appropriate to have it here.
>
> I thought many people would encounter the situation I'm having here.
> Basically, we'd like to have a PhraseQuery with "minimum should match"
> property similar to BooleanQuery. Consider the query "Senior Java
> Developer":
>
> 1) I'd like to do a PhraseQuery on "Senior Java Developer" with a slop of
> say 2, so that the query only matches documents with these words located
> in
> proximity. I don't want to match documents like "Senior <Huge block of
> text>
> Java <Huge block of Text> Developer".
> 2) I also want to relax PhraseQuery a bit so that it not only match
> "Senior
> Java Developer"~2 but also matches "Java Developer"~2 but of course with a
> lower score. I can programmatically generate on the combination but it's
> not
> gonna be efficient if user issues query with many terms.
>
> It looks like the only solution is to hack PhraseScorer and its
> subclasses.
> Has anyone done this before? If yes, please share your experience.
>
>
> --
> Regards,
>
> Cuong Hoang
>
> -----------------------------------------------------------------------
> Healy Hudson GmbH - D-55252 Mainz Kastel
> Geschäftsführer Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076
>
> Diese Email ist vertraulich. Wenn Sie nicht der beabsichtigte Empfänger
> sind, dürfen Sie die Informationen nicht offen legen oder benutzen. Wenn Sie
> diese Email durch einen Fehler bekommen haben, teilen Sie uns dies bitte
> umgehend mit, indem Sie diese Email an den Absender zurückschicken. Bitte
> löschen Sie danach diese Email.
> This email is confidential. If you are not the intended recipient, you
> must not disclose or use this information contained in it. If you have
> received this email in error please tell us immediately by return email and
> delete the document.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: Implement a relaxed PhraseQuery?

Reply via email to