Re: how to control terms to be highlighted?

Harini Raghavan Mon, 05 Dec 2005 03:30:33 -0800

Hi,

I was able to use the Highlighter API to extract the text where the keywords occur. However, I am facing another related problem. My application downloads the news items to the local server. The indexer API parses these HTML files, extracts the content, and stores it in the index. The parser extracts all the text in the HTML page, including the title, headings, etc. So when the highlighter is run on this content, instead of highlighting the keywords in the main content, it just shows the title or words found at the beginning of the page.

For example, for the article at http://biz.yahoo.com/rb/051130/apple.html , the highlighted text is something like this:

Options Order Book Symbol Lookup Reuters Apple may launch Intel laptops: analyst Wednesday November 30, 10:24 am ET NEW

My requirement is to extract the best fragment/sentence from the news article where the keywords appear (similar to Google) and display it below the search result. But the text extracted above is not really the best fragment; it seems to be just the first fragment that contains the keywords. Has someone implemented this kind of functionality?
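
To make the requirement concrete, here is a minimal, library-free sketch of "best fragment" selection. It is not the Highlighter's own algorithm: it assumes the text is split into sentences on end punctuation, scores each sentence by the number of distinct keywords it contains, and returns the top scorer. The class name, method names and sample text are made up for illustration, and keywords are assumed to be supplied in lower case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class BestFragmentSketch {

    /** Counts how many distinct keywords occur in the fragment (case-insensitive, whole words). */
    static int score(String fragment, Set<String> keywords) {
        Set<String> seen = new HashSet<>();
        for (String token : fragment.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (keywords.contains(token)) {
                seen.add(token);
            }
        }
        return seen.size();
    }

    /** Splits the text into sentences and returns the highest-scoring one, or "" if none matches. */
    static String bestFragment(String text, Set<String> keywords) {
        String best = "";
        int bestScore = 0;
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            int s = score(sentence, keywords);
            if (s > bestScore) { // strict '>' keeps the earliest sentence on ties
                bestScore = s;
                best = sentence.trim();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical sample text: page furniture first, article body afterwards.
        Set<String> keywords = new HashSet<>(Arrays.asList("apple", "intel"));
        String text = "Options Order Book Symbol Lookup. "
                + "Apple may launch Intel laptops, an analyst said. "
                + "Shares of Apple rose.";
        System.out.println(bestFragment(text, keywords));
    }
}
```

Scoring on distinct keywords rather than raw hit counts keeps a sentence that repeats one keyword from outranking a sentence that contains several different ones.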


-Harini



Harini Raghavan wrote:

Hi Chris,

Can we pass one query object for searching and a different one to the highlighter? I am not sure about that. In any case, based on Mark's suggestion I modified the QueryTermExtractor class to filter the query terms by fieldName.

Attached is the modified file.

Thanks,
Harini



Chris Hostetter wrote:

I don't know what your application is, and I have no experience with the
Highlighter code, so forgive me if this is a silly suggestion:

It looks like you are building a query up programmatically, which
contains some words to search on, and some other stuff that's mainly
being used to "filter" the results (I'll avoid my usual rant about
people underutilizing Filters).  So why not pass the Highlighter just
the portion of the Query that you actually want to contribute to the
highlighting?  In this query...

: >> +DocumentType:news
: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
: >> +FilingDate:[20041201 TO 20051201]
: >> +(Content:"cost saving" Content:"cost savings"
: >>   Content:outsource Content:outsources
: >>   Content:downsize Content:downsizes
: >>   Content:restructuring Content:restructure)

...just give the highlighter...

   (Content:"cost saving" Content:"cost savings"
    Content:outsource
    Content:outsources Content:downsize
    Content:downsizes
    Content:restructuring Content:restructure)
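
A minimal sketch of that idea, modelled with plain query strings rather than Lucene Query objects: the Content clauses are defined once, then reused both inside the full search query and on their own as the highlighter query. The class and method names are hypothetical, and turning the strings into real Query objects is left out.

```java
import java.util.Arrays;
import java.util.List;

final class HighlightQuerySketch {

    /** Renders optional clauses as one parenthesised group, e.g. (Content:a Content:b). */
    static String group(List<String> clauses) {
        return "(" + String.join(" ", clauses) + ")";
    }

    /** Full search query: every filter clause is required, plus the required content group. */
    static String searchQuery(List<String> filters, List<String> content) {
        StringBuilder sb = new StringBuilder();
        for (String filter : filters) {
            sb.append('+').append(filter).append(' ');
        }
        return sb.append('+').append(group(content)).toString();
    }

    /** Highlighter query: only the content group, so filter values are never highlighted. */
    static String highlightQuery(List<String> content) {
        return group(content);
    }

    public static void main(String[] args) {
        List<String> filters = Arrays.asList(
                "DocumentType:news",
                "(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)",
                "FilingDate:[20041201 TO 20051201]");
        List<String> content = Arrays.asList(
                "Content:\"cost saving\"", "Content:outsource", "Content:downsize");
        System.out.println(searchQuery(filters, content)); // goes to the searcher
        System.out.println(highlightQuery(content));       // goes to the highlighter
    }
}
```

Because both queries are derived from the same content list, the highlighted terms cannot drift out of sync with the terms that were actually searched.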


: Date: Thu, 01 Dec 2005 10:38:41 +0530
: From: Harini Raghavan <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how to control terms to be highlighted?
:
: Hi Mark,
:
: It would be great if you could make this change and send the
: QueryTermExtractor class. I am invoking the QueryScorer(Query)
: constructor. Should I use QueryScorer(Query query, IndexReader reader,
: String fieldName) instead for this to work?
:
: Thanks,
: Harini
:
: mark harwood wrote:
:
: >>Is there any way to restrict the highlighter to
: >>highlight only the values
: >>mentioned for the field 'Content'?
: >
: >The problem lies in the QueryTermExtractor class
: >which is typically used to provide the Highlighter
: >with the list of strings to identify in the text. It
: >currently has no filter for fieldname - you could add
: >this without too much effort.
: >
: >I could make this modification but it may change the
: >behaviour of existing applications - currently the
: >QueryTermExtractor method that takes a fieldname only
: >uses that fieldname to derive IDF weightings, the
: >proposed change would also have the effect of
: >filtering out any query terms that weren't for this
: >field.
: >Would this change be a problem for anyone?
: >
: >Cheers,
: >Mark
: >
: >--- Harini Raghavan <[EMAIL PROTECTED]> wrote:
: >
: >>Hi,
: >>
: >>I have a requirement to highlight search keywords in
: >>the results and
: >>display the matching fragment of the text with the
: >>results. I am using
: >>the Hits highlighting mentioned in Lucene in Action.
: >>
: >>Here is the search query(BooleanQuery) I am passing
: >>to the IndexSearcher
: >>and QueryScorer:
: >> +DocumentType:news
: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
: >> +FilingDate:[20041201 TO 20051201]
: >> +(Content:"cost saving" Content:"cost savings"
: >>   Content:outsource Content:outsources
: >>   Content:downsize Content:downsizes
: >>   Content:restructuring Content:restructure)
: >>
: >>My requirement is to highlight only the keywords for
: >>'Content' field,
: >>but the highlighter api is also highlighting words
: >>like 'news', '10',
: >>'40' etc.
: >>Is there any way to restrict the highlighter to
: >>highlight only the values
: >>mentioned for the field 'Content'?
: >>
: >>Thanks,
: >>Harini
: >>
: >>---------------------------------------------------------------------
: >>To unsubscribe, e-mail: [EMAIL PROTECTED]
: >>For additional commands, e-mail: [EMAIL PROTECTED]
: >



-Hoss



------------------------------------------------------------------------

package org.apache.lucene.search.highlight;
/**
* Copyright 2002-2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;

/**
* Utility class used to extract the terms used in a query, plus any weights.
* This class will not find terms for MultiTermQuery, RangeQuery and PrefixQuery
* classes, so the caller must pass a rewritten query (see Query.rewrite) to
* obtain a list of expanded terms.
*
*/
public final class QueryTermExtractor
{

        /**
         * Extracts all terms texts of a given Query into an array of WeightedTerms
         *
         * @param query      Query to extract term texts from
         * @return an array of the terms used in a query, plus their weights.
         */
        public static final WeightedTerm[] getTerms(Query query)
        {
                // an empty fieldName means "do not filter by field"
                return getTerms(query,false,"");
        }

        /**
         * Extracts all terms texts of a given Query into an array of WeightedTerms
         *
         * @param query      Query to extract term texts from
         * @param reader used to compute IDF which can be used to a) score selected fragments better
         * b) use graded highlights eg changing intensity of font color
         * @param fieldName the field on which Inverse Document Frequency (IDF) calculations are based;
         * when non-empty, terms that belong to any other field are filtered out
         * @return an array of the terms used in a query, plus their weights.
         */
        public static final WeightedTerm[] getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
        {
                WeightedTerm[] terms=getTerms(query,false,fieldName);
                int totalNumDocs=reader.numDocs();
                for (int i = 0; i < terms.length; i++)
                {
                        try
                        {
                                int docFreq=reader.docFreq(new Term(fieldName,terms[i].term));
                                //IDF algorithm taken from DefaultSimilarity class
                                float idf=(float)(Math.log((float)totalNumDocs/(double)(docFreq+1)) + 1.0);
                                terms[i].weight*=idf;
                        }
                        catch (IOException e)
                        {
                                //ignore: the term keeps its unweighted boost
                        }
                }
                return terms;
        }

        /**
         * Extracts all terms texts of a given Query into an array of WeightedTerms
         *
         * @param query      Query to extract term texts from
         * @param prohibited <code>true</code> to extract "prohibited" terms, too
         * @param fieldName  only terms of this field are extracted; pass "" to extract terms of all fields
         * @return an array of the terms used in a query, plus their weights.
         */
        public static final WeightedTerm[] getTerms(Query query, boolean prohibited, String fieldName)
        {
                HashSet terms=new HashSet();
                getTerms(query,terms,prohibited,fieldName);
                return (WeightedTerm[]) terms.toArray(new WeightedTerm[0]);
        }

        private static final void getTerms(Query query, HashSet terms, boolean prohibited, String fieldName)
        {
                if (query instanceof BooleanQuery)
                        getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited, fieldName);
                else if (query instanceof PhraseQuery)
                        getTermsFromPhraseQuery((PhraseQuery) query, terms, fieldName);
                else if (query instanceof TermQuery)
                        getTermsFromTermQuery((TermQuery) query, terms, fieldName);
                else if (query instanceof SpanNearQuery)
                        getTermsFromSpanNearQuery((SpanNearQuery) query, terms, fieldName);
        }

        private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms, boolean prohibited, String fieldName)
        {
                BooleanClause[] queryClauses = query.getClauses();
                int i;

                for (i = 0; i < queryClauses.length; i++)
                {
                        if (prohibited || !queryClauses[i].prohibited)
                                getTerms(queryClauses[i].query, terms, prohibited, fieldName);
                }
        }

        private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms, String fieldName)
        {
                Term[] queryTerms = query.getTerms();
                int i;

                for (i = 0; i < queryTerms.length; i++)
                {
                        // keep the term when no field filter is set or when its field matches
                        if (fieldName.equals("") || queryTerms[i].field().equals(fieldName))
                                terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
                }
        }

        private static final void getTermsFromTermQuery(TermQuery query, HashSet terms, String fieldName)
        {
                Term term = query.getTerm();
                // keep the term when no field filter is set or when its field matches
                if (fieldName.equals("") || term.field().equals(fieldName))
                        terms.add(new WeightedTerm(query.getBoost(),term.text()));
        }

        private static final void getTermsFromSpanNearQuery(SpanNearQuery query, HashSet terms, String fieldName)
        {
                Collection queryTerms = query.getTerms();

                for (Iterator iterator = queryTerms.iterator(); iterator.hasNext();)
                {
                        Term term = (Term) iterator.next();
                        // keep the term when no field filter is set or when its field matches
                        if (fieldName.equals("") || term.field().equals(fieldName))
                                terms.add(new WeightedTerm(query.getBoost(), term.text()));
                }
        }

}

------------------------------------------------------------------------
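
For reference, the IDF weighting applied in getIdfWeightedTerms above can be tried in isolation. The sketch below copies only the arithmetic, idf = ln(numDocs / (docFreq + 1)) + 1, into a standalone class; the class name and the document counts are hypothetical and no Lucene classes are involved.

```java
final class IdfWeightSketch {

    /** idf = ln(numDocs / (docFreq + 1)) + 1, the expression used in the attached class. */
    static float idf(int numDocs, int docFreq) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Scales a term's query boost by its idf, mirroring terms[i].weight *= idf. */
    static float weight(float boost, int numDocs, int docFreq) {
        return boost * idf(numDocs, docFreq);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical index size
        System.out.println("rare term   (docFreq=9):   " + idf(numDocs, 9));
        System.out.println("common term (docFreq=499): " + idf(numDocs, 499));
    }
}
```

Rare terms get the larger weight, so a fragment containing an uncommon keyword scores higher than one that only contains a keyword found in most documents.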
