Re: HTML pages highlighter

Erik Hatcher Wed, 06 Apr 2005 15:44:59 -0700

What file do those line numbers correspond to?  I'm lost.

Did the Lucene in Action highlighting code work for you?

        Erik

On Apr 6, 2005, at 6:16 PM, Yagnesh Shah wrote:

Hi Erik,

Yes, the basics seem to be working.

a) There is a chance that the query is not present in the stored content of a file, so I sometimes get empty strings at line#106, and I have had to put a special check at line#109 and line#126. I guess this is not a problem. What do you think?

b) When I click on a doc path that was generated by line#120 and line#121, the file that opens does not have the searched query highlighted. Any suggestions on how I can do this?
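For point (a), one possible shape for the "special check" is a small fallback: use the highlighter output when it is non-empty, otherwise show the start of the stored text. This is only a sketch; `SnippetFallback`, `snippetOrFallback`, and the `maxChars` cut-off are hypothetical names introduced for illustration and are not part of Lucene or of the code in this thread.

```java
public class SnippetFallback {

    // Hypothetical helper (not a Lucene API): prefer the highlighted
    // fragments; if the highlighter produced nothing, fall back to the
    // first maxChars characters of the stored text.
    public static String snippetOrFallback(String highlighted,
                                           String storedText,
                                           int maxChars) {
        if (highlighted != null && highlighted.length() > 0) {
            return highlighted;
        }
        if (storedText == null) {
            return "";                      // nothing stored, nothing to show
        }
        if (storedText.length() <= maxChars) {
            return storedText;              // short enough to show whole
        }
        return storedText.substring(0, maxChars) + "...";
    }

    public static void main(String[] args) {
        // Highlighter found no matching terms: fall back to the stored text.
        System.out.println(snippetOrFallback("", "Lorem ipsum dolor sit amet", 11));
        // Highlighter found a match: its output is used unchanged.
        System.out.println(snippetOrFallback("<B>ipsum</B> dolor", "Lorem ipsum dolor", 11));
    }
}
```

In the JHTML page this would wrap the string returned by getBestFragments before it is printed, so the per-hit null and empty checks live in one place.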


-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, April 04, 2005 8:45 PM
To: java-user@lucene.apache.org
Subject: Re: HTML pages highlighter

On Apr 4, 2005, at 5:35 PM, Yagnesh Shah wrote:

        I ended up purchasing your book "Lucene in Action" and have downloaded
your code samples. I am able to retrieve "result" only some of the time.
Below is the code I have taken from Search.jhtml in the Lucene demo. I
have two problems:

a) I am unable to display "result" using
b) When I click on the title to retrieve the document, I do not see my
query highlighted.


First things first.... get something very very simple working and
expand from there.  Here is the simple code from our HighlightIt.java:

     TermQuery query = new TermQuery(new Term("f", "ipsum"));
     QueryScorer scorer = new QueryScorer(query);
     SimpleHTMLFormatter formatter =
         new SimpleHTMLFormatter("<span class=\"highlight\">",
             "</span>");
     Highlighter highlighter = new Highlighter(formatter, scorer);
     Fragmenter fragmenter = new SimpleFragmenter(50);
     highlighter.setTextFragmenter(fragmenter);

     TokenStream tokenStream = new StandardAnalyzer()
         .tokenStream("f", new StringReader(text));

     String result =
         highlighter.getBestFragments(tokenStream, text, 5, "...");

One trick is that you must ensure the query you are passing to
QueryScorer has been rewritten.  In our simple TermQuery case, that is
not necessary, but in a general application it is.  You can call
query.rewrite(reader) where reader is your IndexReader instance.  This
ensures that range, fuzzy, and wildcard queries are expanded and
highlightable.

I'm not sure what is wrong with the code you are trying.  But again,
start simple, just try out our HighlightIt or our HighlightTest.  If
those work fine for you then move on to integrating further with your
index.  Besides the Query.rewrite() trick, you have to be sure that the
text you want to highlight is available.  If you're pulling it from the
index, it must be in a stored field, otherwise you need to retrieve it
from elsewhere.
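As a rough illustration of "retrieve it from elsewhere": when the "contents" field was not stored, the text could be re-read from the file named by the document's "path" field. The sketch below assumes the path points at a readable UTF-8 file; `HighlightSource` and `textToHighlight` are hypothetical names, not Lucene APIs. For HTML files the raw file still contains markup, so in practice the tags would have to be stripped first (for example with the same HTMLParser used at indexing time).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HighlightSource {

    // Hypothetical helper: return the stored field value when there is one,
    // otherwise read the text back from the original file.
    // Assumes the file is UTF-8 encoded.
    public static String textToHighlight(String storedContents, String path) {
        if (storedContents != null) {
            return storedContents;          // field was stored in the index
        }
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Surface a missing or unreadable file as an unchecked exception.
            throw new UncheckedIOException(e);
        }
    }
}
```

With the code in this thread, the two arguments would be doc.get("contents") and doc.get("path").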

        Erik


<java>

  Searcher searcher = new IndexSearcher(getReader(indexName));

  // get query from request
  String queryString = request.getParameter("query");

  query = QueryParser.parse(queryString, "contents", analyzer);
  Hits hits = searcher.search(query);
  SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
  Highlighter highlighter =
      new Highlighter(formatter, new QueryScorer(query));
  highlighter.setTextFragmenter(new SimpleFragmenter(50));
  String FIELD_NAME = "contents";

  for (int i = start; i < end; i++) {             // display the hits
    Document doc = hits.doc(i);
    String text = hits.doc(i).get(FIELD_NAME);
    int maxNumFragmentsRequired = 5;
    String fragmentSeparator = "...";
    if (text != null) {
      TokenStream tokenStream = new StandardAnalyzer()
          .tokenStream(FIELD_NAME, new java.io.StringReader(text));
      String result = highlighter.getBestFragments(
          tokenStream, text, maxNumFragmentsRequired, fragmentSeparator);
      System.out.println("result=" + result);
    }

    String title = doc.get("title");
    if (title.equals(""))                         // use url for docs w/o title
      title = doc.get("path");
</java>
<java type=print>(int)(hits.score(i) * 100.0f)</java>%
<a href="`doc.get("path")`">
  <java type=print>Entities.encode(title)</java>
</a>
<java>
    if (showSummaries) {                          // maybe show summary
</java>
<ul>Summary: <java type=print>Entities.encode(doc.get("summary"))</java></ul>
<java>
    }
  }
</java>

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 31, 2005 8:04 PM
To: java-user@lucene.apache.org
Subject: Re: HTML pages highlighter

On Mar 31, 2005, at 6:36 PM, Yagnesh Shah wrote:

    try {
      fis = new FileInputStream(f);
      HTMLParser parser = new HTMLParser(fis);

      // Add the tag-stripped contents as a Reader-valued Text field so it will
      // get tokenized and indexed.
//      doc.add(new Field("contents", parser.getReader()));
      LineNumberReader reader = new LineNumberReader(parser.getReader());
      for (String l = reader.readLine(); l != null; l = reader.readLine())
//        System.out.println(l);
      doc.add(Field.Text("contents", l));


Notice that your loop here is adding a "contents" field for *every*
line read since that is where the first semi-colon is.
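One way to end up with a single "contents" value is to collect everything the Reader produces into one String first and add the field once, after the loop. A minimal sketch, using only the JDK; `ReaderText` and `readAll` are hypothetical names introduced here, not part of Lucene or the demo code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class ReaderText {

    // Hypothetical helper: drain a Reader line by line into one String,
    // ending each line with '\n'. The reader is closed when done.
    public static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            for (String line = reader.readLine(); line != null;
                 line = reader.readLine()) {
                sb.append(line).append('\n');   // braces keep the body explicit
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The indexing code would then make one call, doc.add(Field.Text("contents", ReaderText.readAll(parser.getReader()))), instead of one call per line.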

Look at using Luke to explore your index.  Try indexing just a dummy
String:

        doc.add(Field.Text("contents", "some dummy text"));

to show that it works.  Always always always simplify a complicated
situation by doing the most obvious thing that _should_ work.

Also, the demo Lucene code is not really designed to be used in a
production application (sadly), so you're better off borrowing code
from the many articles or our book to begin with.

        Erik


      // Add the summary as a field that is stored and returned with
      // hit documents for display.
      doc.add(new Field("summary", parser.getSummary(),
          Field.Store.YES, Field.Index.NO));

      // Add the title as a field so that it can be searched and is stored.
      doc.add(new Field("title", parser.getTitle(),
          Field.Store.YES, Field.Index.TOKENIZED));
    }

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 30, 2005 7:38 PM
To: java-user@lucene.apache.org
Subject: Re: HTML pages highlighter

On Mar 30, 2005, at 4:46 PM, Yagnesh Shah wrote:

Hi! Eric,


Erik - with a 'k' - Sorry, I let it slide once though :)

        I tried to modify that with this, but I get a compile error. Do you
have any code snippet of highlighting code that pulls the contents from the
original source?


I have a whole book full of code examples :)
http://www.lucenebook.com - Grab the source code and look in
src/lia/tools at Highlight*.java

Or do you know how I can store the field?

      doc.add(new Field("contents", parser.getReader(),
          Field.Store.YES, Field.Index.NO));

You cannot store it with a Reader. You need to use Field.Text(String, String), or one of the other variations.

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: HTML pages highlighter

Reply via email to