RE: HTML pages highlighter

Yagnesh Shah Wed, 06 Apr 2005 15:16:20 -0700

Hi! Erik,
        Yes, the basic example seems to be working.
a) My problem is that the query may not be present in the stored content
of a file, so sometimes I get an empty string at line#106, and I have had to
add a special check at line#109 and line#126. I guess this is not a problem.
What do you think?
b) When I click on a doc path that was generated by line#120 and line#121, the
file that opens does not have the searched query highlighted. Any suggestions
for this? How can I do it?
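
For (a), one way to handle an empty highlighter result is a small fallback to the start of the stored text. This is only a sketch in plain Java: the class name, `snippetOrFallback`, and its parameters are made up for illustration and are not part of Lucene or the demo code.

```java
class SnippetFallback {

    // Hypothetical helper: returns the highlighted fragment when there is one,
    // otherwise the first maxChars characters of the stored text, or ""
    // when no stored text is available either.
    static String snippetOrFallback(String highlighted, String storedText, int maxChars) {
        if (highlighted != null && !highlighted.isEmpty()) {
            return highlighted;
        }
        if (storedText == null || storedText.isEmpty()) {
            return "";
        }
        if (storedText.length() <= maxChars) {
            return storedText;
        }
        return storedText.substring(0, maxChars) + "...";
    }

    public static void main(String[] args) {
        // The highlighter found nothing, so the start of the text is shown.
        System.out.println(snippetOrFallback("", "Lorem ipsum dolor sit amet", 11));
        // The highlighter found a fragment, so it is returned unchanged.
        System.out.println(snippetOrFallback("<b>ipsum</b> dolor", "Lorem ipsum dolor sit amet", 11));
    }
}
```

The fallback text is plain and unescaped, so it would still need the same encoding step as any other text written into the page.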


-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, April 04, 2005 8:45 PM
To: [email protected]
Subject: Re: HTML pages highlighter


On Apr 4, 2005, at 5:35 PM, Yagnesh Shah wrote:
>       I ended up purchasing your book "Lucene in Action". I have downloaded
> your code samples. I am able to retrieve "result" only some of the time.
> Below is the code I have taken from Search.jhtml in the Lucene demo. I
> have 2 problems:
>
> a) I am unable to display "result" using
> b) When I click on the title to retrieve document I do not see my  
> query highlighted.

First things first.... get something very very simple working and  
expand from there.  Here is the simple code from our HighlightIt.java:

     TermQuery query = new TermQuery(new Term("f", "ipsum"));
     QueryScorer scorer = new QueryScorer(query);
     SimpleHTMLFormatter formatter =
         new SimpleHTMLFormatter("<span class=\"highlight\">",
             "</span>");
     Highlighter highlighter = new Highlighter(formatter, scorer);
     Fragmenter fragmenter = new SimpleFragmenter(50);
     highlighter.setTextFragmenter(fragmenter);

     TokenStream tokenStream = new StandardAnalyzer()
         .tokenStream("f", new StringReader(text));

     String result =
         highlighter.getBestFragments(tokenStream, text, 5, "...");

One trick is that you must ensure the query you are passing to  
QueryScorer has been rewritten.  In our simple TermQuery case, that is  
not necessary, but in a general application it is.  You can call  
query.rewrite(reader) where reader is your IndexReader instance.  This  
ensures that range, fuzzy, and wildcard queries are expanded and  
highlightable.
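
To illustrate why the rewrite matters, here is a toy model in plain Java. It is not Lucene code and not how Lucene implements rewriting; `rewrite` and `highlight` below are hypothetical stand-ins. The point it tries to show is that a wildcard pattern carries no concrete terms a highlighter can match until it has been expanded against the terms in the index.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class RewriteSketch {

    // Toy stand-in for query rewriting: expand a trailing-wildcard pattern
    // such as "ips*" into the concrete terms present in the (toy) index.
    // A pattern without a trailing '*' expands to itself.
    static Set<String> rewrite(String pattern, Set<String> indexTerms) {
        Set<String> terms = new TreeSet<>();
        if (!pattern.endsWith("*")) {
            terms.add(pattern);
            return terms;
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        for (String term : indexTerms) {
            if (term.startsWith(prefix)) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Toy stand-in for a highlighter: wraps each space-separated word
    // that exactly matches one of the given terms (case-insensitive).
    static String highlight(String text, Set<String> terms) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (terms.contains(word.toLowerCase())) {
                out.append("<b>").append(word).append("</b>");
            } else {
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> indexTerms = new TreeSet<>(Arrays.asList("lorem", "ipsum", "dolor"));
        String text = "lorem ipsum dolor";

        // Without rewriting, the only "term" is the literal pattern: no match.
        Set<String> raw = new TreeSet<>(Arrays.asList("ips*"));
        System.out.println(highlight(text, raw));

        // After rewriting, the pattern has become the concrete term "ipsum".
        System.out.println(highlight(text, rewrite("ips*", indexTerms)));
    }
}
```

The first line printed is the text unchanged; the second has "ipsum" wrapped in the highlight tags.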

I'm not sure what is wrong with the code you are trying.  But again,  
start simple, just try out our HighlightIt or our HighlightTest.  If  
those work fine for you then move on to integrating further with your  
index.  Besides the Query.rewrite() trick, you have to be sure that the  
text you want to highlight is available.  If you're pulling it from the  
index, it must be in a stored field, otherwise you need to retrieve it  
from elsewhere.
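
The "stored field, otherwise elsewhere" choice could be sketched like this in plain Java. The class and method names are hypothetical, and it assumes the original file is still readable at a known path and is UTF-8 encoded. For HTML pages the raw file still contains markup, so it would need the same tag stripping that was applied at index time before being handed to the highlighter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextSource {

    // Hypothetical helper: prefer the value of the stored field (null when
    // the field was not stored); otherwise read the text from the source file.
    static String textToHighlight(String storedValue, Path sourceFile) throws IOException {
        if (storedValue != null) {
            return storedValue;
        }
        return new String(Files.readAllBytes(sourceFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("highlight-source", ".txt");
        Files.write(file, "lorem ipsum from file".getBytes(StandardCharsets.UTF_8));
        try {
            // Field was stored: the file is never read.
            System.out.println(textToHighlight("lorem ipsum from index", file));
            // Field was not stored: fall back to the file contents.
            System.out.println(textToHighlight(null, file));
        } finally {
            Files.delete(file);
        }
    }
}
```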

        Erik


>
> <java>
>
>   Searcher searcher = new IndexSearcher(getReader(indexName));
>
>   // get query from request
>   String queryString = request.getParameter("query");
>
>   query = QueryParser.parse(queryString, "contents", analyzer);
>   Hits hits = searcher.search(query);
>   SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
>   Highlighter highlighter =
>       new Highlighter(formatter, new QueryScorer(query));
>   highlighter.setTextFragmenter(new SimpleFragmenter(50));
>   String FIELD_NAME = "contents";
>
>   for (int i = start; i < end; i++) {             // display the hits
>   Document doc = hits.doc(i);
>   String text = hits.doc(i).get(FIELD_NAME);
>   int maxNumFragmentsRequired = 5;
>   String fragmentSeparator = "...";
>   if ( text != null){
>       TokenStream tokenStream = new StandardAnalyzer()
>           .tokenStream(FIELD_NAME, new java.io.StringReader(text));
>       String result = highlighter.getBestFragments(
>           tokenStream, text, maxNumFragmentsRequired, fragmentSeparator);
>       System.out.println("result=" +result);
>   }
>
>     String title = doc.get("title");
>     // use url for docs w/o title (doc.get returns null for a missing field)
>     if (title == null || title.equals(""))
>       title = doc.get("path");
>     </java>
>     <p><b><java type=print>(int)(hits.score(i) * 100.0f)</java>%
>     <a href="`doc.get("path")`">
>     <java type=print>Entities.encode(title)</java>
>     </a></b>
>     <java>
>     if (showSummaries) {                          // maybe show summary
>     </java>
>     <ul><i>Summary</i>:
>       <java type=print>Entities.encode(doc.get("summary"))</java>
>     </ul>
>     <java>
>     }
>   }
> </java>
>
>
>
> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 31, 2005 8:04 PM
> To: [email protected]
> Subject: Re: HTML pages highlighter
>
>
>
> On Mar 31, 2005, at 6:36 PM, Yagnesh Shah wrote:
>>     try {
>>       fis = new FileInputStream(f);
>>       HTMLParser parser = new HTMLParser(fis);
>>
>>       // Add the tag-stripped contents as a Reader-valued Text field
>>       // so it will get tokenized and indexed.
>> //      doc.add(new Field("contents", parser.getReader()));
>>       LineNumberReader reader = new
>> LineNumberReader(parser.getReader());
>>       for (String l = reader.readLine(); l != null; l =
>> reader.readLine())
>> //        System.out.println(l);
>>       doc.add(Field.Text("contents", l));
>
> Notice that your loop here is adding a "contents" field for *every*
> line read since that is where the first semi-colon is.
>
> Look at using Luke to explore your index.  Try indexing just a dummy
> String:
>
>       doc.add(Field.Text("contents", "some dummy text"));
>
> to show that it works.  Always always always simplify a complicated
> situation by doing the most obvious thing that _should_ work.
>
> Also, the demo Lucene code is not really designed to be used in a
> production application (sadly), so you're better off borrowing code
> from the many articles or our book to begin with.
>
>       Erik
>
>
>>
>>       // Add the summary as a field that is stored and returned with
>>       // hit documents for display.
>>       doc.add(new Field("summary", parser.getSummary(),
>> Field.Store.YES, Field.Index.NO));
>>
>>       // Add the title as a field so that it can be searched and is
>>       // stored.
>>       doc.add(new Field("title", parser.getTitle(), Field.Store.YES,
>> Field.Index.TOKENIZED));
>>     }
>>
>>
>>
>> -----Original Message-----
>> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, March 30, 2005 7:38 PM
>> To: [email protected]
>> Subject: Re: HTML pages highlighter
>>
>>
>>
>> On Mar 30, 2005, at 4:46 PM, Yagnesh Shah wrote:
>>
>>> Hi! Eric,
>>
>> Erik - with a 'k' - Sorry, I let it slide once though :)
>>
>>>     I tried to modify it like this, but I get a compile error. Do you
>>> have any code snippet of highlighting code that pulls the contents from
>>> the original source?
>>
>> I have a whole book full of code examples :)
>> http://www.lucenebook.com - Grab the source code and look in
>> src/lia/tools at Highlight*.java
>>
>>>  Or do you know how I can store the field?
>>>
>>>       doc.add(new Field("contents", parser.getReader(),
>>> Field.Store.YES, Field.Index.NO));
>>
>> You cannot store it with a Reader.  You need to use Field.Text(String,
>> String), or one of the other variations.
>>
>>      Erik
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]