David,
I tried your technique. I am streaming the PDF file directly into the
Lucene highlighter as shown below, and I get an NPE in

    highlighter.getBestFragments(tokenStream, docAsString, 3, "...");

The API doc is not very clear here. I also fed the contents of the query
string (instead of docAsString) to this method and I still get the NPE.
Can you shed some light on this, please? Please post your code snippet
if you can!

My code snippet:

    File f = new File(sourceDocLocation);
    if (!f.exists()) {
        log.debug("File does not exist" + f.getAbsolutePath() + " " + f.getName());
        return null;
    }
    org.apache.lucene.document.Document doc = LucenePDFDocument.getDocument(f);
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream tokenStream = new SimpleAnalyzer().tokenStream(FIELD_NAME, new FileReader(f));
    doc.add(Field.Text("contents", new FileReader(f)));
    // Get 3 best fragments and separate with a "..."
    // =========>>>>>>>> the NPE is thrown on the next line <<<<<<<<========
    result = highlighter.getBestFragments(tokenStream, queryString, 3, "...");

Thanks,
Vijay Balasubramanian
DPRA Inc.,
214 665 7503


David Spencer <dave-lucene-user@tropo.com>
To: Lucene Users List <[EMAIL PROTECTED]>
cc:
Subject: Re: Highlighting PDF file after the search
09/20/2004 05:02 PM
Please respond to Lucene Users List

[EMAIL PROTECTED] wrote:
> Hello,
>
> I can successfully index and search the PDF documents, however I am not
> able to highlight the searched text in my original PDF file (i.e. like
> dtSearch highlights on the original file).
>
> I took a look at the highlighter in sandbox, compiled it and have it
> ready. I am wondering if this highlighter is for highlighting indexed
> documents or can it be used for PDF files as is! Please enlighten!

I did this a few weeks ago. There are two ways, and they both revolve
around the same thing: you need the tokenized PDF text available.

[a] Store the tokenized PDF text in the index, or in some other file on disk, i.e.
a "cache" (but "cache" is a misleading term, as you can't have a cache
miss unless you can do [b]).

[b] Tokenize it on the fly when you call getBestFragments(): the 1st
arg, the TokenStream, should be one that takes a PDF file as input and
tokenizes it.

http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)

> Thanks,
>
> Vijay Balasubramanian
> DPRA Inc.,
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
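
Both [a] and [b] come down to having the extracted PDF text at hand when highlighting. The toy sketch below is plain Java and is not the Lucene Highlighter API. It only illustrates why: a highlighter cuts its output from the text using the character offsets carried by the tokens, so the tokens and the text should come from the same extracted string. The class name OffsetHighlightSketch, its methods, and the <B> markup are all hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only; this is NOT the Lucene Highlighter API. */
class OffsetHighlightSketch {

    /** A term plus the character offsets it occupies in the source text. */
    static final class Token {
        final String term;
        final int start; // inclusive offset into the source text
        final int end;   // exclusive offset into the source text

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on non-letters and lower-cases each term, recording its offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetter(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            tokens.add(new Token(term, start, i));
        }
        return tokens;
    }

    /** Wraps each matching token in <B>...</B>, slicing text by the token offsets. */
    static String highlight(List<Token> tokens, String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokens) {
            if (!queryTerms.contains(t.term)) {
                continue;
            }
            out.append(text, last, t.start);
            out.append("<B>").append(text, t.start, t.end).append("</B>");
            last = t.end;
        }
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text already extracted from the PDF (assumption for the demo).
        String pdfText = "Lucene can index PDF text. Highlighting needs that text again.";
        Set<String> query = new HashSet<>(Arrays.asList("pdf", "highlighting"));

        // Tokens and text come from the SAME string, so the offsets line up.
        List<Token> tokens = tokenize(pdfText);
        System.out.println(highlight(tokens, pdfText, query));
    }
}
```

In this sketch, passing a different string as the text (the query string, say) while the tokens come from the document would make the offsets point at the wrong characters or past the end of the string, which is why the same extracted text is used for both arguments.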