Re: Seeking advice on index parameter settings for large index
Chuck Williams wrote:
> index.setMaxBufferedDocs(10); // Buffer 10 documents at a time in memory (they could be big)

You might use a larger value here for the index with the small documents. I've successfully used values as high as 1000 when indexing documents that average a few kilobytes with a heap of a few hundred megabytes. This can make indexing a lot faster. Note that this is the number of single-document indexes that are buffered, not document text. Indexes are typically smaller than the text.

> index.setMaxMergeDocs(10); // Yields about 75 large segments for 7.5 million docs (plus log2 smaller segments) = 100 total

This is reasonable while incrementally indexing, in order to bound the delay while adding documents. But I would use Integer.MAX_VALUE during the initial build. 75 segments are much slower to search than one segment.

I think this is also a realistic assumption for most systems that are incrementally updated. For example, if you have "scheduled downtime" you can optimize the index. Or perhaps you can optimize at midnight every night, queueing updates while the optimize runs. If there's never downtime, and updates must always be prompt, you can, as a background process, periodically copy the index, optimize the copy, apply queued updates until it is in sync with the live index, and then swap them. There are lots of ways to implement this, but, in short, you should never need to have 75 segments, only ever 1 + log2(#updates_since_optimize).

> index.setUseCompoundFile(true); // false could improve performance but will consume more file handles

If you don't have 75 big segments, then you can probably afford to set this to false.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
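The segment counts discussed above can be checked with a little arithmetic. The following is only a sketch and does not call Lucene. The class and method names are made up for illustration, and the cap of 100,000 documents per segment is an assumption inferred from "75 large segments for 7.5 million docs" (7,500,000 / 75), not a value stated in the thread.

```java
// Rough arithmetic only: nothing here touches a real index.
// SegmentEstimate and its methods are hypothetical names.
class SegmentEstimate {

    // Number of full-size segments when no segment may exceed maxMergeDocs documents.
    static long largeSegments(long totalDocs, long maxMergeDocs) {
        return totalDocs / maxMergeDocs;
    }

    // The bound quoted above for an index that was optimized and then updated:
    // one big segment plus roughly log2(updates) smaller ones.
    static long segmentsAfterOptimize(long updatesSinceOptimize) {
        if (updatesSinceOptimize <= 0) {
            return 1; // freshly optimized: a single segment
        }
        // floor(log2(n)) computed from the position of the highest set bit
        long log2 = 63 - Long.numberOfLeadingZeros(updatesSinceOptimize);
        return 1 + log2;
    }

    public static void main(String[] args) {
        // Assumed cap of 100,000 docs per segment (see the note above).
        System.out.println(largeSegments(7_500_000L, 100_000L));
        // One million updates since the last optimize.
        System.out.println(segmentsAfterOptimize(1_000_000L));
    }
}
```

Under these assumptions the capped index has 75 large segments, while an index optimized one million updates ago has about 20, which is the gap the advice above is pointing at.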
HTML pages highlighter
Hello Lucene-User,

Has anyone tried to do highlighting with HTML pages?

I am trying to do this using the demo example from Keld H. Hansen's article "Unweaving a Tangled Web HTMLParser and Lucene", but I am getting a "null" value for text at line #47. Any idea?

     1  package org.apache.lucene.search.highlight;
     2
     3  import java.io.StringReader;
     4
     5  import org.apache.lucene.analysis.Analyzer;
     6  import org.apache.lucene.analysis.TokenStream;
     7  import org.apache.lucene.analysis.standard.StandardAnalyzer;
     8  import org.apache.lucene.queryParser.QueryParser;
     9  import org.apache.lucene.search.Hits;
    10  import org.apache.lucene.search.IndexSearcher;
    11  import org.apache.lucene.search.Query;
    12  import org.apache.lucene.search.highlight.Formatter;
    13  import org.apache.lucene.search.highlight.Highlighter;
    14  import org.apache.lucene.search.highlight.QueryScorer;
    15  import org.apache.lucene.search.highlight.SimpleFragmenter;
    16
    17  public class Searcher {
    18
    19      static Query query;
    20      static Hits hits;
    21
    22      private static final String FIELD_NAME = "contents";
    23      private static final String indexDir = "/opt/dynamo/prod/hww-doc/hww/help/index";
    24
    25      private static Analyzer analyzer = new StandardAnalyzer();
    26
    27      public static void main(String[] args) throws Exception {
    28
    29          IndexSearcher is = new IndexSearcher(indexDir);
    30          String searchCriteria = "scholarly";
    31          query = QueryParser.parse(searchCriteria, "contents", analyzer);
    32
    33          hits = is.search(query);
    34          System.out.println("found in: " + query + "\nhits-length:" + hits.length());
    35
    36          doStandardHighlights();
    37
    38          is.close();
    39      }
    40
    41      static void doStandardHighlights() throws Exception {
    42          Highlighter highlighter = new Highlighter(new MyBolder(), new QueryScorer(query));
    43          System.out.println("Highlighter: " + highlighter + "\nhits-length:" + hits.length());
    44          highlighter.setTextFragmenter(new SimpleFragmenter(20));
    45          for (int i = 0; i < hits.length(); i++) {
    46              System.out.println("URL " + (i + 1) + ": " + hits.doc(i).getField("path").stringValue());
    47              String text = hits.doc(i).get("FIELD_NAME");
    48              int maxNumFragmentsRequired = 2;
    49              String fragmentSeparator = "...";
    50              TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
    51
    52              String result =
    53                  highlighter.getBestFragments(
    54                      tokenStream,
    55                      text,
    56                      maxNumFragmentsRequired,
    57                      fragmentSeparator);
    58              System.out.println("\tfound in: " + result);
    59          }
    60      }
    61
    62      private static class MyBolder implements Formatter {
    63          public String highlightTerm(String originalText, TokenGroup group)
    64          {
    65              if (group.getTotalScore() <= 0)
    66              {
    67                  return originalText;
    68              }
    69              return "" + originalText + "";
    70          }
    71      }
    72
    73  }

Yagnesh N. Shah
Senior Technology Engineer
CS Dept., 4th Floor
H. W. Wilson
950 University Avenue, Bronx NY 10452
(718) 588 8400 x2721
http://www.hwwilson.com
Re: pre computing possible search results narrowing and hit counts on those
Antony Sequeira wrote:
> A user does a search for, say, "condominium", and I show him the 50,000
> properties that meet that description.
> I need two other pieces of information for display:
> 1. I want to show a "select" box on the UI, which contains all the cities
>    that appear in those 50,000 documents.
> 2. Against each city I want to show the count of matching documents.
>
> For example, the drop-down might look like
>    "Los Angeles" 1
>    "San Francisco" 5000
> (But I do not want to show "San Jose" if none of the 50,000 documents
> contain it.)

You can use the FieldCache & HitCollector:

    private class Count { int value; }

    // One entry per document: the value of its "city" field.
    // Both locals are final so the anonymous HitCollector can use them.
    final String[] docToCity = FieldCache.DEFAULT.getStrings(indexReader, "city");
    final Map cityToCount = new HashMap();

    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
          String city = docToCity[doc];
          Count count = (Count) cityToCount.get(city);
          if (count == null) {
            count = new Count();
            cityToCount.put(city, count);
          }
          count.value++;
        }
      });

    // sort & display entries in cityToCount

Doug
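The "sort & display" step left as a comment above could look roughly like the following. This is a self-contained sketch that does not use Lucene: a plain String array stands in for the FieldCache array, an int array stands in for the document ids a HitCollector would receive, and the class name, method names, and sample data are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the counting and display steps; no Lucene calls.
class CityCounts {

    // docToCity[i] is the city of document i (what the FieldCache array would hold).
    // matchingDocs are the ids of the documents that matched the query.
    static Map<String, Integer> count(String[] docToCity, int[] matchingDocs) {
        Map<String, Integer> cityToCount = new HashMap<>();
        for (int doc : matchingDocs) {
            cityToCount.merge(docToCity[doc], 1, Integer::sum);
        }
        return cityToCount;
    }

    // Highest count first; ties are broken alphabetically by city name.
    static List<Map.Entry<String, Integer>> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        String[] docToCity = {
            "Los Angeles", "San Francisco", "San Francisco", "San Jose", "San Francisco"
        };
        int[] matchingDocs = {0, 1, 2, 4}; // document 3 (San Jose) did not match

        for (Map.Entry<String, Integer> e : sorted(count(docToCity, matchingDocs))) {
            System.out.println("\"" + e.getKey() + "\" " + e.getValue());
        }
    }
}
```

Because only cities of matching documents are ever put into the map, a city with no hits (San Jose in the sample data) never shows up in the drop-down.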
searcher question
I have a large index (100GB), and when I run the following code:

    String indexLocation = servlet.getServletContext().getInitParameter("com.lucene.index");
    logger.log(Level.INFO, "got the index location from: " + indexLocation);
    searcher = new IndexSearcher(indexLocation);
    logger.log(Level.INFO, "we created an instance of SearchIndex");

I never see the last message, "we created an instance of SearchIndex"; instead I get java.lang.OutOfMemoryError: Java heap space.

Does anyone have any ideas?
Re: searcher question
Omar Didi wrote:
> I am having a large index (100GB) and when i run the following code :
>
> String indexLocation = servlet.getServletContext().getInitParameter( "com.lucene.index" );
> logger.log( Level.INFO, "got the index location from: " + indexLocation );
> searcher = new IndexSearcher(indexLocation);
> logger.log( Level.INFO, "we created an instance of SearchIndex" );
>
> I never get to see the last message "we created an instance of SearchIndex" and I get
> java.lang.OutOfMemoryError: Java heap space.

How big is your java heap? How much RAM do you have on the machine? How many documents are in the index? What version of Lucene?

You might try calling IndexWriter.setTermIndexInterval(512) and re-optimizing your index. You might need to add and/or delete a document for this to have an effect if the index is already optimized. This method is only in the latest sources, available from subversion. It should dramatically reduce the amount of memory required to open the index.

There are other changes in the latest sources that will also reduce memory requirements, so you may not even need to use IndexWriter.setTermIndexInterval().

Doug
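Why a larger term index interval helps can be sketched with simple arithmetic: only every Nth term of the term dictionary is held in memory when the index is opened, so raising N shrinks that in-memory table proportionally. The sketch below does not call Lucene. The class name and the figure of 100 million unique terms are invented for illustration, and the comparison value of 128 is, as far as I know, Lucene's default interval.

```java
// Back-of-the-envelope estimate only; TermIndexMemory is a hypothetical name.
class TermIndexMemory {

    // Every termIndexInterval-th term is kept in the in-memory term index.
    static long termsHeldInMemory(long totalTerms, int termIndexInterval) {
        return (totalTerms + termIndexInterval - 1) / termIndexInterval; // ceiling division
    }

    public static void main(String[] args) {
        long totalTerms = 100_000_000L; // invented number of unique terms

        long atDefault = termsHeldInMemory(totalTerms, 128); // assumed default interval
        long atSuggested = termsHeldInMemory(totalTerms, 512); // value suggested above

        System.out.println("interval 128: " + atDefault + " terms in memory");
        System.out.println("interval 512: " + atSuggested + " terms in memory");
    }
}
```

With these made-up numbers the in-memory table drops to about a quarter of its size. The trade-off is that each term lookup may have to scan up to four times as many on-disk terms after the in-memory seek.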
RE: searcher question
My Java heap is between 128 and 1024 MB. I have 2GB of RAM and about 10 million documents in the index, which is split into 6 indexes. I am using a MultiSearcher to query the index. I am using Lucene 1.4.3.

-----Original Message-----
From: Doug Cutting
Sent: Wednesday, March 30, 2005 1:59 PM
To: java-user@lucene.apache.org
Subject: Re: searcher question

> How big is your java heap? How much RAM do you have on the machine?
> How many documents are in the index? What version of Lucene?
RE: searcher question
Curious: what kind of search performance are you getting for an index this size?

-----Original Message-----
From: Omar Didi
Sent: Wednesday, March 30, 2005 3:15 PM
To: java-user@lucene.apache.org
Subject: RE: searcher question

> my java heap is between 128 and 1024 MB, I have 2GB of RAM and about 10
> million documents in the index which is broken down to 6 indexes. I am
> using a multi-searcher to query the index. I am using lucene1.4.3.
Re: HTML pages highlighter
How did you index "contents"? If you did not use a stored field type, then that is the issue.

	Erik

On Mar 30, 2005, at 12:31 PM, Yagnesh Shah wrote:
> I am trying to do this using demo example by Keld H. Hansen article
> "Unweaving a Tangled Web HTMLParser and Lucene" but I am getting
> "null" value for text at line #47 Any Idea?
RE: HTML pages highlighter
Hi Erik,

Here is what I used:

    cd /opt/dynamo/prod/hww-doc/hww
    java org.apache.lucene.demo.IndexHTML -create -index help/index help

-----Original Message-----
From: Erik Hatcher
Sent: Wednesday, March 30, 2005 4:01 PM
To: java-user@lucene.apache.org
Subject: Re: HTML pages highlighter

> How did you index "contents"? If you did not use a stored field type,
> then that is the issue.
RE: HTML pages highlighter
Hi Erik,

One more thing: I am using the same HTMLDocument.java that comes with /trunk/src/demo/org/apache/lucene/demo

-----Original Message-----
From: Erik Hatcher
Sent: Wednesday, March 30, 2005 4:01 PM
To: java-user@lucene.apache.org
Subject: Re: HTML pages highlighter

> How did you index "contents"? If you did not use a stored field type,
> then that is the issue.
Re: HTML pages highlighter
On Mar 30, 2005, at 4:17 PM, Yagnesh Shah wrote:
> One more thing, I am using the same HTMLDocument.java that comes with
> /trunk/src/demo/org/apache/lucene/demo

Which does this:

    doc.add(new Field("contents", parser.getReader()));

That is not a stored field. In other words, the original "contents" are not available from the Lucene index. You will have to adjust your indexing code to store the contents, or adjust your highlighting code to pull the contents from the original source again.

	Erik
error when query contains numbers
Hi guys,

I am using a QueryParser to search the index. When the query contains numbers, I don't get any results. Any suggestions?
Newbie question
Newbie question here: is upgrading Lucene as easy as replacing the old JAR file with a newer version's JAR file, or do I need to recompile the application's code?

Thanks,
Luis
RE: HTML pages highlighter
Hi Erik,

I tried to modify it as follows, but I get a compile error:

    doc.add(new Field("contents", parser.getReader(), Field.Store.YES, Field.Index.NO));

Do you have a code snippet showing how the highlighting code can pull the contents from the original source? Or do you know how I can store the field?

-----Original Message-----
From: Erik Hatcher
Sent: Wednesday, March 30, 2005 4:35 PM
To: java-user@lucene.apache.org
Subject: Re: HTML pages highlighter

> Which does this:
>
>    doc.add(new Field("contents", parser.getReader()));
>
> That is not a stored field. In other words, the original "contents"
> are not available from the Lucene index. You will have to adjust your
> indexing code to store the contents, or adjust your highlighting code
> to pull the contents from the original source again.
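One likely reason for the compile error, as far as I know, is that Lucene's Field has no constructor that takes a Reader together with Field.Store: a Reader-valued field is indexed but never stored. Storing the contents therefore means turning the Reader into a String first and building the field from that String. The sketch below covers only the Reader-to-String step and does not use Lucene. The class and method names are hypothetical, and the Field calls mentioned in the comment are from memory of the API of that era, so check them against the Javadoc for your version.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; not part of Lucene or of the demo code.
class ReaderText {

    // Drains the reader and returns everything it produced as one String.
    static String readFully(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // In the demo this Reader would be parser.getReader(); a StringReader stands in here.
        String contents = readFully(new StringReader("Scholarly articles and more"));

        // The String can then go into a field that is both stored and indexed, e.g.
        // (assumption, verify for your version) Field.Text("contents", contents) in 1.4, or
        // new Field("contents", contents, Field.Store.YES, Field.Index.TOKENIZED) in later sources.
        System.out.println(contents);
    }
}
```

Note that Field.Index.NO, as used in the attempt above, would store the text without indexing it, so searches on "contents" would stop matching. Storing full page text also makes the index noticeably larger, which is the cost of not re-reading the original files at highlight time.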
Re: error when query contains numbers
On Mar 30, 2005, at 8:05 PM, Omar Didi wrote:
> The .toString() looks exactly like the query I enter: if I search for
> "yahoo AND 200" it returns 0 hits. I am sure there are documents that
> contain 200. The analyzer I am using is a custom analyzer that has a
> list of stop words. I don't know much about the way the data was
> indexed; I am just developing an application to search using the
> analyzer that was used while indexing.

Try the tips here: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis - you need to analyze your analyzer and ensure that what you think was indexed actually was.

Also, look into using Luke - http://www.getopt.org/luke/ - to see what makes your index tick.

> My concern now is: if there is an error in the way the indexing was
> done, do I have to reindex the documents?

Yes. That's just the nature of how it works. Getting the analysis right is important stuff, and if you didn't index it, you can't search for it!

Feel free to share more details of your analyzer, and we'd be happy to "analyze" it.

	Erik

> thanks
>
> On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
>> I am using a QueryParser to search the index. when the query has
>> numbers, i don t get any results??
>> any suggestions??
>
> What is the .toString of the Query object instance returned from
> QueryParser? What Analyzer are you using? How did you index the
> field(s) being queried?
>
> Erik
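One common way for numbers to go missing is an analyzer built on a letter-based tokenizer, which treats digits as separators so that "200" never reaches the index. Whether the custom analyzer in this thread works that way is an assumption; the sketch below only imitates that behaviour in plain Java so the effect is easy to see. It is not Lucene's code, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified imitation of a letter-only, lowercasing tokenizer. Hypothetical name.
class LetterOnlyTokens {

    // Runs of letters become lowercase tokens; every other character,
    // digits included, only separates tokens and is then discarded.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The number disappears at indexing time, so a later query for 200 cannot match.
        System.out.println(tokenize("Yahoo reported 200 new hires"));
    }
}
```

If Luke shows no numeric terms in the index, something like this is probably happening during analysis, and the documents would need to be reindexed with an analyzer that keeps digits.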
RE: error when query contains numbers
Thanks Erik,

I have looked at the way the documents were indexed, and they use about 90% of the code from chapters 2 and 4 of your book LIA, except for the stop words. I will use Luke first to see whether any numbers were indexed at all.

From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wed 3/30/2005 8:58 PM
To: java-user@lucene.apache.org
Subject: Re: error when query contains numbers

On Mar 30, 2005, at 8:05 PM, Omar Didi wrote:
> The .toString() looks exactly like the query I enter: if I search for
> "yahoo AND 200" it returns 0 hits. I am sure there are documents that
> have 200 hundreds in them. The analyzer I am using is a custom
> analyzer that has a list of stop words. I don't know much about the
> way the data was indexed; I am just developing an application to search
> using the analyzer that was used while indexing.

Try the tips here: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis - you need to analyze your analyzer and ensure that what you think was indexed actually was. Also, look into using Luke - http://www.getopt.org/luke/ - to see what makes your index tick.

> My concern now is: if there is an error in the way the indexing was
> done, do I have to reindex the documents?

Yes. That's just the nature of how it works. Getting the analysis right is important stuff, and if you didn't index it, you can't search for it! Feel free to share more details of your analyzer, and we'd be happy to "analyze" it.

Erik

> Thanks
>
> On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
>> I am using a QueryParser to search the index. When the query has
>> numbers, I don't get any results. Any suggestions?
>
> What is the .toString of the Query object instance returned from
> QueryParser? What Analyzer are you using? How did you index the
> field(s) being queried?
>
> Erik
LUKE [ NEW VERSION ]
Hi guys. Apologies. :(

Can somebody please tell me how to add custom analyzers to the new version of Luke, or is there an existing process for doing so?

Thanks in advance.

With warm regards, have a nice day,
[N.S.KARTHIK]
Re: LUKE [ NEW VERSION ]
Karthik N S wrote:
> Can Somebody Please Tell me How to add Custom Analyzer's to the new
> Version of LUKE ,

The same way as to the old version - you put them on your classpath when you run Luke, like this:

java -cp lukeall.jar;myAnalyzers.jar org.getopt.luke.Luke

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
new bie ..
Hi,

I'm a new Lucene user. I have a few questions regarding indexing and searching.

1) How do I search within tokens? For example, suppose I have the string "my name is abc123". Using the whitespace analyzer I can search for any of these tokens, but when I search for 123 the search returns zero results. How can I search within such tokens or strings? I want the search to return abc123 when I search for either abc or 123, not only when I search for the complete string.

2) I'm fetching records from the database and adding them to the index. How can I update the existing index when I add a new row to, or delete a row from, the database?

Thanks,
pashupathinath.k
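For the first question, one common approach is to index the parts of a token alongside the whole token, so that "abc" and "123" each become searchable terms. The following is a minimal sketch in plain Java of that splitting step only. It is not a Lucene analyzer, the class and method names are hypothetical, and it assumes the split should happen wherever a letter run meets a digit run.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: expands a whitespace-delimited token
// into itself plus its letter-only and digit-only runs.
final class SubTokenSplitter {

    static Set<String> expand(String token) {
        Set<String> out = new LinkedHashSet<>();
        out.add(token); // keep the original so whole-token searches still match
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean alnum = Character.isLetterOrDigit(c);
            // A run ends at punctuation or where letters and digits meet.
            boolean boundary = run.length() > 0
                    && (!alnum || Character.isDigit(c) != Character.isDigit(run.charAt(run.length() - 1)));
            if (boundary) {
                out.add(run.toString());
                run.setLength(0);
            }
            if (alnum) {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    // Splits on whitespace, then expands every token.
    static List<String> indexTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.addAll(expand(token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexTerms("my name is abc123")); // [my, name, is, abc123, abc, 123]
    }
}
```

In a real index this expansion would have to happen at indexing time, inside whatever analyzer or filter produces the terms for the field. The sketch only shows which terms would need to exist for a search on 123 to find abc123.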
Re: pre computing possible search results narrowing and hit counts on those
On Wed, 30 Mar 2005 09:42:32 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Antony Sequeira wrote:
> > A user does a search for, say, "condominium", and I show him the 50,000
> > properties that meet that description.
> >
> > I need two other pieces of information for display:
> > 1. I want to show a "select" box on the UI, which contains all the
> > cities that appear in those 50,000 documents.
> > 2. Against each city I want to show the count of matching documents.
> >
> > For example, the drop-down might look like
> > "Los Angeles" 1
> > "San Francisco" 5000
> >
> > (But I do not want to show "San Jose" if none of the 50,000 documents
> > contain it.)
>
> You can use the FieldCache & HitCollector:
>
> private class Count { int value; }
>
> final String[] docToCity = FieldCache.DEFAULT.getStrings(indexReader, "city");
> final Map<String, Count> cityToCount = new HashMap<String, Count>();
>
> searcher.search(query, new HitCollector() {
>   public void collect(int doc, float score) {
>     String city = docToCity[doc];
>     Count count = cityToCount.get(city);
>     if (count == null) {
>       count = new Count();
>       cityToCount.put(city, count);
>     }
>     count.value++;
>   }
> });
>
> // sort & display entries in cityToCount
>
> Doug

Based on a previous reply, I went through the Javadocs and came up with:

public class PreFilterCollector extends HitCollector {
    final BitVector bits = new BitVector(reader.maxDoc());
    java.util.HashMap<String, Integer> statemap = new java.util.HashMap<String, Integer>();

    public void collect(int id, float score) {
        bits.set(id);
    }

    public java.util.HashMap<String, Integer> getStateCounts() {
        try {
            int k = bits.size();
            int j = 0;
            for (int i = 0; i < k; i++) {
                if (!bits.get(i)) continue;
                Document doc = reader.document(i);
                j++;
                String state = doc.get("state"); // we assume one state for now
                if (statemap.containsKey(state)) {
                    statemap.put(state, statemap.get(state) + 1);
                } else {
                    statemap.put(state, 1);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return statemap;
    }
}

But I have the following questions:

1. My code first collects all the doc ids and then iterates over them to collect field info. I did this because http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html says "This is called in an inner search loop. For good search performance, implementations of this method should not call Searchable.doc(int) or IndexReader.document(int) on every document number encountered". Have I misunderstood this, and am I doing it wrongly?
2. Would your code be faster (and under what circumstances)?
3. One problem I see with my current solution is that it accesses every doc of the result set. One of the previous responses pointed to a solution in http://www.mail-archive.com/java-dev@lucene.apache.org/msg00034.html After reading it, it looked to me like that solution won't be any better. (It looks like it walks the values of terms that do not even occur in the current search result set.) Have I got this right?

I am a newbie to Lucene. Thanks for all the replies. I appreciate them very much.

-Antony
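The difference between the two approaches in this message is where the city or state value comes from for each hit: an in-memory array indexed by document number, or a stored document loaded per hit. The following is a minimal sketch of the array-based counting in plain Java. It uses no Lucene classes: a String array stands in for the per-document field values, a java.util.BitSet stands in for the collected hits, and all names are hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for counting matching documents per field value.
final class FacetCountSketch {

    // docToCity[doc] holds the city of document number doc.
    // matches has one bit set for every document that matched the query.
    static Map<String, Integer> countCities(String[] docToCity, BitSet matches) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by city name for display
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            String city = docToCity[doc]; // array lookup, no document is loaded
            if (city != null) {
                counts.merge(city, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] docToCity = {
            "Los Angeles", "San Francisco", "San Jose", "San Francisco", "San Francisco"
        };
        BitSet matches = new BitSet(docToCity.length);
        matches.set(0);
        matches.set(1);
        matches.set(3);
        matches.set(4);
        // Document 2 (San Jose) did not match, so it never appears in the counts.
        System.out.println(countCities(docToCity, matches)); // {Los Angeles=1, San Francisco=3}
    }
}
```

Only cities that occur in at least one matching document end up in the map, which is the behaviour asked for in the original question. The cost of this design is the memory needed to hold one value per document in the array.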
Re: error when query contains numbers
On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
> I am using a QueryParser to search the index. When the query has
> numbers, I don't get any results. Any suggestions?

What is the .toString of the Query object instance returned from QueryParser? What Analyzer are you using? How did you index the field(s) being queried?

Erik
Re: Newbie question
On Mar 30, 2005, at 4:42 PM, Luis Medina wrote:
> Newbie question here: is upgrading Lucene as easy as replacing the old
> JAR file with a newer version's JAR file, or do I need to recompile the
> application's code?

Try it and see :) It should work fine by replacing the JAR, with no recompilation necessary. The more important question is whether you need to reindex. Most likely not, but some versions of Lucene have changed how some factors are computed, and a reindex is needed in those cases to keep things in sync.

Erik
Re: HTML pages highlighter
On Mar 30, 2005, at 4:46 PM, Yagnesh Shah wrote:
> Hi! Eric,

Erik - with a 'k' - Sorry, I let it slide once though :)

> I tried to modify that with this, but I get a compile error. Do you have
> any code snippet of highlighting code to pull the contents from the
> original source?

I have a whole book full of code examples :) http://www.lucenebook.com - Grab the source code and look in src/lia/tools at Highlight*.java

> or Do you know how I can do field store?
> doc.add(new Field("contents", parser.getReader(), Field.Store.YES, Field.Index.NO));

You cannot store it with a Reader. You need to use Field.Text(String, String), or one of the other variations.

Erik
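Since the field has to be built from a String rather than a Reader in order to be stored, one option is to drain the Reader into a String first and pass that String on. The following is a minimal sketch of the draining step in plain Java. The class and method names are hypothetical, and a StringReader stands in for whatever parser.getReader() returns in the post above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: reads everything a Reader has to offer into one String.
final class ReaderText {

    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTML parser's reader.
        Reader contents = new StringReader("scholarly articles and journals");
        String text = readAll(contents);
        System.out.println(text); // prints: scholarly articles and journals
    }
}
```

The resulting String is what would then be handed to Field.Text(String, String) as suggested in the reply. Note that a Reader can only be consumed once, so the same String should be reused for anything else that needs the text, such as highlighting.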
RE: error when query contains numbers
The .toString() looks exactly like the query I enter: if I search for "yahoo AND 200" it returns 0 hits. I am sure there are documents that have 200 hundreds in them. The analyzer I am using is a custom analyzer that has a list of stop words. I don't know much about the way the data was indexed; I am just developing an application to search using the analyzer that was used while indexing.

My concern now is: if there is an error in the way the indexing was done, do I have to reindex the documents?

Thanks

On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
> I am using a QueryParser to search the index. When the query has
> numbers, I don't get any results. Any suggestions?

What is the .toString of the Query object instance returned from QueryParser? What Analyzer are you using? How did you index the field(s) being queried?

Erik