Re: Seeking advice on index parameter settings for large index

2005-03-30 Thread Doug Cutting
Chuck Williams wrote:
>    index.setMaxBufferedDocs(10);  // Buffer 10 documents at a time
> in memory (they could be big)

You might use a larger value here for the index with the small
documents.  I've successfully used values as high as 1000 when indexing
documents that average a few kilobytes, with a heap of a few hundred
megabytes.  This can make indexing a lot faster.  Note that this is the
number of single-document indexes that are buffered, not document text.
Indexes are typically smaller than the text.

>    index.setMaxMergeDocs(10);  // Yields about 75 large segments
> for 7.5 million docs (plus log2 smaller segments) = 100 total

This is reasonable while incrementally indexing, in order to bound the
delay while adding documents.  But I would use Integer.MAX_VALUE during
the initial build.  75 segments are much slower to search than one
segment.  I think this is also a realistic assumption for most systems
that are incrementally updated.  For example, if you have "scheduled
downtime" you can optimize the index.  Or perhaps you can optimize at
midnight every night, queueing updates while this operates.  If there's
never downtime, and updates must always be prompt, you can, as a
background process, periodically copy the index, optimize it and apply
queued updates until it is in sync with the live index, then swap them.
There are lots of ways to implement this, but, in short, you should
never need to have 75 segments, but only ever 1 +
log2(#updates_since_optimize).
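
To get a feel for the 1 + log2(#updates_since_optimize) estimate, here
is a minimal arithmetic sketch in plain Java.  It does not touch Lucene;
the class and method names are made up for illustration, and rounding
log2 down to a whole number is an assumption, since the estimate itself
is only approximate.

```java
class SegmentBound {
    // Rough segment-count estimate: one optimized segment plus about
    // log2(updates) smaller ones. Rounding down is an assumption.
    static int segmentBound(long updatesSinceOptimize) {
        if (updatesSinceOptimize <= 0) {
            return 1; // freshly optimized index: a single segment
        }
        // floor(log2(n)) for n >= 1
        int log2 = 63 - Long.numberOfLeadingZeros(updatesSinceOptimize);
        return 1 + log2;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1000L, 100000L};
        for (long n : samples) {
            System.out.println(n + " updates since optimize -> about "
                    + segmentBound(n) + " segments");
        }
    }
}
```

Even after 100,000 updates the estimate stays well below the 75
segments discussed above.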

>    index.setUseCompoundFile(true);  // false could improve
> performance but will consume more file handles

If you don't have 75 big segments, then you can probably afford to set
this false.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


HTML pages highlighter

2005-03-30 Thread Yagnesh Shah
Hello Lucene-User,
Has anyone tried highlighting with HTML pages?

I am trying to do this using the demo example from Keld H. Hansen's article
"Unweaving a Tangled Web HTMLParser and Lucene", but I am getting a "null"
value for text at line #47.  Any ideas?

  1 package org.apache.lucene.search.highlight;
  2
  3 import java.io.StringReader;
  4
  5 import org.apache.lucene.analysis.Analyzer;
  6 import org.apache.lucene.analysis.TokenStream;
  7 import org.apache.lucene.analysis.standard.StandardAnalyzer;
  8 import org.apache.lucene.queryParser.QueryParser;
  9 import org.apache.lucene.search.Hits;
 10 import org.apache.lucene.search.IndexSearcher;
 11 import org.apache.lucene.search.Query;
 12 import org.apache.lucene.search.highlight.Formatter;
 13 import org.apache.lucene.search.highlight.Highlighter;
 14 import org.apache.lucene.search.highlight.QueryScorer;
 15 import org.apache.lucene.search.highlight.SimpleFragmenter;
 16
 17 public class Searcher {
 18
 19    static Query query;
 20    static Hits hits;
 21
 22    private static final String FIELD_NAME = "contents";
 23    private static final String indexDir = "/opt/dynamo/prod/hww-doc/hww/help/index";
 24
 25    private static Analyzer analyzer = new StandardAnalyzer();
 26
 27    public static void main(String[] args) throws Exception {
 28
 29       IndexSearcher is = new IndexSearcher(indexDir);
 30       String searchCriteria = "scholarly";
 31       query = QueryParser.parse(searchCriteria, "contents", analyzer);
 32
 33       hits = is.search(query);
 34       System.out.println("found in: " + query + "\nhits-length:" + hits.length());
 35
 36       doStandardHighlights();
 37
 38       is.close();
 39    }
 40
 41    static void doStandardHighlights() throws Exception {
 42       Highlighter highlighter = new Highlighter(new MyBolder(), new QueryScorer(query));
 43       System.out.println("Highlighter: " + highlighter + "\nhits-length:" + hits.length());
 44       highlighter.setTextFragmenter(new SimpleFragmenter(20));
 45       for (int i = 0; i < hits.length(); i++) {
 46          System.out.println("URL " + (i + 1) + ": " + hits.doc(i).getField("path").stringValue());
 47          String text = hits.doc(i).get("FIELD_NAME");
 48          int maxNumFragmentsRequired = 2;
 49          String fragmentSeparator = "...";
 50          TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
 51
 52          String result =
 53             highlighter.getBestFragments(
 54                tokenStream,
 55                text,
 56                maxNumFragmentsRequired,
 57                fragmentSeparator);
 58          System.out.println("\tfound in: " + result);
 59       }
 60    }
 61
 62    private static class MyBolder implements Formatter {
 63       public String highlightTerm(String originalText, TokenGroup group)
 64       {
 65          if (group.getTotalScore() <= 0)
 66          {
 67             return originalText;
 68          }
 69          return "" + originalText + "";
 70       }
 71    }
 72
 73 }

Yagnesh N. Shah 
Senior Technology Engineer 
CS Dept., 4th Floor
H. W. Wilson 
950 University Avenue, 
Bronx NY 10452 
(718) 588 8400 x2721 
http://www.hwwilson.com

 




Re: pre computing possible search results narrowing and hit counts on those

2005-03-30 Thread Doug Cutting
Antony Sequeira wrote:
> A user does a search for say "condominium", and i show him the 50,000
> properties that meet that description.
> I need two other pieces of information for display -
> 1. I want to show a "select" box on the UI, which contains all the
> cities that appear in those 50,000 documents
> 2. Against each city I want to show the count of matching documents.
> For example the drop down might look like
> "Los Angeles"  1
> "San Francisco" 5000
> (But, I do not want to show "San Jose" if none of the 50,000 documents
> contain it)

You can use the FieldCache & HitCollector:

private class Count { int value; }

// final so the anonymous HitCollector below can refer to them
final String[] docToCity = FieldCache.DEFAULT.getStrings(indexReader, "city");
final Map cityToCount = new HashMap();

searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    String city = docToCity[doc];
    Count count = (Count) cityToCount.get(city);
    if (count == null) {
      count = new Count();
      cityToCount.put(city, count);
    }
    count.value++;
  }
});

// sort & display entries in cityToCount
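
The counting idea can be tried without an index at hand.  The sketch
below is plain modern Java and only mimics the technique: the docToCity
array stands in for what the field cache returns, the array of matching
document numbers stands in for the calls a collector would receive, and
all names and data are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class CityFacetSketch {
    // docToCity[doc] is the city of document number doc.
    // matchingDocs lists the document numbers the query matched.
    static Map<String, Integer> countByCity(String[] docToCity, int[] matchingDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            // add 1 to the city's count, starting from 1 if unseen
            counts.merge(docToCity[doc], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] docToCity = {
            "Los Angeles", "San Francisco", "San Jose", "San Francisco", "Los Angeles"
        };
        int[] matchingDocs = {0, 1, 3}; // doc 2 (San Jose) did not match
        // TreeMap sorts the entries by city name for display
        Map<String, Integer> sorted = new TreeMap<>(countByCity(docToCity, matchingDocs));
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            System.out.println(e.getKey() + "  " + e.getValue());
        }
    }
}
```

Cities with no matching documents never enter the map, so "San Jose"
is left out of the list without any extra filtering.
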
Doug


searcher question

2005-03-30 Thread Omar Didi
I have a large index (100 GB), and when I run the following code:

String indexLocation = servlet.getServletContext().getInitParameter( "com.lucene.index" );
logger.log( Level.INFO, "got the index location from:  " + indexLocation );
searcher = new IndexSearcher(indexLocation);
logger.log( Level.INFO, "we created an instance of  SearchIndex" );

I never see the last message, "we created an instance of  SearchIndex".
Instead I get
java.lang.OutOfMemoryError: Java heap space.

Does anyone have any ideas?






Re: searcher question

2005-03-30 Thread Doug Cutting
Omar Didi wrote:
> I am having a large index (100GB) and when i run the following code :
>
> String indexLocation = servlet.getServletContext().getInitParameter(
> "com.lucene.index" );
> logger.log( Level.INFO, "got the index location from:  " + indexLocation );
> searcher = new IndexSearcher(indexLocation);
> logger.log( Level.INFO, "we created an instance of  SearchIndex" );
>
> I never get to see the last message "we created an instance of  SearchIndex"
> and I get
> java.lang.OutOfMemoryError: Java heap space.

How big is your java heap?  How much RAM do you have on the machine? 
How many documents are in the index?  What version of Lucene?

You might try calling IndexWriter.setTermIndexInterval(512) and 
re-optimizing your index.  You might need to add and/or delete a 
document for this to have an effect if the index is already optimized. 
This method is only in the latest sources, available from subversion. 
It should dramatically reduce the amount of memory required to open the 
index.  There are other changes in the latest sources that will also 
reduce memory requirements, so you may not even need to use 
IndexWriter.setTermIndexInterval().
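
As a rough illustration of why a larger interval helps: the searcher
keeps about every interval-th term of the term dictionary in memory, so
a larger interval means fewer terms loaded when the index is opened.
The plain-Java sketch below only does that arithmetic.  The total term
count is a made-up figure, the comparison value of 128 is assumed to be
the default interval, and the class and method names are hypothetical.

```java
class TermIndexEstimate {
    // Number of term-dictionary entries held in memory when every
    // interval-th term is kept (rounded up).
    static long termsInMemory(long totalTerms, int interval) {
        return (totalTerms + interval - 1) / interval;
    }

    public static void main(String[] args) {
        long totalTerms = 50_000_000L; // hypothetical, not a figure from this thread
        System.out.println("interval 128: " + termsInMemory(totalTerms, 128) + " terms in memory");
        System.out.println("interval 512: " + termsInMemory(totalTerms, 512) + " terms in memory");
    }
}
```

Going from 128 to 512 cuts the in-memory term count to about a quarter,
at the cost of somewhat slower term lookups.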

Doug


RE: searcher question

2005-03-30 Thread Omar Didi
My Java heap is between 128 and 1024 MB.  I have 2 GB of RAM and about 10
million documents in the index, which is split into 6 indexes.  I am using a
multi-searcher to query them.  I am using Lucene 1.4.3.



RE: searcher question

2005-03-30 Thread Michael Celona
Curious: what kind of search performance are you getting for an index this
size?




Re: HTML pages highlighter

2005-03-30 Thread Erik Hatcher
How did you index "contents"?  If you did not use a stored field type, 
then that is the issue.

Erik


RE: HTML pages highlighter

2005-03-30 Thread Yagnesh Shah
Hi! Erik,
Here is what I used :
cd /opt/dynamo/prod/hww-doc/hww
java org.apache.lucene.demo.IndexHTML -create -index help/index help




RE: HTML pages highlighter

2005-03-30 Thread Yagnesh Shah
Hi Erik,
One more thing, I am using the same HTMLDocument.java that comes with 
/trunk/src/demo/org/apache/lucene/demo




Re: HTML pages highlighter

2005-03-30 Thread Erik Hatcher
On Mar 30, 2005, at 4:17 PM, Yagnesh Shah wrote:
> Hi Erik,
>   One more thing, I am using the same HTMLDocument.java that comes with
> /trunk/src/demo/org/apache/lucene/demo

Which does this:

  doc.add(new Field("contents", parser.getReader()));

That is not a stored field.  In other words, the original "contents" 
are not available from the Lucene index.   You will have to adjust your 
indexing code to store the contents, or adjust your highlighting code 
to pull the contents from the original source again.
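
For the first option, one possible approach (a sketch under
assumptions, not the demo's own code) is to drain the Reader into a
String first, since a Reader-valued field cannot be stored.  The helper
below is plain Java; its class and method names are hypothetical, and
the Field constructor mentioned in the comment is an assumption that
should be checked against the Lucene version in use.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderText {
    // Drains a Reader into a String so the text can be kept as a field value.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for parser.getReader() from the demo
        String text = readAll(new StringReader("scholarly journals and indexes"));
        System.out.println(text);
        // With the text in hand, a stored and tokenized field could be added,
        // for example (assumed API, verify for your Lucene version):
        //   doc.add(new Field("contents", text, Field.Store.YES, Field.Index.TOKENIZED));
    }
}
```

Separately, line 47 of the posted listing passes the string literal
"FIELD_NAME" rather than the constant FIELD_NAME, so get() looks for a
field with that literal name and returns null even once "contents" is
stored.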

Erik


error when query contains numbers

2005-03-30 Thread Omar Didi
Hi guys,

I am using a QueryParser to search the index.  When the query contains
numbers, I don't get any results.
Any suggestions?






Newbie question

2005-03-30 Thread Luis Medina
Newbie question here:
Is upgrading Lucene as easy as replacing the old JAR file with a newer
version's JAR file?  Or do I need to recompile the application's code?

Thanks,
Luis





RE: HTML pages highlighter

2005-03-30 Thread Yagnesh Shah
Hi Erik,
I tried to modify that line as shown below, but I get a compile error.  Do you
have a code snippet of highlighting code that pulls the contents from the
original source?  Or do you know how I can store the field?

  doc.add(new Field("contents", parser.getReader(), Field.Store.YES, Field.Index.NO));



Re: error when query contains numbers

2005-03-30 Thread Erik Hatcher
On Mar 30, 2005, at 8:05 PM, Omar Didi wrote:
> The .toString() looks exactly like the query I enter: if I search for
> "yahoo AND 200" it returns 0 hits. I am sure there are documents that
> have 200 in them. The analyzer I am using is a custom
> analyzer that has a list of stop words. I don't know much about the
> way the data was indexed; I am just developing an application to search
> using the analyzer that was used while indexing.

Try the tips here: 
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis - you need to 
analyze your analyzer and ensure what you think was indexed actually 
was.  Also, look into using Luke - http://www.getopt.org/luke/ - to see 
what makes your index tick.
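
As a rough illustration of why a numeric term can go missing: the sketch below is plain Java, not Lucene. The thread never says how the custom analyzer tokenizes, so the letters-only splitting is purely an assumption, and `LetterOnlyAnalysis` is a hypothetical name. It only shows that if analysis drops digits at index time, a term like 200 is never in the index to be found.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an analyzer chain; NOT a Lucene class.
public class LetterOnlyAnalysis {

    // Splits text into lowercase runs of letters, silently dropping
    // digits and punctuation (assumed behaviour, for illustration only).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // "200" never becomes a token, so no query can match it.
        System.out.println(tokens("Yahoo reported 200 new hires"));
        // prints [yahoo, reported, new, hires]
    }
}
```

Running the real analyzer over a sample document and printing its tokens in the same way would show whether numbers survive analysis.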

> My concern now is: if there is an error in the way the indexing was
> done, do I have to reindex the documents?

Yes.  That's just the nature of how it works.  Getting the analysis 
right is important stuff, and if you didn't index it, you can't search 
for it!

Feel free to share more details of your analyzer, and we'd be happy to 
"analyze" it.

Erik

> Thanks
>
> On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
>> I am using a QueryParser to search the index. When the query has
>> numbers, I don't get any results.
>> Any suggestions?
>
> What is the .toString of the Query object instance returned from
> QueryParser?  What Analyzer are you using?  How did you index the
> field(s) being queried?
>
>   Erik


RE: error when query contains numbers

2005-03-30 Thread Omar Didi
Thanks Erik, I have looked at the way the documents were indexed, and they 
use about 90% of the code from chapters 2 and 4 of your book LIA, except for 
the stop words. 
I will try Luke first to see whether any numbers were indexed.
 




LUKE [ NEW VERSION ]

2005-03-30 Thread Karthik N S



Hi Guys,
Apologies. :(

Can somebody please tell me how to add custom analyzers to the
new version of Luke, or is there an existing process for doing so?

Thanks in advance.
With warm regards, have a nice day,
[N.S.KARTHIK]


Re: LUKE [ NEW VERSION ]

2005-03-30 Thread Andrzej Bialecki
Karthik N S wrote:
> Can somebody please tell me how to add custom analyzers to the
> new version of Luke?

The same way as with the old version: you put them on your classpath when 
you run Luke, like this:

java -cp lukeall.jar;myAnalyzers.jar org.getopt.luke.Luke

(The ';' separator is the Windows form; on Unix-like systems use ':' instead.)
--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


new bie ..

2005-03-30 Thread pashupathinath
Hi,
  I'm a new Lucene user. I have a few questions regarding
indexing and searching.
  1) How do I search within tokens? For example, suppose I have
the string "my name is abc123". Using the whitespace
analyzer I can search for any of these whole tokens, but
when I search for 123 the search returns zero results.
How can I search within such tokens? I want the
search to return abc123 when I search for either abc
or 123, not only for the complete string.
  2) I'm fetching records from the database and adding
them to the index. How can I update the existing index
when I add a new row to, or delete a row from, the
database?
  

Thanks,
pashupathinath.k





Re: pre computing possible search results narrowing and hit counts on those

2005-03-30 Thread Antony Sequeira
On Wed, 30 Mar 2005 09:42:32 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Antony Sequeira wrote:
> > A user does a search for say "condominium", and i show him the 50,000
> > properties that meet that description.
> >
> > I need two other pieces of information for display -
> > 1. I want to show a "select" box on the UI, which contains all the
> > cities that appear in those 50,000 documents
> > 2. Against each city I want to show the count of matching documents.
> >
> > For example the drop down might look like
> > "Los Angeles"  1
> > "San Francisco" 5000
> >
> > (But, I do not want to show "San Jose" if none of the 50,000 documents
> > contain it)
> 
> You can use the FieldCache & HitCollector:
> 
> private class Count { int value; }
> 
> // final, so the anonymous collector below can use them
> final String[] docToCity = FieldCache.DEFAULT.getStrings(indexReader, "city");
> final Map<String, Count> cityToCount = new HashMap<String, Count>();
> 
> searcher.search(query, new HitCollector() {
>   public void collect(int doc, float score) {
>     String city = docToCity[doc];
>     Count count = cityToCount.get(city);
>     if (count == null) {
>       count = new Count();
>       cityToCount.put(city, count);
>     }
>     count.value++;
>   }
> });
> 
> // sort & display entries in cityToCount
> 
> Doug
> 
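
The per-hit counting in the quoted snippet can be sketched without Lucene roughly as follows. This is only an illustration under stated assumptions: the `docToCity` array stands in for what a field cache would hand back, the `hits` array stands in for the doc ids a collector would receive, and `CityCounts` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; no Lucene classes are used here.
public class CityCounts {

    // docToCity[doc] holds the city of document number doc, playing the
    // role of an in-memory field cache. hits plays the role of the doc
    // ids that a collector would be called with.
    static Map<String, Integer> count(String[] docToCity, int[] hits) {
        Map<String, Integer> cityToCount = new TreeMap<>(); // sorted for display
        for (int doc : hits) {
            // One array lookup per hit; no stored document is loaded.
            cityToCount.merge(docToCity[doc], 1, Integer::sum);
        }
        return cityToCount;
    }

    public static void main(String[] args) {
        String[] docToCity = {"Los Angeles", "San Jose", "San Francisco", "Los Angeles"};
        int[] hits = {0, 2, 3}; // document 1 (San Jose) did not match
        System.out.println(count(docToCity, hits));
        // prints {Los Angeles=2, San Francisco=1}
    }
}
```

A city with no matching documents never gets an entry, so it would not appear in the drop-down.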
Based on a previous reply, I went through the Javadocs and came up with

public class PreFilterCollector extends HitCollector {
    // reader: an IndexReader defined outside this class (not shown)
    final BitVector bits = new BitVector(reader.maxDoc());
    java.util.HashMap<String, Integer> statemap =
        new java.util.HashMap<String, Integer>();

    // Only record the doc id here; documents are loaded later.
    public void collect(int id, float score) {
        bits.set(id);
    }

    public java.util.HashMap<String, Integer> getStateCounts() {
        try {
            int k = bits.size();
            for (int i = 0; i < k; i++) {
                if (!bits.get(i))
                    continue;
                Document doc = reader.document(i);
                String state = doc.get("state"); // we assume one state for now
                if (statemap.containsKey(state)) {
                    statemap.put(state, statemap.get(state) + 1);
                } else {
                    statemap.put(state, 1);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return statemap;
    }
}

But I have the following questions:
1. My code first collects all the doc ids and then iterates over them
to collect field info. I did this because
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html
says "This is called in an inner search loop. For good search
performance, implementations of this method should not call
Searchable.doc(int) or IndexReader.document(int) on every document
number encountered".
Have I misunderstood this, and am I doing it wrong?

2. Would your code be faster (and under what circumstances)?

3. One problem I see with my current solution is that it accesses
every doc of the result set.
One of the previous responses pointed to a solution in
http://www.mail-archive.com/java-dev@lucene.apache.org/msg00034.html
After reading it, it looked to me like that solution won't be any
better. (It looks like it walks values of terms that do not even occur in
the current search result set.) Have I got this right?


I am a newbie to Lucene. Thanks for all the replies; I appreciate them very much.

-Antony



Re: error when query contains numbers

2005-03-30 Thread Erik Hatcher
On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
> I am using a QueryParser to search the index. When the query has
> numbers, I don't get any results.
> Any suggestions?

What is the .toString of the Query object instance returned from 
QueryParser?  What Analyzer are you using?  How did you index the 
field(s) being queried?

Erik


Re: Newbie question

2005-03-30 Thread Erik Hatcher
On Mar 30, 2005, at 4:42 PM, Luis Medina wrote:
> Newbie question here,
> is upgrading Lucene as easy as replacing the old Jar file with a newer
> version's Jar file? or do I need to recompile the application's code?

Try it and see :)
It should work fine by replacing the JAR, with no recompilation 
necessary.  The more important question is do you need to reindex.  
Most likely not, but there have been some versions of Lucene that have 
changed how some factors were computed and a reindex is needed in those 
cases to keep things in sync.

Erik


Re: HTML pages highlighter

2005-03-30 Thread Erik Hatcher
On Mar 30, 2005, at 4:46 PM, Yagnesh Shah wrote:
> Hi! Eric,

Erik - with a 'k' - Sorry, I let it slide once though :)

> I tried to modify that with this but I get a compile error. Do you have
> any code snippet of highlighting code to pull the contents from the
> original source?

I have a whole book full of code examples :)
http://www.lucenebook.com - Grab the source code and look in
src/lia/tools at Highlight*.java

> Or do you know how I can store the field?
>   doc.add(new Field("contents", parser.getReader(),
> Field.Store.YES, Field.Index.NO));

You cannot store it with a Reader.  You need to use Field.Text(String, 
String), or one of the other variations.
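
A minimal sketch of one way to turn a Reader into a String so the text can be stored. It is plain Java; `ReaderToString` and `readFully` are hypothetical names, the StringReader merely stands in for the parser's reader from the question, and the Lucene call appears only in a comment.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Lucene.
public class ReaderToString {

    // Drains the reader into a String so the text can be stored in a field.
    static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for the HTML parser's reader.
        String contents = readFully(new StringReader("scholarly articles"));
        // With Lucene on the classpath, the string could then be passed to
        // Field.Text("contents", contents) as suggested above.
        System.out.println(contents);
    }
}
```

Note that this holds the whole document text in memory, which may matter for very large pages.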

Erik


RE: error when query contains numbers

2005-03-30 Thread Omar Didi

The .toString() looks exactly like the query I enter: if I search for "yahoo 
AND 200" it returns 0 hits. I am sure there are documents that have 200 
in them. The analyzer I am using is a custom analyzer that has a list 
of stop words. I don't know much about the way the data was indexed; I am just 
developing an application to search using the analyzer that was used while 
indexing.
My concern now is: if there is an error in the way the indexing was done, do I 
have to reindex the documents?
Thanks

On Mar 30, 2005, at 4:41 PM, Omar Didi wrote:
> I am using a QueryParser to search the index. When the query has 
> numbers, I don't get any results.
> Any suggestions?

What is the .toString of the Query object instance returned from 
QueryParser?  What Analyzer are you using?  How did you index the 
field(s) being queried?

Erik

