Re: Search speed
Jeff Munson wrote:
> Single word searches return pretty fast, but when I try phrases,
> searching seems to slow considerably.
> [ ... ]
> However, if I use this query, contents:"all parts including picture tube
> guaranteed", it returns hits in 2890 milliseconds. Other phrases take
> longer as well.

You could use an analyzer that inserts bigrams for common terms. Nutch does this. So, if you declare that "all" and "including" are common terms, then this could be tokenized as the following tokens:

0 - all all.parts
1 - parts parts.including
2 - including including.picture
3 - picture
4 - tube
5 - guaranteed

Where two tokens are listed at a position, the second has a position increment of zero. Then your phrase search could be converted to:

"all.parts parts.including including.picture picture tube guaranteed"

which should be much faster, since it has replaced common terms with rare terms.

This approach does make the index larger, and hence makes indexing somewhat slower. So you don't want to declare too many words as common, but a handful can make a big difference if they're used frequently in queries.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
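The tokenization described above can be sketched in plain Java without any Lucene classes. This is only an illustration under assumptions: the class and method names (CommonTermBigrams, tokenize, rewritePhrase) are hypothetical, whitespace splitting stands in for a real analyzer, and the rule used (emit a bigram whenever a word or its successor is a common term) is inferred from the token listing in the message rather than taken from Nutch's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not Lucene's or Nutch's actual analyzer.
final class CommonTermBigrams {

    // For each position, the tokens that would be indexed there. A second
    // token at a position corresponds to a position increment of zero.
    static List<List<String>> tokenize(String text, Set<String> common) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> atPosition = new ArrayList<>();
            atPosition.add(words[i]);
            boolean hasNext = i + 1 < words.length;
            // Assumed rule: add a bigram when this word or the next is common.
            if (hasNext && (common.contains(words[i]) || common.contains(words[i + 1]))) {
                atPosition.add(words[i] + "." + words[i + 1]);
            }
            positions.add(atPosition);
        }
        return positions;
    }

    // Query side: at each position prefer the bigram (the rarer term) if one exists.
    static String rewritePhrase(String phrase, Set<String> common) {
        List<String> terms = new ArrayList<>();
        for (List<String> atPosition : tokenize(phrase, common)) {
            terms.add(atPosition.get(atPosition.size() - 1));
        }
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("all", "including"));
        String phrase = "all parts including picture tube guaranteed";
        List<List<String>> positions = tokenize(phrase, common);
        for (int i = 0; i < positions.size(); i++) {
            System.out.println(i + " - " + String.join(" ", positions.get(i)));
        }
        System.out.println(rewritePhrase(phrase, common));
    }
}
```

In a real index the same set of common terms has to be used at indexing time and at query time, otherwise the rewritten phrase will look for bigrams that were never indexed.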
Re: Search speed
If you know all the phrases you are going to search for, you could modify an analyzer to turn those phrases into whole terms at analysis time.

Other than that, you can test the speed of breaking the phrase query up into term queries. You would AND together all the words in the phrase, get the documents that match all the terms, and then do a substring search for the exact phrase in each of them. Only the documents that pass the substring check are returned.

search: death && notice
for each hit
    if contents contains "death notice"
        add hit to final result list
loop

On Tue, 2 Nov 2004 18:07:26 +0100, Paul Elschot <[EMAIL PROTECTED]> wrote:
> If you know the phrases in advance, ie. before indexing, you can index
> and search them as terms with a special purpose analyzer.
> It's an unusual solution, though.
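The two-step idea in the pseudocode above could look roughly like this in Java. It is a sketch under assumptions, not Lucene code: an in-memory map from document id to stored text stands in for the index, matchAllWords stands in for the AND query, and all names (PhrasePostFilter, matchAllWords, phraseSearch) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "AND the words, then substring-check the candidates".
final class PhrasePostFilter {

    // Step 1 (stand-in for the AND query): ids of documents containing every word.
    static List<String> matchAllWords(Map<String, String> docs, String[] words) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            List<String> tokens = Arrays.asList(entry.getValue().toLowerCase().split("\\W+"));
            if (tokens.containsAll(Arrays.asList(words))) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    // Step 2: keep only the candidates whose stored text contains the exact phrase.
    static List<String> phraseSearch(Map<String, String> docs, String phrase) {
        String needle = phrase.toLowerCase();
        String[] words = needle.split("\\s+");
        List<String> result = new ArrayList<>();
        for (String id : matchAllWords(docs, words)) {
            if (docs.get(id).toLowerCase().contains(needle)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a", "Death notice published on Tuesday");
        docs.put("b", "Notice: the death of the picture tube");
        docs.put("c", "World War II");
        System.out.println(phraseSearch(docs, "death notice"));
    }
}
```

Two caveats apply to this approach. A plain substring test is looser than a token-level phrase match (it would also accept "death notices") and is sensitive to punctuation and whitespace between the words. It also needs the stored field text for every candidate, so whether it beats a phrase query on 5K to 35K bodies is something to measure rather than assume.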
Re: Search speed
On Tuesday 02 November 2004 17:50, Jeff Munson wrote:
> Thanks for the info Paul. The requirements of my search engine are that
> I need to search for phrases like "death notice" or "world war ii". You
> suggested that I break the phrases into words. Is there a way to break
> the phrases into words, do the search, and just return the documents
> with the phrase? I'm just looking for a way to speed up the phrase
> searches.

If you know the phrases in advance, i.e. before indexing, you can index and search them as terms with a special purpose analyzer. It's an unusual solution, though.

Regards,
Paul Elschot
RE: Search speed
Thanks for the info Paul. The requirements of my search engine are that I need to search for phrases like "death notice" or "world war ii". You suggested that I break the phrases into words. Is there a way to break the phrases into words, do the search, and just return the documents with the phrase? I'm just looking for a way to speed up the phrase searches.

-Original Message-
From: Paul Elschot [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 02, 2004 2:05 AM
To: [EMAIL PROTECTED]
Subject: Re: Search speed

[ ... ]
One way to avoid the disk head seeks is to use fewer terms in the phrases. Another way is to avoid using the term positions by querying for words instead of phrases.
[ ... ]
Re: Search speed
On Monday 01 November 2004 21:02, Jeff Munson wrote:
> I'm looking for tips on speeding up searches since I am a relatively new
> user of Lucene.
> [ ... ]
> However, if I use this query, contents:"all parts including picture tube
> guaranteed", it returns hits in 2890 milliseconds. Other phrases take
> longer as well.
>
> My question is, are there any indexing tips (storing term vectors?) or
> query tips that I can use to speed up the searching of phrases?

Term vectors should not influence search times for phrases.

What you're seeing is this: for each term in your query Lucene has to walk all the documents containing the term. For a single term there is no speed problem because the document set for the term is stored in a compact way on disk. For multiple terms with large document sets, the disk head needs to move between the document sets of the terms, because all sets need to be walked synchronously over the documents to compute the document scores. For phrases, even more disk accesses are needed to read the term positions within the documents. These disk head seeks are normally what degrades performance.

One way to avoid the disk head seeks is to use fewer terms in the phrases. Another way is to avoid using the term positions by querying for words instead of phrases.

If you have the hardware resources, there are more options, like using faster disks and/or using RAM for critical parts of the index. Lucene can use extra RAM in various ways; configuring that may require some Java coding. Profiling can guide you there.

Regards,
Paul Elschot
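To make the "walked synchronously" point concrete, here is a simplified in-memory model of a two-term phrase match in Java. It is a sketch under assumptions and not Lucene's on-disk format or scorer: the Postings class, its fields, and the method names are hypothetical, each term's document set is a sorted array of document ids, and positions are only consulted for documents that contain both terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of per-term document sets with positions.
final class PostingsWalk {

    static final class Postings {
        final int[] docIds;      // sorted ascending
        final int[][] positions; // positions[i]: sorted positions of the term in docIds[i]

        Postings(int[] docIds, int[][] positions) {
            this.docIds = docIds;
            this.positions = positions;
        }
    }

    // Walk both document sets in step; on a shared document, do the extra
    // positions lookup that a phrase query needs.
    static List<Integer> phraseMatches(Postings first, Postings second) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < first.docIds.length && j < second.docIds.length) {
            int a = first.docIds[i];
            int b = second.docIds[j];
            if (a < b) {
                i++;
            } else if (a > b) {
                j++;
            } else {
                if (adjacent(first.positions[i], second.positions[j])) {
                    hits.add(a);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    // True if the second term occurs directly after some occurrence of the first.
    static boolean adjacent(int[] firstPositions, int[] secondPositions) {
        for (int p : firstPositions) {
            if (Arrays.binarySearch(secondPositions, p + 1) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Postings death = new Postings(new int[] {1, 4, 7}, new int[][] {{0}, {3, 9}, {5}});
        Postings notice = new Postings(new int[] {4, 7, 8}, new int[][] {{10}, {2}, {1}});
        System.out.println(phraseMatches(death, notice));
    }
}
```

In this model every extra term adds another list to walk, and every shared document adds a positions lookup. On disk each of those lists lives in a different place, which is where the seeks described above come from.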
Search speed
I'm looking for tips on speeding up searches since I am a relatively new user of Lucene.

I've created a single index with 4.5 million documents. The index has about 22 fields, and one of those fields is the contents of the body tag, which can range from 5K to 35K. When I create the field (named "contents") that houses the contents of the body tag, the field is stored, indexed, and tokenized. The term position vectors are not stored.

Single word searches return pretty fast, but when I try phrases, searching seems to slow considerably. When constructing the query I am using the standard query object where analyzer is the StandardAnalyzer:

Code Example:
Query objQuery = QueryParser.parse(sSearchString, "contents", analyzer);

For example, the query contents:Zanesville returns over 163,000 hits in 78 milliseconds.

However, the query contents:"all parts including picture tube guaranteed" returns hits in 2890 milliseconds. Other phrases take longer as well.

My question is, are there any indexing tips (storing term vectors?) or query tips that I can use to speed up the searching of phrases?

Thanks in advance for any tips.