Re: Sharding Techniques
Hi,

Though our total index is 30 GB, the indexes used in 75%-80% of searches add up to 5 GB, and our average search time is around 700 ms. (Yes, the index is optimized.)

Could someone please shed some light on my original question? If I want to keep smaller indexes on different servers so that CPU and memory use can be optimized, how can I aggregate the results of a query from each of the servers? One thing I know of is RMI, which I studied a few years back, but that was too slow (or so I thought at the time). What other techniques are there?

Is 1 second a bad search time for the following?

total index size: 30 GB
index size used in 80% of searches: 5 GB
number of fields: 40
most of the fields are numeric fields
one big "contents" field with 500 - 1000 words
3500 queries / second mostly on
on average a query uses 7 fields (1 big, 6 small) with 25-30 tokens

Are there any benchmarks against which I can compare the performance of my application? Or any approximate formula that can guide me in calculating (using system parameters and index/search stats) the "best" expected search time?

Thanks in advance

On Tue, May 10, 2011 at 9:59 AM, Ganesh wrote:

> We are using a similar technique to yours. We keep smaller indexes and use
> ParallelMultiSearcher to search across them. Keeping smaller indexes is
> good, as indexing and index optimization are faster. There will be a small
> delay while searching across the indexes.
>
> 1. What is your search time?
> 2. Is your index optimized?
>
> I have a question: if we keep the index size at 30 GB, then each file
> (fdt, fdx etc.) would be gigabytes in size. Won't a small addition or
> deletion cause more IO, as it has to skip those bytes and write at the
> end of the file?
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Samarendra Pratap"
> To:
> Sent: Monday, May 09, 2011 5:26 PM
> Subject: Sharding Techniques
>
> > Hi list,
> > We have an index directory of 30 GB which is divided into 3 subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, ..., idx2-1, ..., idx3-1, ..., idx3-21).
> >
> > We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
> > soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
> >
> > We have almost 40 fields in each index (is it bad to have so many
> > fields?). Most of them are id-based fields.
> > We are using 8 servers for search, each of which receives approximately
> > 3000 queries/hour in peak hours, and a search time of more than 1 second
> > is considered bad (is it really bad?) as per the business requirement.
> >
> > For the past few months we have been experiencing issues (load and search
> > time) on our search servers, due to which I am looking for sharding
> > techniques. Can someone guide me or give me pointers to where I can read
> > more and test?
> >
> > Keeping parts of indexes on different servers, searching on all of them
> > and then merging the results - what could be the best approach?
> >
> > Let me tell you that most queries use only 6-7 indexes and 4-5 fields (to
> > search for), but some queries (searching all the data) require all the
> > indexes and are the primary cause of the performance degradation.
> >
> > Any suggestions/ideas are greatly appreciated. Furthermore, will
> > sharding (or a similar thing) really reduce search time? (Load is a less
> > severe issue when compared to search time.)
> >
> > --
> > Regards,
> > Samar
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Regards,
Samar
Re: Sharding Techniques
On May 10, 2011, at 9:42 AM, Samarendra Pratap wrote:

> If I want to keep smaller indexes on different servers so that CPU and
> memory may be optimized, how can I aggregate the results of a query from
> each of the server. One thing I know is RMI which I studied a few years
> back, but that was too slow (or i thought so that time). What are other
> techniques?

There is also http://katta.sourceforge.net/ out there...

Johannes
Re: Sharding Techniques
On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote:

> We have an index directory of 30 GB which is divided into 3 subdirectories
> (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> (idx1-1, idx1-2, ..., idx2-1, ..., idx3-1, ..., idx3-21).

So each part is about ½ GB in size? That gives you a serious logistic overhead. You state later that you only update the index once a day, so it would seem that you have no need for the fast update times that such small indexes give you. My guess is that you will get faster search times by using a single index.

Down to basics, Lucene searches work by locating terms and resolving documents from them. For standard term queries, a term is located by a process akin to binary search. That means that it uses log2(n) seeks to get the term. Let's say you have 10M terms in your corpus. If you stored that in a single field in a single index with a single segment, it would take log2(10M) ~= 24 seeks to locate a term. This is of course very simplified.

When you have 63 indexes, log2(n) works against you. Even with the unrealistic assumption that the 10M terms are evenly distributed and without duplicates, the number of seeks for a search that hits all parts will still be 63 * log2(10M/63) ~= 63 * 18 = 1134. And we haven't even begun to estimate the merging part.

Due to caching, a seek is not equal to the storage being hit, but the probability of a storage hit rises with the number of seeks and with the inevitable term duplicates when splitting the index.

> We have almost 40 fields in each index (is it a bad to have so many
> fields?). most of them are id based fields.

Nah, our index is about 40GB with 100+ fields and 8M documents. We use a single index, optimized to 5 segments. Response times for raw searches are a few ms, while response times for the full package (heavy faceting) are generally below 300ms. Our queries are mostly simple boolean queries across 13 fields.

> Keeping parts of indexes on different servers search on all of them and then
> merging the results - what could be the best approach?

Locate your bottleneck. Some well-placed log statements or a quick peek with visualvm (comes with the Oracle JVM) should help a lot.
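The seek arithmetic in the message above can be reproduced with a few lines of Java. This is only a sketch of the simplified binary-search model described there (one seek per halving step, terms spread evenly, no duplicates); the class and method names are made up for illustration, and it says nothing about how Lucene's term dictionary actually behaves once caching is involved.

```java
public class SeekEstimate {
    // Simplified model: locating one term among n terms costs ceil(log2(n)) seeks.
    static int seeksPerIndex(long terms) {
        return (int) Math.ceil(Math.log(terms) / Math.log(2));
    }

    // Total seeks when the terms are split evenly over `parts` indexes
    // and a query has to visit every part (merging cost is ignored).
    static int totalSeeks(long terms, int parts) {
        return parts * seeksPerIndex(terms / parts);
    }

    public static void main(String[] args) {
        System.out.println("1 index:    " + totalSeeks(10000000L, 1));   // 24
        System.out.println("63 indexes: " + totalSeeks(10000000L, 63));  // 63 * 18 = 1134
    }
}
```

Under this model the per-index cost only drops from 24 to 18 seeks when the index is cut into 63 parts, which is why the total grows almost linearly with the number of parts.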
Re: Sharding Techniques
Thanks

to Johannes - I am looking into katta. Seems promising.
to Toke - Great explanation. That's what I was looking for.

I'll come back and share my experience. Thank you very much.

--
Regards,
Samar
PDF Highlighting using PDF Highlight File
Hi all,

in our Lucene 3.0.3-based web application, when a user clicks on a hit link, the targeted PDF should be opened in the browser with the hits highlighted. For this purpose, using the Acrobat Highlight File (parameter xml, see http://www.pdfbox.org/userguide/highlighting.html and http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf) seems most reasonable to me.

Since the positions to highlight are given by (page and) character offsets, and Lucene uses offsets as well, I think it could be easy (for more Lucene-skilled people than me) to create a Highlighter which produces this highlight file. Does such a Highlighter already exist in the Lucene world? If not, could someone please point me in the right direction (e.g. where to hook into the existing (fast vector?) highlighter just to extract the offsets)?

BTW: Luke gave me the impression that term vectors are only stored when the field content is stored as well. Is that true?

Wulf
An unexpected network error occurred
Three instances of my application share one Lucene index directory.

Lucene version: 3.1
Lock factory: NativeFSLockFactory

Instance1: 64-bit JDK, 64-bit OS
Instance2: 64-bit JDK, 64-bit OS
Instance3: 32-bit JDK, 32-bit OS

When I try to search the index directory from Instance1, I get the error below:

An unexpected network error occurred

There is a write.lock file in the Lucene directory.

I cannot read or update data in the index from Instance1, but for the other two instances it works fine. Is there a way to handle such an error?

Thanks & Regards
Yogesh
Re: An unexpected network error occurred
A full stack trace dump is always helpful. Are the three instances on one server with a local index directory, or on different servers accessing a network drive (how?) or what? If the index is locked it would be surprising that you could update it from 2 of the instances.

--
Ian.
RE: SpanNearQuery - inOrder parameter
Anyone able to help me with the problem below?

Thanks
Greg

-----Original Message-----
From: Gregory Tarr [mailto:gregory.t...@detica.com]
Sent: 09 May 2011 12:33
To: java-user@lucene.apache.org
Subject: RE: SpanNearQuery - inOrder parameter

Attachment didn't work - test below:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Assert;
import org.junit.Test;

public class TestSpanNearQueryInOrder {

    @Test
    public void testSpanNearQueryInOrder() throws Exception {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory,
                new StandardAnalyzer(Version.LUCENE_29), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        TopDocsCollector collector = TopScoreDocCollector.create(3, false);

        Document doc = new Document();

        // DOC1
        doc.add(new Field("text", " ", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        doc = new Document();

        // DOC2
        doc.add(new Field("text", " ", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        doc = new Document();

        // DOC3
        doc.add(new Field("text", " ", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        // Open the index read-write, as in the original test
        IndexSearcher searcher = new IndexSearcher(directory, false);

        SpanQuery[] clauses = new SpanQuery[2];
        clauses[0] = new SpanTermQuery(new Term("text", ""));
        clauses[1] = new SpanTermQuery(new Term("text", ""));

        // Don't care about order, so setting inOrder = false
        SpanNearQuery q = new SpanNearQuery(clauses, 1, false);
        searcher.search(q, collector);

        // This assert fails - 3 docs are returned. Expecting only DOC2 and DOC3
        Assert.assertEquals("Check 2 results", 2, collector.getTotalHits());

        // A collector cannot be reused, so create a fresh one for the second query
        collector = TopScoreDocCollector.create(3, false);
        clauses = new SpanQuery[2];
        clauses[0] = new SpanTermQuery(new Term("text", ""));
        clauses[1] = new SpanTermQuery(new Term("text", ""));

        // Don't care about order, so setting inOrder = false
        q = new SpanNearQuery(clauses, 0, false);
        searcher.search(q, collector);

        // This assert fails - 3 docs are returned. Expecting only DOC2
        Assert.assertEquals("Check 1 result", 1, collector.getTotalHits());
    }
}

From: Gregory Tarr [mailto:gregory.t...@detica.com]
Sent: 09 May 2011 12:29
To: java-user@lucene.apache.org
Subject: SpanNearQuery - inOrder parameter

I attach a junit test which shows strange behaviour of the inOrder parameter on the SpanNearQuery constructor, using Lucene 2.9.4.

My understanding of this parameter is that true forces the order and false doesn't care about the order.

Using true always works. However, using false works fine when the terms in the query are distinct, but if they are equivalent, e.g. searching for "john john", I do not get the expected results. The workaround seems to be to always use true for queries with repeated terms.

Any help?

Thanks
Greg

Please consider the environment before printing this email.

This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies within the Detica Limited group of companies.

Detica Limited is registered in England under No: 1337451.

Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England.
Re: Sharding Techniques
> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses log(n) seeks to
> get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
>
> When you have 63 indexes, log(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.

This is true, but if the indexes are kept on 63 separate servers, those seeks will be carried out in parallel. The OP did indicate his indexes would be on different servers, I think? I still agree with your overall point - at this scale a single server is probably best. And if there are performance issues, I think the usual approach is to create multiple mirrored copies (slaves) rather than sharding. Sharding is useful for very large indexes: indexes too big to store on disk and cache in memory on one commodity box.

-Mike
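For the "search all servers, then merge" step that this thread keeps coming back to, the merge itself is small. Below is a minimal sketch in plain Java, not Lucene or Katta code: the Hit type and mergeTopN method are hypothetical names, and it assumes each server has already returned its own top hits and that scores from different shards are comparable, which in practice requires the shards to score with the same collection statistics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ShardMerge {
    // Hypothetical per-shard hit; not a Lucene class.
    static final class Hit {
        final String shard;
        final int doc;
        final float score;
        Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // Combine the top hits returned by each shard and keep the n best overall.
    static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        return new ArrayList<Hit>(all.subList(0, Math.min(n, all.size())));
    }
}
```

Each shard only needs to return its own top n hits for the merged top n to be correct, so the data shipped between servers stays small even when the shards are large.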
Re: Sharding Techniques
Hi Mike,

*"I think the usual approach is to create multiple mirrored copies (slaves) rather than sharding"*

This caught my eye. We do have mirrors, and in fact a good number of them. 6 servers are used for serving regular queries (2 more are for specific queries that do take time), and each of them receives around 3-3.5 K queries per hour in peak hours.

The problem is that the interface used by end users has a lot of options, plus a few text boxes where they can type up to 64 words each (and unfortunately I am not able to reduce these, as they are business requirements).

Normal queries run fine in under 500 ms, but when people start searching for "anything", some queries take more than 100 seconds. Don't you think distributing smaller indexes on different machines would reduce the average search time? (Although I have a feeling that search time for smaller queries may increase slightly.)

--
Regards,
Samar
RE: Sharding Techniques
Hi Samar,

>>Normal queries go fine under 500 ms but when people start searching
>>"anything" some queries take up to > 100 seconds. Don't you think
>>distributing smaller indexes on different machines would reduce the average
>>search time. (Although I have a feeling that search time for smaller queries
>>may be slightly increased)

What are the characteristics of your slow queries? Can you give examples? Are the slow queries always slow, or only under heavy load? Whether splitting into smaller indexes would help depends on just what your bottleneck is. It's not clear that your index is large enough for its size to be causing your bottleneck. We run indexes of about 350GB with average response times under 200ms and 99th percentile response times of under 2 seconds. (We have a very low qps rate, however.)

Tom
Query on using Payload with MoreLikeThis class
Hi, In the Lucene 2.9.4 project, there is a requirement to boost some of the keywords in the document using payload. Now while searching, is there a way I can boost the MoreLikeThis result using the index time payload values? Or can I merge MoreLikeThis output and PayloadTermQuery output somehow to get the final percentage output?
Re: SpanNearQuery - inOrder parameter
Since no one else is jumping in, I'll say that I suspect that the span query code does not bother to check to see if two of the terms are the same.

I think that would account for the behavior you are seeing. Since the second SpanTermQuery would match the same term the first one did.

Note that I'm not familiar with the span query code, so this is just a quick deduction. Not sure how easy it would be to add this duplicate term detection, if that's the problem.

Tom
RE: SpanNearQuery - inOrder parameter
: I attach a junit test which shows strange behaviour of the inOrder
: parameter on the SpanNearQuery constructor, using Lucene 2.9.4.
:
: My understanding of this parameter is that true forces the order and
: false doesn't care about the order.
:
: Using true always works. However using false works fine when the terms
: in the query are distinct, but if they are equivalent, e.g. searching
: for "john john", I do not get the expected results. The workaround seems
: to be to always use true for queries with repeated terms.

I don't think the situation of "overlapping spans" has changed much since this thread...

http://search.lucidimagination.com/search/document/ee23395e5a93c525/non_overlapping_span_queries#868b3a3ec6431afc

The crux of the issue (as I recall) is that there is really no conceptual reason why a query for "'john' near 'john', in any order, with slop of Z" shouldn't match a doc that contains only one instance of "john" ... the first SpanTermQuery says "I found a match at position X", the second SpanTermQuery says "I found a match at position Y", and the SpanNearQuery says "the difference between X and Y is less than Z", therefore I have a match. (The SpanNearQuery can't fail just because X and Y are the same -- they might be two distinct term instances, with different payloads perhaps, that just happen to have the same position.)

However, the inOrder == true case works because the SpanNearQuery enforces that "X must be less than Y", so the same term can't ever match twice.

-Hoss
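The reasoning above can be played through on bare term positions. The following is a toy model in plain Java, not Lucene's span code: the class and method names are invented, and the slop rule used here (the number of positions between the two matches must not exceed the slop) is an assumed simplification. It is only meant to show why a single "john" satisfies the unordered query but not the ordered one.

```java
public class SpanNearModel {
    // first / second: positions at which each clause's term occurs in one document.
    // For a "john near john" query both arrays hold the same positions.
    static boolean near(int[] first, int[] second, int slop, boolean inOrder) {
        for (int x : first) {
            for (int y : second) {
                // Ordered matching requires the first clause strictly before the second,
                // so one term instance can never fill both clauses.
                if (inOrder && x >= y) {
                    continue;
                }
                // Positions strictly between the two matches; -1 when x == y.
                int gap = Math.abs(x - y) - 1;
                if (gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] oneJohn = {0};   // a document containing "john" once
        System.out.println(near(oneJohn, oneJohn, 0, false)); // true: X == Y is accepted
        System.out.println(near(oneJohn, oneJohn, 0, true));  // false: needs X < Y
    }
}
```

In this model a document with "john" at positions 0 and 1 matches the ordered query with slop 0, and one with "john" at positions 0 and 2 needs slop 1, which lines up with the results the original test expected.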
Re: How do I sort lucene search results by relevance and time?
Thanks for your suggestion! I tried setting a document boost factor when indexing. To bubble up recent documents' scores, I set the boost for documents from the last three months to 2 and the boost for all other documents to 0.5. Then I search the index sorting by two fields: the Lucene default score, then time descending. The sorted results look good and meet my requirement. On Mon, May 9, 2011 at 6:31 PM, Ian Lea wrote: > Well, you can use one of the sorting search methods and pass multiple > sort keys including relevance and a timestamp. But I suspect the > Google algorithm may be a bit more complex than that. > > One technique is boosting: set an index time document boost on recent > documents. Of course what is recent today may not be next week. > There are other, more complex ways of customizing lucene scoring. A > Google search for something like "customized lucene scoring" will find > lots of info, some recent, some older, but probably all relevant one > way or another. > > > -- > Ian. > > > On Mon, May 9, 2011 at 4:59 AM, Johnbin Wang > wrote: > > What do I want to do is just like Google search results. The results in > the > > first page is the most relevant and also recent documents, but not > > absolutely sorted by time desc. > > > > -- > > cheers, > > Johnbin Wang -- cheers, Johnbin Wang
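As a rough illustration of the ordering described above, here is a Lucene-free sketch. Everything in it is hypothetical: the class `RecencySortToy`, the `Hit` fields, and the choice of 90 days as a stand-in for "three months". It also treats the boost as a plain multiplier on the score, which is a simplification of how an index-time boost feeds into Lucene scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of "boost recent documents, then sort by score and time desc".
class RecencySortToy {
    static final long DAY_MS = 24L * 60 * 60 * 1000;
    static final long RECENT_WINDOW_MS = 90 * DAY_MS; // assumption: "three months" ~ 90 days

    static final class Hit {
        final String id;
        final float rawScore;   // relevance before any recency boost
        final long timestampMs; // document time
        Hit(String id, float rawScore, long timestampMs) {
            this.id = id;
            this.rawScore = rawScore;
            this.timestampMs = timestampMs;
        }
    }

    // Boost values taken from the post: 2 for recent documents, 0.5 otherwise.
    static float boostedScore(Hit h, long nowMs) {
        float boost = (nowMs - h.timestampMs) <= RECENT_WINDOW_MS ? 2.0f : 0.5f;
        return h.rawScore * boost;
    }

    // Sort by boosted score descending, breaking ties by time descending.
    static List<String> rank(List<Hit> hits, long nowMs) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byScore = Float.compare(boostedScore(b, nowMs), boostedScore(a, nowMs));
            return byScore != 0 ? byScore : Long.compare(b.timestampMs, a.timestampMs);
        });
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        long now = 1000 * DAY_MS;
        List<Hit> hits = Arrays.asList(
            new Hit("A", 1.0f, now - 200 * DAY_MS), // old but highly relevant
            new Hit("B", 0.6f, now - 10 * DAY_MS),
            new Hit("C", 0.6f, now - 5 * DAY_MS));
        System.out.println(rank(hits, now));
    }
}
```

As Ian notes in the quoted reply, a boost fixed at index time goes stale: a document boosted as recent today keeps that boost until it is reindexed.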
Can I omit ShingleFilter's filler tokens
Hi, Can I remove the filler token _ from the n-gram-tokens that are generated by a ShingleFilter? I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter, and ShingleFilter to create phrase n-grams. The ShingleFilter inserts FILLER_TOKENs in place of the stopwords, but I don't want them. How can I omit the filler tokens? thanks Bill
Re: Sharding Techniques
We also use a similar technique: breaking the index into smaller indexes and searching them with ParallelMultiSearcher. We have to do incremental indexing, and records older than 6 months or 1 year (based on the ageout setting) should be deleted. Having multiple small indexes is really fast in terms of indexing. Since you guys mentioned keeping a single large index: search time would be faster, but indexing and index optimization will take more time. How are you handling that in the case of incremental indexing? If we keep the index size at 100+ GB then each file (fdt, fdx etc.) would be in GBs. Will a small addition or deletion to such a file not cause more IO, as it has to skip those bytes and write at the end of the file? Regards Ganesh - Original Message - From: "Burton-West, Tom" To: Sent: Tuesday, May 10, 2011 9:46 PM Subject: RE: Sharding Techniques Hi Samar, >>Normal queries go fine under 500 ms but when people start searching >>"anything" some queries take up to > 100 seconds. Don't you think >>distributing smaller indexes on different machines would reduce the average >>search time. (Although I have a feeling that search time for smaller queries >>may be slightly increased) What are the characteristics of your slow queries? Can you give examples? Are the slow queries always slow or only under heavy load? Whether splitting into smaller indexes would help depends on just what your bottleneck is. It's not clear that your index is large enough that the size of the index is causing your bottleneck. We run indexes of about 350GB with average response times under 200ms and 99th percentile response times of under 2 seconds. (We have a very low qps rate however). Tom
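For the question raised earlier in this thread about aggregating results from several smaller indexes, the merge step can be sketched without any Lucene classes. The names `ShardMergeToy`, `Hit`, and `mergeTopK` are hypothetical, and the sketch assumes scores from different shards are directly comparable, which may not hold if each shard computes term statistics independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of merging per-shard hit lists into one global top-k list.
class ShardMergeToy {
    static final class Hit {
        final String shard; // which sub-index produced the hit
        final int doc;      // document id local to that shard
        final float score;
        Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // Keep only the k best hits seen so far in a min-heap ordered by score,
    // then return them in descending score order.
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        PriorityQueue<Hit> heap = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (List<Hit> hits : perShard) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // evict the current lowest-scoring hit
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort((a, b) -> Float.compare(b.score, a.score));
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> shard1 = Arrays.asList(new Hit("idx1", 0, 0.9f), new Hit("idx1", 1, 0.4f));
        List<Hit> shard2 = Arrays.asList(new Hit("idx2", 0, 0.7f), new Hit("idx2", 1, 0.1f));
        for (Hit h : mergeTopK(Arrays.asList(shard1, shard2), 3)) {
            System.out.println(h.shard + " doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

Each shard only needs to return its own top k hits, so the merge cost grows with the number of shards times k rather than with the total index size.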