Re: Lucene search problem
Hi Erick,

I agree that Lucene does not index the object itself. In the example below, the quoted fields are indexed as chain.chainName, and I am able to retrieve Recipe objects with a FullTextQuery such as "chain.chainName:something". The question is that in some cases the chain itself is null. I could achieve what I need like this:

class Recipe {
    @DocumentId
    Integer id;
    @IndexedEmbedded
    Chain chain = new Chain();
    // getter and setter
}

class Chain {
    @DocumentId
    Integer id;
    @Field(index = Index.TOKENIZED, name = "chainName")
    String name = "NULLNULNUL";
    // getter and setter
}

This means there will always be a Chain alongside the Recipe object, with a name defaulting to "NULLNULNUL", and that will be indexed. But we don't want to do that: Recipe is our persistence object, and we hate polluting it with dummy values.

-Amar

On Tue, Dec 23, 2008 at 8:05 PM, Erick Erickson wrote:
> How do you intend to index these? Lucene will not
> index objects for you. You have to break the object
> down into a series of fields. At that point you can
> substitute whatever you want.
>
> Best
> Erick
>
> On Tue, Dec 23, 2008 at 3:36 AM, wrote:
> >
> > Hi Aaron Schon/Erick,
> >
> > That really makes sense to me, but it would be easy if it were a String
> > field. See the object structure I have below; hopefully that gives you
> > some idea:
> >
> > class Recipe {
> >     @DocumentId
> >     Integer id;
> >     @IndexedEmbedded
> >     Chain chain;
> >     // getter and setter
> > }
> >
> > class Chain {
> >     @DocumentId
> >     Integer id;
> >     @Field(index = Index.TOKENIZED, name="chainName")
> >     String name;
> >     // getter and setter
> > }
> >
> > I am creating the index on the Recipe object, and for some recipes
> > recipe.m_chain would be null. So can you tell me how to assign the value
> > "NULLNULNULLNULL" for the chain object in Recipe?
> >
> > I was also wondering whether a FieldBridge could help here. My plan was to
> > supply a default value where chain is null, as you mentioned, but it does
> > not seem to work for null values.
> >
> > Please suggest.
> >
> > Thanks in advance.
> > -Amar
> >
> > On Tue, Dec 23, 2008 at 12:04 AM, Aaron Schon wrote:
> > >
> > > I would second Erick's recommendation - create an arbitrary representation
> > > for NULL such as "NULL" (if you are certain the term "NULL" does not occur
> > > in actual docs). Alternatively, use "NULLNULNULLNULL" or something to that
> > > effect.
> > >
> > > ----- Original Message ----
> > > From: Erick Erickson
> > > To: java-user@lucene.apache.org
> > > Sent: Monday, December 22, 2008 8:58:21 AM
> > > Subject: Re: Lucene search problem
> > >
> > > Try searching the mailing list archives for a fuller discussion, but
> > > the short answer is usually to index a unique value for your
> > > "null" entries, then search on that - something totally
> > > outrageous like, say, AAABBBCCCDDDEEEFFF.
> > >
> > > Alternatively, you could create, at startup time, a
> > > Filter of all the docs that *do* contain terms for the
> > > field in question, flip the bits and use the Filter in your
> > > searches. (Hint: see TermDocs/TermEnum)
> > >
> > > Best
> > > Erick
> > >
> > > On Mon, Dec 22, 2008 at 8:11 AM, wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have a problem with Lucene search; I am quite new to this. Can somebody
> > > > help, or just point me to someone who can, please?
> > > >
> > > > The problem I am facing: we need to search for objects whose attribute
> > > > "chain" contains null, but Lucene does not index the null values.
> > > >
> > > > How can I achieve this? Or please guide me to an alternative way of
> > > > doing this.
> > > >
> > > > Thanks in advance.
> > > > -Amar
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> > --
> > Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
> > # 9886476270, amarsann...@atharvalibson.com

--
Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
# 9886476270, amarsann...@atharvalibson.com
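The workaround the thread converges on - indexing a sentinel token in place of null - can live in the indexing code rather than in the persistent entity. A minimal stdlib-only sketch of that substitution logic (the `NullSentinel` class and `chainNameTerm` helper are hypothetical names for illustration, not Hibernate Search or Lucene API):

```java
// Sketch: map a possibly-null chain name to the token that gets indexed,
// so the Recipe entity itself never needs a dummy Chain instance.
public class NullSentinel {
    // Any token guaranteed not to occur in real chain names.
    static final String NULL_SENTINEL = "NULLNULNULLNULL";

    // Value to index for the chain.chainName field.
    static String chainNameTerm(String chainName) {
        return (chainName == null || chainName.isEmpty()) ? NULL_SENTINEL
                                                          : chainName;
    }

    public static void main(String[] args) {
        System.out.println(chainNameTerm(null));      // the sentinel token
        System.out.println(chainNameTerm("Alfredo")); // the real name
    }
}
```

With Hibernate Search, this logic would typically go in a custom FieldBridge (as the thread suggests), so that a query like chain.chainName:NULLNULNULLNULL finds recipes whose chain is null.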
Re: Optimize and Out Of Memory Errors
Mark Miller wrote:
> Lebiram wrote:
> > Also, what are norms?
>
> Norms are a byte value per field, stored in the index, that is factored into
> the score. They are used for length normalization (shorter documents = more
> important) and index-time boosting. If you want either of those, you need
> norms.
>
> When norms are loaded into an IndexReader, they are loaded into a
> byte[maxDoc] array per field - so even if only one document out of 400
> million has a given field, it still loads byte[maxDoc] for that field (a lot
> of wasted RAM). Did you say you had 400 million docs and 7 fields? Google
> says that would be:
>
>     400 million x 7 bytes = 2,670.28809 megabytes
>
> on top of your other RAM usage.

Just to avoid confusion, that should really read "a byte per document per field". If I remember right, it gives 255 boost possibilities, limited to 25 with length normalization.
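The arithmetic behind that estimate is one norm byte per document per field with norms enabled, loaded eagerly regardless of how sparse the field is. A quick stdlib-only check, using the numbers from the thread (the `NormsMath` class name is just for illustration):

```java
// Norms RAM estimate: one byte[maxDoc] array per field with norms enabled.
public class NormsMath {
    static long normsBytes(long maxDoc, int fieldsWithNorms) {
        // One byte per document per field, dense even for sparse fields.
        return maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long bytes = normsBytes(400_000_000L, 7);
        double megabytes = bytes / (1024.0 * 1024.0);
        // 400M docs x 7 fields -> about 2,670 MB of heap just for norms.
        System.out.printf("%d bytes = %.2f MB%n", bytes, megabytes);
    }
}
```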
Re: Optimize and Out Of Memory Errors
Lebiram wrote:
> Also, what are norms?

Norms are a byte value per field, stored in the index, that is factored into the score. They are used for length normalization (shorter documents = more important) and index-time boosting. If you want either of those, you need norms.

When norms are loaded into an IndexReader, they are loaded into a byte[maxDoc] array per field - so even if only one document out of 400 million has a given field, it still loads byte[maxDoc] for that field (a lot of wasted RAM). Did you say you had 400 million docs and 7 fields? Google says that would be:

    400 million x 7 bytes = 2,670.28809 megabytes

on top of your other RAM usage.
Re: Optimize and Out Of Memory Errors
>> how do I turn off norms and where is it set?

doc.add(new Field("field2", "sender" + i, Field.Store.NO,
        Field.Index.ANALYZED_NO_NORMS));

----- Original Message ----
From: Lebiram
To: java-user@lucene.apache.org
Sent: Tuesday, 23 December, 2008 17:03:07
Subject: Re: Optimize and Out Of Memory Errors

[quoted text snipped]
Re: Optimize and Out Of Memory Errors
Hi All,

Thanks for the replies. I've just managed to reproduce the error on my test machine. What we did was generate about 100,000,000 documents with about 7 fields each, with terms from 1 to 10. After the index reached about 20GB, we ran an optimize and it produced 1 big index of about 17GB.

Now, when we do a normal search, it just fails. Here is the stack trace:

2008-12-23 16:56:05,388 [main] INFO LuceneTesterMain - Max Memory:1598226432
2008-12-23 16:56:05,388 [main] INFO LuceneTesterMain - Available Memory:854133192
2008-12-23 16:56:05,388 [main] ERROR LuceneTesterMain - Seaching failed.
java.lang.OutOfMemoryError
    at java.io.RandomAccessFile.readBytes(Native Method)
    at java.io.RandomAccessFile.read(RandomAccessFile.java:315)
    at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:550)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:87)
    at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:834)
    at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:335)
    at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
    at org.apache.lucene.search.Searcher.search(Searcher.java:132)

The search code is quite simple, in fact:

2008-12-23 16:56:05,076 [main] INFO LuceneTesterMain - query.toString()=+content:test

and the code to search:

Filter filter = new RangeFilter("timestamp",
        DateTools.dateToString(start, DateTools.Resolution.SECOND),
        DateTools.dateToString(end, DateTools.Resolution.SECOND),
        true, true);
Searcher = new IndexSearcher(FSDirectory.getDirectory(IndexName));
TopDocs hits = Searcher.search(query, filter, 1000);

I really have no idea why this is breaking. Also, what are norms, how do I turn off norms, and where is that set?

This is the code for adding documents:

Document doc = new Document();
doc.add(new Field("id", String.valueOf(i), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("timestamp", DateTools.dateToString(new Date(), DateTools.Resolution.SECOND), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("content", LuceneTesterMain.StaticString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field2", "sender" + i, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field3", LuceneTesterMain.StaticString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field4", "group" + i, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field5", "groupId" + i, Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);

----- Original Message ----
From: mark harwood
To: java-user@lucene.apache.org
Sent: Tuesday, December 23, 2008 2:42:25 PM
Subject: Re: Optimize and Out Of Memory Errors

[quoted text snipped]
Re: lucene explanation
That worked perfectly. Thanks a lot!

Sincerely,
Chris Salem

----- Original Message -----
To: java-user@lucene.apache.org
From: Erick Erickson
Sent: 12/22/2008 5:00:51 PM
Subject: Re: lucene explanation

Warning! I'm really reaching on this...

But it seems you could use TermDocs/TermEnum to good effect here. Basically, for a given term, you should be able to use the above to determine pretty efficiently whether doc N had a hit in one of your fields. There's even a WildcardTermEnum that will iterate over wildcards.

Filters are surprisingly fast to construct, so you could use the above to construct a filter on each term for each field. Then determining whether the doc is a hit for a particular field is just a matter of seeing whether that bit is on in the relevant filter.

Either one should be way under 30 seconds, although I don't know how big your index is or how encompassing your wildcard searches are...

FWIW
Erick

On Mon, Dec 22, 2008 at 4:48 PM, Chris Salem wrote:
> Hello,
> I'm wondering what the best way to accomplish this is.
> When a user enters text to search on, it customarily searches 3 fields:
> resume_text, profile_text, and summary_text, so a standard query would be
> something like:
>
> (resume_text:(query) OR profile_text:(query) OR summary_text:(query))
>
> For each hit (up to 50) I'd like to find out which part of the query
> matched the document. Right now I use the Explanation object; here's
> the code:
>
> int len = hits.length();
> if (len > 50) len = 50;
> for (int i = 0; i < len; i++) {
>     Explanation ex = searcher.explain(Query.parse("resume_text:(query)"), hits.id(i));
>     if (ex.isMatch()) ...
>     ex = searcher.explain(Query.parse("profile_text:(query)"), hits.id(i));
>     if (ex.isMatch()) ...
>     ex = searcher.explain(Query.parse("summary_text:(query)"), hits.id(i));
>     if (ex.isMatch()) ...
> }
>
> This works fine with regular queries, but if someone does a query with a
> wildcard, search times increase to more than 30 seconds. Is there a better
> way to do this?
>
> Thanks
> Sincerely,
> Chris Salem
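Erick's filter idea boils down to bitset intersection: build one bit set per field of the documents matching the term, then test membership per hit instead of running an Explanation per document. A stdlib-only sketch with java.util.BitSet (the doc ids and the prebuilt sets are illustrative; in Lucene they would be filled from TermDocs or a cached Filter):

```java
import java.util.BitSet;

// Sketch: decide which field(s) matched a hit by testing prebuilt
// per-field bit sets instead of calling searcher.explain() per document.
public class FieldMatch {
    static boolean matchedIn(BitSet fieldDocs, int docId) {
        return fieldDocs.get(docId);
    }

    public static void main(String[] args) {
        // Pretend these were filled from TermDocs for the query term.
        BitSet resumeDocs = new BitSet();
        resumeDocs.set(3);
        resumeDocs.set(7);
        BitSet profileDocs = new BitSet();
        profileDocs.set(7);

        int hit = 7; // a doc id returned by the search
        System.out.println("resume_text matched: " + matchedIn(resumeDocs, hit));
        System.out.println("profile_text matched: " + matchedIn(profileDocs, hit));
    }
}
```

Testing a bit is O(1) per hit and per field, which is why this avoids the 30-second wildcard penalty: the expensive term expansion happens once when the bit sets are built, not once per hit.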
Re: Optimize and Out Of Memory Errors
I've had reports of OOM exceptions during optimize on a couple of large deployments recently (based on Lucene 2.4.0). I've given the usual advice of turning off norms, providing plenty of RAM, and also suggested setting IndexWriter.setTermIndexInterval().

I don't have access to these deployment environments and have tried hard to reproduce the circumstances that lead to this. For the record, I've experimented with huge indexes with hundreds of fields: several "unique value" fields (e.g. primary keys), "fixed-vocab" fields with limited values (e.g. male/female), and fields with "power-curve" distributions (e.g. plain text). I've wound my index up to 22GB with several commit sessions involving deletions, full optimises and partial optimises along the way. Still no error.

However, the errors that have been reported to me from 2 different environments with large indexes make me think there is still something to be uncovered here...

----- Original Message ----
From: Michael McCandless
To: java-user@lucene.apache.org
Cc: Utan Bisaya
Sent: Tuesday, 23 December, 2008 14:08:26
Subject: Re: Optimize and Out Of Memory Errors

[quoted text snipped]
Re: QueryWrapperFilter
My first bit of advice would be to step back, take a deep breath, and "take off your DB hat". Lucene is a *text* search application, not an RDBMS. The usual solution is to flatten your data representation when you index, so you can use simpler searches. Others have posted that it's hard to use Lucene to express relationships satisfactorily.

Best
Erick

On Tue, Dec 23, 2008 at 5:24 AM, csantos wrote:
>
> Hello,
>
> I need to filter a full-text search against a query. That means: I search a
> term in an indexed entity "A"; A contains an embedded index "B"; entity B
> has an m:1 bidirectional relationship with entity "C"; and the foreign key
> in "B" is "c_id". My filter condition would be like "filter the full-text
> search for entries where c_id equals some value", where the value is given.
>
> I thought of using the QueryWrapperFilter. But the JavaDoc says for
> TermQuery: "A Query that matches documents containing a term." My problem
> is that the field I want to use does not appear in the Lucene index. What
> is the best approach?
>
> Thanks in advance
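"Flattening" here means copying the relational attribute into the indexed document at index time, so the filter becomes a plain term match. A stdlib-only sketch of the idea (the field names and the map-as-document representation are illustrative, not Hibernate Search or Lucene API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: denormalize B's foreign key c_id into the document indexed
// for A, so "filter where c_id = 42" becomes an ordinary term match.
public class Flatten {
    static Map<String, String> toDocument(int aId, int cId, String text) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", Integer.toString(aId));
        doc.put("text", text);                    // field searched by full text
        doc.put("b.c_id", Integer.toString(cId)); // flattened relation
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = toDocument(1, 42, "some searchable text");
        // Filtering is now a simple term comparison on the flattened field.
        System.out.println("matches c_id=42: " + "42".equals(doc.get("b.c_id")));
    }
}
```

Once the foreign key is a field in the index, a QueryWrapperFilter around a TermQuery on "b.c_id" works exactly as the JavaDoc describes, because the term now actually exists in the index.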
Re: Lucene search problem
How do you intend to index these? Lucene will not index objects for you. You have to break the object down into a series of fields. At that point you can substitute whatever you want.

Best
Erick

On Tue, Dec 23, 2008 at 3:36 AM, wrote:
> Hi Aaron Schon/Erick,
>
> That really makes sense to me, but it would be easy if it were a String
> field. See the object structure I have below; hopefully that gives you
> some idea:
>
> class Recipe {
>     @DocumentId
>     Integer id;
>     @IndexedEmbedded
>     Chain chain;
>     // getter and setter
> }
>
> class Chain {
>     @DocumentId
>     Integer id;
>     @Field(index = Index.TOKENIZED, name="chainName")
>     String name;
>     // getter and setter
> }
>
> I am creating the index on the Recipe object, and for some recipes
> recipe.m_chain would be null. So can you tell me how to assign the value
> "NULLNULNULLNULL" for the chain object in Recipe?
>
> I was also wondering whether a FieldBridge could help here. My plan was to
> supply a default value where chain is null, as you mentioned, but it does
> not seem to work for null values.
>
> Please suggest.
>
> Thanks in advance.
> -Amar
>
> On Tue, Dec 23, 2008 at 12:04 AM, Aaron Schon wrote:
> >
> > I would second Erick's recommendation - create an arbitrary representation
> > for NULL such as "NULL" (if you are certain the term "NULL" does not occur
> > in actual docs). Alternatively, use "NULLNULNULLNULL" or something to that
> > effect.
> >
> > ----- Original Message ----
> > From: Erick Erickson
> > To: java-user@lucene.apache.org
> > Sent: Monday, December 22, 2008 8:58:21 AM
> > Subject: Re: Lucene search problem
> >
> > Try searching the mailing list archives for a fuller discussion, but
> > the short answer is usually to index a unique value for your
> > "null" entries, then search on that - something totally
> > outrageous like, say, AAABBBCCCDDDEEEFFF.
> >
> > Alternatively, you could create, at startup time, a
> > Filter of all the docs that *do* contain terms for the
> > field in question, flip the bits and use the Filter in your
> > searches. (Hint: see TermDocs/TermEnum)
> >
> > Best
> > Erick
> >
> > On Mon, Dec 22, 2008 at 8:11 AM, wrote:
> > >
> > > Hi,
> > >
> > > I have a problem with Lucene search; I am quite new to this. Can somebody
> > > help, or just point me to someone who can, please?
> > >
> > > The problem I am facing: we need to search for objects whose attribute
> > > "chain" contains null, but Lucene does not index the null values.
> > >
> > > How can I achieve this? Or please guide me to an alternative way of
> > > doing this.
> > >
> > > Thanks in advance.
> > > -Amar
>
> --
> Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd.,
> # 9886476270, amarsann...@atharvalibson.com
Re: Combining results of multiple indexes
You're kind of in uncharted territory. I've been watching this list for quite a while, and you're the first person I remember who's said "indexing speed is more important than querying speed".

Mostly I'll leave responses to folks who understand the guts of indexing, except to say that for point (e) you can always use a Sort object in your queries that ensures this. And merging indexes occurs in order. That is, if you merge index1, index2 and index3, the following holds true, based purely on the position of each index in the merge array:

    max(id in index1) < min(id in index2)
    max(id in index2) < min(id in index3)

Best
Erick

On Tue, Dec 23, 2008 at 2:44 AM, Preetham Kajekar (preetham) <preet...@cisco.com> wrote:
> Hi Erick,
> Thanks for the heads up. I understand that I am using an implementation
> detail rather than a feature.
>
> Looks like having a single index is the best option. Hence, are there any
> optimizations (to improve indexing speed) you would suggest, given that:
>
> a) once a doc is added to an index, it will not get modified/deleted
> b) all the fields added are keywords (mostly numbers) - no analysis is
>    required
> c) indexing speed is more important than querying speed
> d) every document is the same - there is no boost or relevancy required
> e) query results should be sorted in the order they were indexed
>
> Thanks,
> ~preetham
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, December 19, 2008 12:12 AM
> To: java-user@lucene.apache.org
> Subject: Re: Combining results of multiple indexes
>
> I would recommend, very strongly, that you don't rely on the doc IDs
> being the same in two different indexes. Doc IDs are just incremented by
> one for each doc added, but...
>
> Optimization can change the doc IDs, and is guaranteed to change at
> least some of them if there are deletions from your index. If you, for
> whatever reason, indexed document N in one index and then skipped
> it in the other, all subsequent document IDs would not match. If...
>
> The fact that your IDs are the same is more than undocumented, it
> is coincidental.
>
> Best
> Erick
>
> On Thu, Dec 18, 2008 at 11:46 AM, Preetham Kajekar wrote:
>
> > Hi,
> > I noticed that the doc id is the same. So, if I have a HitCollector, I can
> > just collect the doc ids from both Searchers (for the two indexes) and
> > find the intersection between them. Also, getting the doc ids is fast even
> > when there are large numbers of hits.
> >
> > Of course, I am using something undocumented in Lucene.
> >
> > Thanks,
> > ~preetham
> >
> > Preetham Kajekar wrote:
> >
> >> Thanks. Yep, the code is very easy. However, it takes about 3 mins to
> >> complete merging.
> >>
> >> Looks like I will need to have out-of-band merging of indexes once they
> >> are closed (planning to store about 50 mil entries in each index
> >> partition).
> >>
> >> However, as the data is being indexed, is there any other way to combine
> >> results?
> >>
> >> I could get the results of one index, get all the hits, and then apply
> >> this as a filter for the next index. But if there are large numbers of
> >> hits (which is likely to be the case), this would not perform too well.
> >>
> >> Do you think the document id can be used in any way? How is the document
> >> id generated? After all, I have the two indexes operating on a common
> >> List of objects. Would the doc ids in index1 and index2 for object N in
> >> the list be the same?
> >>
> >> Thanks,
> >> ~preetham
> >>
> >> Erick Erickson wrote:
> >>
> >>> You will be stunned at how easy it is. The merging code should be
> >>> a dozen lines (and that only if you are merging 6 or so indexes).
> >>>
> >>> See IndexWriter.addIndexes or
> >>> IndexWriter.addIndexesNoOptimize
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Thu, Dec 18, 2008 at 5:03 AM, Preetham Kajekar wrote:
> >>>
> >>>> Hi,
> >>>> I tried out a single IndexWriter used by two threads to index
> >>>> different fields. It is slower than using two separate IndexWriters.
> >>>> These are my findings:
> >>>>
> >>>> All Fields (9) using 1 IndexWriter, 1 Thread - 38,000 objects per sec
> >>>> 5 Fields using 1 IndexWriter, 1 Thread - 62,000 objects per sec
> >>>> All Fields (9) using 1 IndexWriter, 2 Threads - 29,000 objects per sec
> >>>> All Fields (9) using 2 IndexWriters, 2 Threads - 55,000 objects per sec
> >>>>
> >>>> So, it looks like I will have to figure out how to combine results of
> >>>> multiple indexes.
> >>>>
> >>>> Thanks,
> >>>> ~preetham
> >>>>
> >>>> Preetham Kajekar wrote:
> >>>>
> >>>> > Thanks Erick and Michael.
> >>>> > I will try out these suggestions and post my findings.
> >>>> >
> >>>> > ~preetham
> >>>> >
> >>>> > Erick Erickson wrote:
> >>>> >
> >>>> >> Well, maybe if I'd read the original post more carefully I'd have
> >>>> >> figured that out, sorry 'bout that.
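The ordering guarantee Erick describes follows from how merged doc ids are assigned: each index's documents keep their relative order, offset by the total maxDoc of the indexes merged before it. A stdlib-only sketch of that arithmetic (the `docBases` helper mirrors the docBase notion Lucene's multi-readers use internally, but it is an illustration, not the Lucene API):

```java
// Sketch: global doc ids after merging index1, index2, index3 in order.
public class MergeOrder {
    // docBase for index i = sum of maxDoc of all earlier indexes;
    // a local id d in index i becomes global id docBase[i] + d.
    static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length];
        int sum = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i] = sum;
            sum += maxDocs[i];
        }
        return bases;
    }

    public static void main(String[] args) {
        // Three indexes with 100, 250 and 50 docs merged in that order.
        int[] bases = docBases(new int[] {100, 250, 50});
        // Max global id in index1 (99) < min global id in index2 (100), etc.
        System.out.println(bases[0] + " " + bases[1] + " " + bases[2]);
    }
}
```

Because each docBase exceeds the largest global id of every earlier index, max(id in index1) < min(id in index2) holds automatically, which is why merge order preserves the "sorted as indexed" property the poster needs.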
Re: Optimize and Out Of Memory Errors
How many indexed fields do you have, overall, in the index? If you have a very large number of fields that are "sparse" (meaning any given document would only have a small subset of the fields), then norms could explain what you are seeing. Norms are not stored sparsely, so when segments get merged the "holes" get filled (occupy bytes on disk and in RAM) and consume more resources. Turning off norms on sparse fields would resolve it, but you must rebuild the entire index since if even a single doc in the index has norms enabled for a given field, it "spreads". Mike Utan Bisaya wrote: Recently, our lucene index version was upgraded to 2.3.1 and the index had to be rebuilt for several weeks which made the entire index a total of 20 GB or so. After the the rebuild, a weekly sunday task was executed for optimization. During that time, the optimization failed several times complaining about OOM errors but then after a couple of tries, it completes. So the entire index is now 1 segment that is 20 GB. The problem is that any subsequent searches on that index fails with OOM errors at Lucene's reading of bytes. Our environment: jvm Xmx1600 (This is the max we could set the box since it's windows) 8G Memory available on box 4G CPU (8 core) but only 12.5% is used. (Not sure if this would impact it) Harddisk available is 120GB mergeFactor, and other lucene config is set at default. We checked this 20GB using luke and it has 400,000,000 documents in it. It was able to count the docs however when we do the search in Luke it fails giving us OOM errors. We also did the check index and the tool fails at the 20GB segment but succeeds on the others. We've managed to rollback to a previously unoptimized index copy (about 20GB or so) and the searches were find now. This unoptimized index is made up of several 8GB segments and a few smaller segments. However there is a big possibility that the optimization error could happen again... Does anybody have insights on why this is happening? 
Re: Multiple IndexReaders from the same Index Directory - issues with Locks / performance
Locking is completely unused by IndexReader unless you do deletes or change norms, so sharing a remote-mounted index is just fine (except for performance concerns). If you're using 2.4, you should open your readers with readOnly=true.

Mike

Tomer Gabel wrote:
> Ultimately it depends on your specific usage patterns. Generally speaking, if you
> have IndexReaders (and do not use their delete functionality) you don't need
> locking at all; you can use a no-op lock factory, in which case you'll pretty much
> only be constrained by your storage subsystem.
>
> Kay Kay-3 wrote:
>> For one of our projects we were planning to have a system of multiple individual
>> Lucene readers (read-only instances, no writes whatsoever) on different physical
>> machines, each with its IndexReader warmed up from the same index directory.
>>
>> I was reading about the locks (implemented as files) that Lucene uses internally.
>> I am just curious whether using multiple readers, all sharing the same index
>> directory (across NFS or similar network-mounted storage), would be a feasible
>> option here in terms of locking, etc.
>>
>> Would there be a performance hit (ignoring the NFS-related performance, of
>> course) that would hinder multiple readers serving search queries simultaneously
>> from the same set of index files?
Re: Multiple IndexReaders from the same Index Directory - issues with Locks / performance
Ultimately it depends on your specific usage patterns. Generally speaking, if you have IndexReaders (and do not use their delete functionality) you don't need locking at all; you can use a no-op lock factory, in which case you'll pretty much only be constrained by your storage subsystem.

Kay Kay-3 wrote:
> For one of our projects we were planning to have a system of multiple individual
> Lucene readers (read-only instances, no writes whatsoever) on different physical
> machines, each with its IndexReader warmed up from the same index directory.
>
> I was reading about the locks (implemented as files) that Lucene uses internally.
> I am just curious whether using multiple readers, all sharing the same index
> directory (across NFS or similar network-mounted storage), would be a feasible
> option here in terms of locking, etc.
>
> Would there be a performance hit (ignoring the NFS-related performance, of course)
> that would hinder multiple readers serving search queries simultaneously from the
> same set of index files?

--
http://www.tomergabel.com
Tomer Gabel
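Putting Mike's and Tomer's advice together, a sketch assuming the Lucene 2.4 API (the index path is hypothetical): open each reader read-only over a no-op lock factory, since pure readers never take the write lock.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NoLockFactory;

public class ReadOnlyReaderExample {
    public static IndexReader openSharedIndex(String path) throws Exception {
        // No-op lock factory: safe because these readers never delete docs
        // or change norms, so no lock file is ever needed on the NFS mount.
        FSDirectory dir = FSDirectory.getDirectory(new File(path),
                                                   NoLockFactory.getNoLockFactory());
        // Lucene 2.4: readOnly=true also removes internal synchronization
        // that can bottleneck concurrent searches.
        return IndexReader.open(dir, true);
    }
}
```

Each machine can call `openSharedIndex("/mnt/shared/index")` independently; the only coordination required is reopening readers after the index is updated.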
QueryWrapperFilter
Hello,

I need to filter a full-text search with a query. That is, I search for a term in an indexed entity "A"; A contains an embedded index of "B", and entity B has an m:1 bidirectional relationship with entity "C" (the foreign key in "B" is "c_id"). My filter condition would be "restrict the full-text search to entries where c_id equals some given value".

I thought of using QueryWrapperFilter, but the JavaDoc for TermQuery says: "A Query that matches documents containing a term." My problem is that the field I want to filter on does not appear in the Lucene index. What is the best approach?

Thanks in advance
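One hedged approach, assuming Hibernate Search's @IndexedEmbedded prefix naming (all class, property, and field names below are illustrative, not from the original post): get the association's id into the Lucene document, then filter on it with a TermQuery wrapped in a QueryWrapperFilter.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class CidFilterExample {
    // Assumes B carries:   @IndexedEmbedded C c;
    // and C's id is marked @DocumentId. Hibernate Search then indexes the id
    // under a prefixed, untokenized field name such as "b.c.id", so the FK
    // becomes searchable even though "c_id" itself is only a database column.
    public static Filter byCId(Integer cId) {
        Query q = new TermQuery(new Term("b.c.id", cId.toString()));
        return new QueryWrapperFilter(q);
    }
}
```

The resulting filter can be attached to the full-text query (e.g. `fullTextQuery.setFilter(byCId(42))`). The exact field name depends on your property names and any @IndexedEmbedded prefix overrides; inspecting the index with Luke is the quickest way to confirm it.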
Re: Lucene search problem
Hi Aaron Schon/Erick,

That really makes sense to me, but it only seems easy if it is a String field. See the object structure I have below; hopefully that gives you some idea:

class Recipe {
    @DocumentId
    Integer id;
    @IndexedEmbedded
    Chain chain;
    //getter and setter
}

class Chain {
    @DocumentId
    Integer id;
    @Field(index = Index.TOKENIZED, name = "chainName")
    String name;
    //getter and setter
}

I am creating the index on the Recipe object, and for some recipes the chain would be null. So can you tell me how I can assign the value "NULLNULNULLNULL" for the chain object in Recipe?

I was also thinking a FieldBridge might help me here. My plan was to emit a default value where chain is null, as you mentioned, but it does not seem to work for null values. Please suggest.

Thanks in advance.
-Amar

On Tue, Dec 23, 2008 at 12:04 AM, Aaron Schon wrote:
> I would second Erick's recommendation - create an arbitrary representation for
> NULL such as "NULL" (if you are certain the term "NULL" does not occur in actual
> docs). Alternatively, use "NULLNULNULLNULL" or something to that effect.
>
> ----- Original Message -----
> From: Erick Erickson
> To: java-user@lucene.apache.org
> Sent: Monday, December 22, 2008 8:58:21 AM
> Subject: Re: Lucene search problem
>
> Try searching the mailing list archives for a fuller discussion, but the short
> answer is usually to index a unique value for your "null" entries, then search
> on that - something totally outrageous like, say, AAABBBCCCDDDEEEFFF.
>
> Alternatively, you could create, at startup time, a Filter of all the docs that
> *do* contain terms for the field in question, flip the bits, and use the Filter
> in your searches. (Hint: see TermDocs/TermEnum.)
>
> Best
> Erick
>
> On Mon, Dec 22, 2008 at 8:11 AM, wrote:
>> Hi,
>>
>> I have a problem with Lucene search; I am quite new to this. Can somebody help,
>> or just point me to who can, please?
>>
>> The problem I am facing: we need to search for objects whose attribute "chain"
>> contains null, but Lucene does not index the null values.
>>
>> How can I achieve this? Or please guide me to an alternative way of doing this.
>>
>> Thanks in advance.
>> -Amar

--
Amar Sannaik | Programmer | ATHARVA LIBSON Software Pvt Ltd., # 9886476270, amarsann...@atharvalibson.com
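A sketch of the FieldBridge route discussed in this thread, assuming the Hibernate Search 3.1 FieldBridge signature (the marker string and class name are illustrative, and the exact `set` signature differs across Hibernate Search versions): unlike the built-in bridges, a custom FieldBridge is invoked even when the property value is null, so it can write a sentinel token.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.hibernate.search.bridge.FieldBridge;
import org.hibernate.search.bridge.LuceneOptions;

public class NullSafeChainBridge implements FieldBridge {
    public static final String NULL_MARKER = "nullnulnullnull";

    // Called for every Recipe, including those whose chain is null.
    public void set(String name, Object value, Document document,
                    LuceneOptions luceneOptions) {
        String text = (value == null) ? NULL_MARKER : ((Chain) value).getName();
        // UN_TOKENIZED so the sentinel survives analysis unchanged.
        document.add(new Field(name, text, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
}
```

On the entity side, this would be wired up with something like `@Field(name = "chainName") @FieldBridge(impl = NullSafeChainBridge.class)` on the `chain` property of Recipe, after which `chainName:nullnulnullnull` matches recipes with no chain - without polluting the persistent model with a dummy Chain instance.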