Re: IndexWriter and memory usage
H... Your usage (searching for the old doc and updating it to add new fields) is fine. But: what memory usage do you see if you open a searcher and search for all docs, but don't open an IndexWriter? We need to tease apart the IndexReader vs. IndexWriter memory usage you are seeing. Also, can you post the output of CheckIndex (java org.apache.lucene.index.CheckIndex /path/to/index) on your fully built index? That may give some hints about the expected memory usage of IndexReader (e.g. if the number of unique terms is large). More comments below:

On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross ross_wo...@bmc.com wrote:

Sorry to be so long in getting back on this. The patch you provided has improved the situation, but we are still seeing some memory loss. The following are some images from the heap dump; I'll share what we are seeing now. This first image shows the memory pattern. Our first commit takes place at about 3:54, when the steady trend up takes a drop and a new cycle begins. What we have found is that the 2422 fix has made the memory in the first phase, before the commit, much better (and I'm sure throughout the entire run). But as you can see, after the commit we again begin to lose memory. One piece of info to know about what you are seeing: we have 5 threads pushing data to our Lucene plugin. If we drop down to 1 thread we are much more successful and can actually index all of our data without running out of memory, but at 5 threads it gets into trouble. We still see a trend up in memory usage, but not as severe as with multiple threads. http://tinypic.com/view.php?pic=2w6bf68s=5

Can you post the output of svn diff on the 2.9 code base you're using? I just want to verify that all the issues we've discussed are included in your changes. The fact that 1 thread is fine and 5 threads are not still sounds like a symptom of LUCENE-2283. Also, does that heap usage graph exclude garbage?
Or, alternatively, can you provoke an OOME with a 512 MB heap and then capture the heap dump at that point?

There is another piece of the picture that I think might be coming into play. We have plugged Lucene into a legacy app and are subject to how we can get it to deliver the data we are indexing. In some scenarios (like the one we are having this problem with) we build our documents progressively, adding fields to the document through the process. What you see before the first commit is the legacy system handing us the first field for many documents. Once we have gotten all of field 1 for all documents, we commit that data to the index. Then the system starts feeding us field 2. So we perform a search to see if the document already exists (in the scenario you are seeing, it does), retrieve the original document (we store a document ID), add the new field of data to the existing document, and update the document in the index. After the first commit, the rest of the process is one where a document already exists, so the new field is added and the document is updated. It is in this process that we start rapidly losing memory. The following images show some examples of common areas where memory is being held. http://tinypic.com/view.php?pic=11vkwnbs=5

This looks like normal memory usage of IndexWriter -- these are the recycled buffers used for holding stored fields. However: the net RAM used by this allocation should not exceed your 16 MB IW RAM buffer size -- does it? http://tinypic.com/view.php?pic=abq9fps=5

This one is the byte[] buffer used by CompoundFileReader, opened by IndexReader. It's odd that you have so many of these (if I'm reading this correctly) -- are you certain all opened readers are being closed? How many segments do you have in your index? Or... are there many unique threads doing the searching? E.g., do you create a new thread for every search or update?
http://tinypic.com/view.php?pic=25pskyps=5

This one is also normal memory used by IndexWriter, but as above, the net RAM used by this allocation (summed with the one above) should not exceed your 16 MB IW ram buffer size.

As mentioned, we are subject to how the legacy app feeds us the data, and so this is why we do it this way. We treat this as a real-time system, and at any time the legacy system may send us a field that needs to be added to, or updated in, a document. So we search for the document and, if found, we either add the field or update it if it already exists in the document. So I started to wonder if a clue to this memory loss comes from the fact that we are retrieving an existing document and then adding to it and updating. Now, if we eliminate the updating and simply add each item as a new document (which we did just to test, but which won't be adequate for our running system), then we still see a slight trend upward in memory usage and the
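The retrieve-then-update cycle described above can be sketched with Lucene 2.9's IndexWriter.updateDocument(). This is a minimal sketch, not the poster's actual code: the field names ("docId", "field1", "field2") and values are made up for illustration, and a RAMDirectory stands in for the real index. One caveat it illustrates: a Document fetched from a searcher contains only stored fields, so re-adding it drops any unstored indexed fields -- and, per the comments above, every reader opened for the lookup must be closed.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ProgressiveUpdate {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Pass 1: the legacy system hands us field1 for a document.
        Document doc = new Document();
        doc.add(new Field("docId", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("field1", "first batch of data", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit();

        // Pass 2: look the document up by its stored ID, add field2, and
        // replace the old version atomically with updateDocument().
        IndexSearcher searcher = new IndexSearcher(dir, true);
        TopDocs hits = searcher.search(new TermQuery(new Term("docId", "42")), 1);
        Document stored = searcher.doc(hits.scoreDocs[0].doc);  // stored fields only!
        stored.add(new Field("field2", "second batch of data", Field.Store.YES, Field.Index.ANALYZED));
        writer.updateDocument(new Term("docId", "42"), stored);
        writer.commit();
        searcher.close();  // close every reader you open, or its segments stay pinned
        writer.close();

        // Verify the new field is searchable.
        IndexSearcher check = new IndexSearcher(dir, true);
        TopDocs after = check.search(new TermQuery(new Term("field2", "second")), 1);
        System.out.println("hits for field2: " + after.totalHits);
        check.close();
    }
}
```

updateDocument() is a delete-by-term plus add in one call, which is why the docId field must be indexed NOT_ANALYZED: the delete matches the exact stored term.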
The best way to stop indexing quickly?
Hi all, Imagine a situation: Lucene has started indexing a huge file, and just then the user demands that the application be shut down immediately. What would be the recommended way of doing this, so that the application shuts down within seconds, but with the least possible damage to the index? best regards, Alex
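One common answer (a sketch, not an authoritative recommendation) is IndexWriter.rollback(): it discards everything buffered since the last commit, releases the write lock, and closes the writer quickly, leaving the index exactly as it was at that commit -- so there is no index damage, only lost uncommitted work. The shutdown flag and loop below are invented for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FastShutdown {
    // Set from the UI / signal-handler thread when the user asks to quit.
    static volatile boolean shutdownRequested = false;

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.commit();  // establish a known-good commit point

        for (int i = 0; i < 1000000; i++) {
            if (shutdownRequested) {
                // rollback() drops all docs buffered since the last commit
                // and closes the writer; the on-disk index stays consistent.
                writer.rollback();
                System.out.println("rolled back cleanly at doc " + i);
                return;
            }
            Document doc = new Document();
            doc.add(new Field("body", "chunk " + i, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            if (i == 10) shutdownRequested = true;  // simulate the user's request
        }
        writer.close();
    }
}
```

The trade-off: rollback() is fast but loses everything since the last commit, so if partial progress matters, check the flag at document boundaries and call commit() before closing instead.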
Re: Trace only exactly matching terms!
Hi Anshum, Erick, As you mentioned, I used SnowballAnalyzer for stemming purposes. It worked nicely. Thanks a lot for your guidance. Manjula.

On Fri, May 7, 2010 at 8:27 PM, Erick Erickson erickerick...@gmail.com wrote:

The other approach is to use a stemmer both at index and query time. BTW, it's very easy to make a custom analyzer by chaining together a Tokenizer and as many filters as you need (e.g. PorterStemFilter), essentially composing your analyzer from various pre-built Lucene parts. HTH, Erick

On Fri, May 7, 2010 at 9:07 AM, Anshum ansh...@gmail.com wrote:

Hi Manjula, Yes, Lucene by default would only handle exact term matches unless you use a custom analyzer to expand the index/query. -- Anshum Gupta http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw.

On Fri, May 7, 2010 at 2:22 PM, manjula wijewickrema manjul...@gmail.com wrote:

Hi, I am using Lucene 2.9.1. I downloaded and ran the 'HelloLucene.java' class, modifying the input document and user query in various ways. Once I changed the document sentence to 'Lucene in actions' instead of 'Lucene in action', and gave the query 'action', the program did not show 'Lucene in action' as a hit! What is the reason for this? Why doesn't it treat the word 'actions' as a hit? Does Lucene identify only exactly matching words? Thanks, Manjula
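Erick's chaining approach can be sketched like this for Lucene 2.9: a custom Analyzer that wires a StandardTokenizer through LowerCaseFilter into PorterStemFilter. The class name and sample text are invented for illustration; the point is that with the same analyzer at index and query time, "actions" and "action" both stem to the same term, so either query matches either document.

```java
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Chain: tokenize -> lowercase -> Porter stem.
        return new PorterStemFilter(
                new LowerCaseFilter(
                        new StandardTokenizer(Version.LUCENE_29, reader)));
    }

    public static void main(String[] args) throws Exception {
        TokenStream ts = new StemmingAnalyzer()
                .tokenStream("body", new StringReader("Lucene in actions"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            // "actions" comes out as "action", so it matches the query "action"
            System.out.println(term.term());
        }
    }
}
```

Use the same StemmingAnalyzer instance in both the IndexWriter constructor and the QueryParser constructor; stemming only one side re-creates the original mismatch.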
Class_for_HighFrequencyTerms
Hi, If I index a document (a single document) in Lucene, how can I get the term frequencies (e.g. the first and second highest occurring terms) of that document? Is there any class/method to do that? If anybody knows, please help me. Thanks, Manjula
MatchAllDocsQuery and MatchNoDocsQuery
Hi, Can anybody confirm whether MatchAllDocsQuery can be used as an immutable singleton? By this I mean creating a single instance and sharing it whenever I need it, either on its own or in conjunction with other queries put into a BooleanQuery, to return all documents in a search result. Can the same instance even be reused among different threads?

What would be the best way to implement MatchNoDocsQuery? My initial tests show that a new BooleanQuery() without any additional clauses does the job, but I just wanted to double-check whether this is a reliable assumption. The above questions also apply -- could this be reused among different contexts and threads?

Thanks in advance. Regards, Mindaugas

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: MatchAllDocsQuery and MatchNoDocsQuery
Yes on all counts. Lucene doesn't modify query objects, so they are safe for reuse among multiple threads. -Yonik Apache Lucene Eurocon 2010, 18-21 May 2010 | Prague

2010/5/10 Mindaugas Žakšauskas min...@gmail.com:

Can anybody confirm whether MatchAllDocsQuery can be used as an immutable singleton? [...]
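Both patterns from this thread can be sketched in a few lines against Lucene 2.9. This is an illustrative demo, not library code: the class name, field name, and sample values are invented. The one caution worth adding to Yonik's answer is that BooleanQuery itself is mutable (clauses can be added), so a shared "match none" instance is only safe if nothing ever calls add() on it.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SharedQueries {
    // One shared instance each; Lucene does not mutate queries during search.
    static final Query MATCH_ALL  = new MatchAllDocsQuery();
    static final Query MATCH_NONE = new BooleanQuery();  // no clauses => no hits

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (String s : new String[] {"apple", "banana", "cherry"}) {
            Document doc = new Document();
            doc.add(new Field("f", s, Field.Store.NO, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir, true);
        System.out.println("all:  " + searcher.search(MATCH_ALL, 10).totalHits);
        System.out.println("none: " + searcher.search(MATCH_NONE, 10).totalHits);

        // Shared MATCH_ALL combined with another clause in a *fresh* BooleanQuery:
        BooleanQuery combined = new BooleanQuery();
        combined.add(MATCH_ALL, BooleanClause.Occur.MUST);
        combined.add(new TermQuery(new Term("f", "apple")), BooleanClause.Occur.MUST);
        System.out.println("combined: " + searcher.search(combined, 10).totalHits);
        searcher.close();
    }
}
```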
Re: Class_for_HighFrequencyTerms
Have you looked at TermFreqVector? Best, Erick

On Mon, May 10, 2010 at 8:10 AM, manjula wijewickrema manjul...@gmail.com wrote:

If I index a document (a single document) in Lucene, how can I get the term frequencies (e.g. the first and second highest occurring terms) of that document? [...]
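A minimal sketch of Erick's suggestion, with an invented field name and sample text: term vectors must be enabled at index time via Field.TermVector.YES, otherwise IndexReader.getTermFreqVector() returns null. getTerms() and getTermFrequencies() are parallel arrays, so finding the highest-frequency term is a linear scan.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TopTerms {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // TermVector.YES is required, or getTermFreqVector() returns null.
        doc.add(new Field("body", "lucene lucene lucene index index search",
                Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir, true);
        TermFreqVector tfv = reader.getTermFreqVector(0, "body");
        String[] terms = tfv.getTerms();         // sorted alphabetically
        int[] freqs = tfv.getTermFrequencies();  // parallel to terms[]

        int top = 0;
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " -> " + freqs[i]);
            if (freqs[i] > freqs[top]) top = i;
        }
        System.out.println("most frequent: " + terms[top]);
        reader.close();
    }
}
```

For the second-highest term, extend the scan to track two indices, or copy the pairs into an array and sort by frequency.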
Re: merge results from physically separate hosts
Sorry for the delayed response... Thanks, that's what I thought. In my case, the schema of each index would be slightly different, so I would want to run a PrefixQuery against each index (all fields in each index) using the same query text. Maybe I could take the results from each index and then simply sort based on the ScoreDoc or something to get the most relevant docs. Is there a technical reason why Solr requires the index schema to be the same, or was this simply the design that was chosen? Shaun

On Mon, Apr 26, 2010 at 6:59 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

Solr's distributed search feature is about querying multiple indexes and merging the results. Different indexes, but the same schema. Erik

On Apr 25, 2010, at 6:02 AM, Shaun Senecal wrote:

Is there currently a way to take a query, run it on multiple hosts containing different indexes, then merge the results from each host to present to the user? It looks like Solr can handle multiple hosts serving the same index, but my case requires each index to be different.
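At the Lucene level (as opposed to Solr), one way to sketch the merge Shaun describes is MultiSearcher from Lucene 2.9, which queries several sub-searchers and merges their hits into one relevance-sorted list with normalized doc ids. The field names ("title", "name") and data here are invented to mimic the "slightly different schemas" case: a BooleanQuery with one SHOULD clause per schema-specific field applies the same query text to each index.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MergeAcrossIndexes {
    static RAMDirectory buildIndex(String field, String... values) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (String v : values) {
            Document d = new Document();
            d.add(new Field(field, v, Field.Store.YES, Field.Index.NOT_ANALYZED));
            w.addDocument(d);
        }
        w.close();
        return dir;
    }

    public static void main(String[] args) throws Exception {
        // Two indexes with different "schemas" (different field names).
        RAMDirectory a = buildIndex("title", "lucene in action", "lucene basics");
        RAMDirectory b = buildIndex("name", "lucene scoring notes");

        MultiSearcher ms = new MultiSearcher(new Searchable[] {
                new IndexSearcher(a, true), new IndexSearcher(b, true)});

        // Same query text against each index's own field.
        BooleanQuery q = new BooleanQuery();
        q.add(new PrefixQuery(new Term("title", "lucene")), BooleanClause.Occur.SHOULD);
        q.add(new PrefixQuery(new Term("name", "lucene")), BooleanClause.Occur.SHOULD);

        TopDocs hits = ms.search(q, 10);  // one merged, score-sorted list
        System.out.println("merged hits: " + hits.totalHits);
        ms.close();
    }
}
```

MultiSearcher only works in-process; for physically separate hosts you would still need a remote layer (e.g. each host returning its top-N, merged by score on the client, as Shaun suggests), with the caveat that scores from different indexes are not strictly comparable because their term statistics differ.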