IndexDocValues
I came across this type when I read this blog post: http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/ The blog says that IndexDocValues are created as column-stride values indexed specifically for purposes such as sorting, and that they reduce the overhead created by the FieldCache. I could not locate this class in the Lucene 4.7.2 hierarchy. Has it been replaced by the somewhat similar SortedDocValuesField? And are there any benchmarks that show the memory use and sorting time using this field as opposed to sorting on a regular StringField? --- Thanks n Regards, Sandeep Ramesh Khanzode
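For reference, a minimal sketch of how doc-values-based sorting looks in Lucene 4.x (field names and values here are made up for illustration; assumes Lucene 4.7 on the classpath):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.BytesRef;

public class DocValuesSortSketch {
    public static void main(String[] args) {
        // Index time: add the sort key once as an indexed field (for
        // searching) and once as doc values (for sorting). In Lucene 4.x
        // the doc-values field takes the role the blog describes for
        // IndexDocValues.
        Document doc = new Document();
        doc.add(new StringField("city", "Berlin", Field.Store.YES));
        doc.add(new SortedDocValuesField("city", new BytesRef("Berlin")));

        // Search time: a STRING SortField on "city" reads the on-disk
        // doc values instead of un-inverting the field into the
        // FieldCache on the heap.
        Sort byCity = new Sort(new SortField("city", SortField.Type.STRING));
        // searcher.search(query, 10, byCity);
    }
}
```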
About lucene memory consumption
Hi all, I found that the memory consumption of my Lucene server is abnormal: "jmap -histo ${pid}" shows that byte[] instances consume almost all of the memory. Is there a memory leak in my app? Why are there so many byte[] instances? The following is the top of the jmap output:

num  #instances  #bytes      class name
--------------------------------------------------------
1:   1786575     1831556144  [B
2:   704618      80078064    [C
3:   839932      33597280    java.util.LinkedHashMap$Entry
4:   686770      21976640    java.lang.String

Thanks, Best Regards!
Searching on Large Indexes
Hi, I have an index that runs to 200-300GB. It is not frequently updated. What are the best strategies for querying this index?

1.] Should I, at index time, split the content (e.g. a hash-based partition) into multiple separate smaller indexes and aggregate the results programmatically?
2.] Should I replicate this index, assign some sort of document ID, and search each node for a specific range of document IDs?
3.] Is there any way I can split or move individual segments to different nodes and aggregate the results?

I am not fully aware of large-scale query strategies. Can you please share your findings or experiences? Thanks, --- Thanks n Regards, Sandeep Ramesh Khanzode
Re: Searching on Large Indexes
Some points based on my experience. You can consider a SolrCloud deployment if you want to distribute your index over multiple servers. Use MMapDirectory locally for each Solr instance in the cluster. Fire a warm-up query on server start-up, so that most of the documents are cached and you save on disk I/O for subsequent requests. For example, if you have 4 Solr instances with 64GB RAM each, most of your 200GB index will stay in RAM, and this will give you better performance. To take advantage of a multi-core system, you can increase the number of searcher threads, ideally up to the number of cores you have on a single instance.

On Fri, Jun 27, 2014 at 4:03 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote:
> [...]
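The per-node setup described above could be sketched roughly as follows (the index path and warm-up query are placeholders; assumes Lucene 4.x):

```java
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class WarmupSketch {
    public static void main(String[] args) throws Exception {
        // Open the local index with MMapDirectory so the OS page cache,
        // not the JVM heap, holds the hot parts of the index.
        Directory dir = new MMapDirectory(new File("/data/index"));
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Warm-up query at start-up: touches the index files so that
        // subsequent requests are served from the page cache, not disk.
        searcher.search(new MatchAllDocsQuery(), 10);
    }
}
```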
Can Lucene based application be made to work with Scaled Elastic Beanstalk environment on Amazon Web Services
Hi, I have a simple WAR-based web application that uses Lucene-created indexes to provide search results in an XML format. It works fine locally, but I want to deploy it using Elastic Beanstalk (EB) within Amazon Web Services.

Problem 1 is that the WAR format doesn't seem to provide a location for data files (as opposed to config files), so when I deploy the WAR with EB it doesn't work at first because it has no access to the data (the Lucene indexes). However, I solved this by connecting to the underlying EC2 instance, copying the Lucene indexes from S3 to the instance, and ensuring the file location is defined in the WAR's web.xml file.

Problem 2 is more problematic. I'm looking at AWS and EB because I wanted a way to deploy the application with little ongoing admin overhead, and I like the way EB does load balancing and auto scaling for you, starting and stopping additional instances as required to meet demand. However, these automatically started instances will not have access to the index files. Possible solutions could be:

1. Is there a location where I can store the data index within the WAR itself? The index is only 5GB, so I do have space on my root disk to store the indexes in the WAR if there is a way to use them. Tomcat would also need to unwar the file at deployment; I can't see whether Tomcat on AWS does this.
2. A way for EC2 instances to be started with data preloaded in some way. (BTW, I'm aware of CloudSearch but it's not an avenue I want to go down.)

Does anybody have any experience of this, please? Paul

---
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Searching on Large Indexes
On Fri, 2014-06-27 at 12:33 +0200, Sandeep Khanzode wrote:
> I have an index that runs into 200-300GB. It is not frequently updated.

"Not frequently" means different things to different people. Could you give an approximate time span? If it is updated monthly, you might consider a full optimization after each update.

> What are the best strategies to query on this index?
> 1.] Should I, at index time, split the content, like a hash based partition, into multiple separate smaller indexes and aggregate the results programmatically?

Assuming you use multiple machines or independent storage for the multiple indexes, this will bring down latency. Do this if your searches are too slow.

> 2.] Should I replicate this index and provide some sort of document ID, and search on each node for a specific range of document IDs?

I don't really follow that idea. Are your searches only ID-based? Anyway, replication increases throughput. Do this if your servers have trouble keeping up with the full amount of work.

> 3.] Is there any way I can split or move individual segments to different nodes and aggregate the results?

Copy the full index. Delete all documents in copy 1 that match one half of your ID-hash function; do the reverse for the other copy. As your corpus is semi-randomly distributed, scores should be comparable between the indexes, so the result sets can be easily merged. But as Jigar says, you should consider switching to SolrCloud (or Elasticsearch), which does all this for you.

> I am not fully aware of the large scale query strategies. Can you please share your findings or experiences?

Depends on what you mean by large scale. You have a running system - what do you want? Scaling up? Lowering latency? Increasing throughput? More complex queries?

- Toke Eskildsen, State and University Library, Denmark
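The hash-split idea boils down to a deterministic routing function: each document ID maps to exactly one shard, both at index time and again when deleting "the other half" from a copy. A minimal sketch (shard count and IDs are made up):

```java
public class ShardRouter {
    // Route a document ID to one of numShards partitions.
    // floorMod keeps the result non-negative even when hashCode()
    // returns a negative value.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        // The same ID always lands on the same shard, so the index-time
        // split and the later delete-by-hash stay consistent.
        int a = shardFor("doc-42", 2);
        int b = shardFor("doc-42", 2);
        System.out.println(a == b); // prints "true"
    }
}
```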
RE: About lucene memory consumption
Hi, The number of byte[] instances and the total size show that each byte[] is approx. 1024 bytes long. This is exactly the block size used by RAMDirectory for allocated heap blocks. So the important question: do you use RAMDirectory to hold your index? This is not recommended; it is better to use MMapDirectory. RAMDirectory is a class made for testing Lucene, not for production (it does not scale well, is not GC-friendly, and is therefore slow in most cases for large indexes). Also, the index is not persisted to disk. If you want an in-memory index, use a Linux tmpfs filesystem (ramdisk), write your index to it, and use MMapDirectory to access it. To help you further, please give more information on how you use Lucene and its directory implementations.

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: 308181687 [mailto:308181...@qq.com]
Sent: Friday, June 27, 2014 10:42 AM
To: java-user
Subject: About lucene memory consumption

> [...]
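The arithmetic behind this diagnosis can be checked directly from the jmap figures in the question (a quick sanity check, nothing Lucene-specific):

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Figures taken from the "jmap -histo" output in the question.
        long totalBytes = 1831556144L; // total bytes held by byte[] ([B)
        long instances = 1786575L;     // number of byte[] instances

        // Average size per byte[]: just over 1024 bytes, matching the
        // 1KB block size RAMDirectory uses for its internal buffers
        // (the small remainder is per-array object overhead).
        long avg = totalBytes / instances;
        System.out.println(avg); // prints 1025
    }
}
```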
Re:RE: About lucene memory consumption
Hi, Thanks very much for your reply. Because we need near-real-time search, we decided to use NRTCachingDirectory instead of a bare MMapDirectory. The code to create the Directory is as follows:

Directory indexDir = FSDirectory.open(new File(indexDirName));
NRTCachingDirectory cachedFSDir = new NRTCachingDirectory(indexDir, 5.0, 60.0);

But I think that NRTCachingDirectory will only use RAMDirectory for caching and use MMapDirectory to access the index files on disk, right? The `top` command seems to prove this; the VIRT memory of the Lucene server is 28.5G, and the RES memory is only 5G:

PID   USER  PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+      COMMAND
4004  root  20  0   28.5g  5.0g  49m  S  2.0   65.6  140:34.50  java

Now our Lucene server has indexed 2 million emails and provides a near-real-time search service, and sometimes we cannot commit the index because of OutOfMemoryError and have to restart the JVM. By the way, we commit the index for every 1000 email documents. Could you kindly give me some tips to solve this problem? Thanks, Best Regards!

------------------ Original ------------------
From: Uwe Schindler u...@thetaphi.de
Date: Fri, Jun 27, 2014 08:36 PM
To: java-user java-user@lucene.apache.org
Subject: RE: About lucene memory consumption

> [...]
Re:RE: About lucene memory consumption
Could it be that you forgot to close the older IndexReaders after getting a new NRT one? That would be a huge memory leak. I recommend using SearcherManager to handle near-real-time reopening correctly.

Uwe

On 27 June 2014 16:05:19 MESZ, 308181687 308181...@qq.com wrote:
> [...]

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
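The suggested pattern looks roughly like this (a sketch assuming Lucene 4.x; the IndexWriter setup is elided):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

public class NrtSearchSketch {
    private final SearcherManager mgr;

    // One SearcherManager per index: it reopens NRT readers and closes
    // old ones once no in-flight search is using them anymore.
    public NrtSearchSketch(IndexWriter writer) throws IOException {
        this.mgr = new SearcherManager(writer, true, new SearcherFactory());
    }

    public void searchOnce() throws IOException {
        // Per request: acquire/release instead of opening readers yourself.
        IndexSearcher searcher = mgr.acquire();
        try {
            // searcher.search(...);
        } finally {
            mgr.release(searcher); // never close the reader directly
        }
    }

    // Call periodically (e.g. from a refresh thread) to make new
    // writes visible to searches.
    public void refresh() throws IOException {
        mgr.maybeRefresh();
    }
}
```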
Re: Can Lucene based application be made to work with Scaled Elastic Beanstalk environment on Amazon Web Services
I would just use S3 as a data-push mechanism. In your servlet's init(), you could download the index from S3 and unpack it to a local directory, then point your Lucene searcher at that directory. Downloading from S3 to EC2 instances is free, and 5GB would take a minute or two. Also, if you pack the index inside your WAR file, the new instance has to download that data anyway. The big advantage is that this also allows you to update your index without repacking your deployment WAR: just upload the new index to the same location in S3, then restart your webapp :)

Hope this helps,
Tri

On Jun 27, 2014, at 04:13 AM, Paul Taylor paul_t...@fastmail.fm wrote:
> [...]
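Tri's bootstrap idea could be sketched like this (a rough sketch, not a tested implementation: the bucket name, key prefix, and local path are placeholders, and it assumes the AWS SDK for Java v1 and the Servlet API on the classpath, with credentials resolved from the instance environment):

```java
import java.io.File;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import com.amazonaws.services.s3.transfer.TransferManager;

public class SearchServlet extends HttpServlet {
    @Override
    public void init() throws ServletException {
        try {
            // Pull the index from S3 into a local directory at startup,
            // so freshly auto-scaled instances bootstrap themselves.
            TransferManager tm = new TransferManager();
            tm.downloadDirectory("my-index-bucket", "indexes/current",
                                 new File("/tmp/lucene-index"))
              .waitForCompletion();
            tm.shutdownNow();
            // ...then open the Lucene searcher on /tmp/lucene-index.
        } catch (InterruptedException e) {
            throw new ServletException(e);
        }
    }
}
```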