Re: Data Import Handler Question
Thanks Shawn. In your opinion, which do you think is easier: writing the importer from scratch, or extending the DIH (for example, adding the state tracking etc.)?

Yuval

On Thu, Apr 24, 2014 at 6:47 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/24/2014 9:24 AM, Yuval Dotan wrote:

I want to use the DIH component in order to import data from an old PostgreSQL DB. I want to be able to recover from errors and crashes. If an error occurs I should be able to restart and continue indexing from where it stopped. Is the DIH good enough for my requirements? If not, is it possible to extend one of its classes in order to support the recovery?

The entity in the Dataimport Handler (DIH) config has an onError attribute.

http://wiki.apache.org/solr/DataImportHandler#Schema_for_the_data_config
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-EntityProcessors

But honestly, if you want a really robust Java program that indexes to Solr and does precisely what you want, you may be better off writing it yourself using SolrJ and JDBC. DIH is powerful and efficient, but when you write the program yourself, you can do anything you want with your data. You also have the possibility of resuming an import after a Solr crash. Because DIH is embedded in Solr and doesn't save any kind of state data about an import in progress, that's pretty much impossible with DIH. With a SolrJ program, you'd have to handle that yourself, but it would be *possible*.

https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

Thanks,
Shawn
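Shawn's point about resumability comes down to persisting import state yourself. As a rough sketch (all class, method, and file names here are hypothetical, not SolrJ or DIH API), a SolrJ+JDBC importer could write the last successfully committed row id to a small state file after each batch, and read it back on startup to resume with a `WHERE id > ?` query:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Minimal sketch of the resumable-import state DIH lacks: persist the last
 * successfully indexed row id so a restarted importer can continue from it
 * instead of starting over. Names are illustrative, not a real API.
 */
public class ImportCheckpoint {
    private final Path file;

    public ImportCheckpoint(Path file) {
        this.file = file;
    }

    /** Returns the last committed id, or 0 if no checkpoint exists yet. */
    public long load() throws IOException {
        if (!Files.exists(file)) return 0L;
        return Long.parseLong(Files.readString(file).trim());
    }

    /** Record progress after each committed batch, via an atomic rename. */
    public void save(long lastId) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.writeString(tmp, Long.toString(lastId));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }
}
```

On a clean start `load()` returns 0 and the import begins from the first row; after a crash it returns the last committed id, so already-indexed rows are skipped.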
Data Import Handler Question
Hi,

I want to use the DIH component in order to import data from an old PostgreSQL DB. I want to be able to recover from errors and crashes. If an error occurs I should be able to restart and continue indexing from where it stopped. Is the DIH good enough for my requirements? If not, is it possible to extend one of its classes in order to support the recovery?

Thanks,
Yuval
Re: distributed search is significantly slower than direct search
Hi,

Thanks very much for your answers :) Manuel, if you have a patch I will be glad to test its performance.

Yuval

On Mon, Nov 18, 2013 at 10:49 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Manuel, that sounds very interesting. Would you be willing to contribute this back to the community?

On Mon, Nov 18, 2013 at 9:53 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

In order to accelerate the BinaryResponseWriter.write, we extended this writer class to implement the docid to id transformation by docValues (in memory), with no need to access stored fields for id reading, nor lazy loading of fields, which also has a cost. That should improve the read rate, as docValues are sequential and should avoid disk IO. This docValues implementation is accessed during both query stages (as mentioned above) in case you ask for ids only, or only once, during the distributed search stage, in case you intend to ask for stored fields other than id. We just started testing it for performance. I would love to hear any opinions or performance tests for this implementation.

Manu

--
Regards,
Shalin Shekhar Mangar.
Re: distributed search is significantly slower than direct search
Hi,

I isolated the case.

Installed on a new machine (2 x Xeon E5410 2.33GHz). I have an environment with 12 GB of memory. I assigned 6 GB of memory to Solr, and I'm not running any other memory-consuming process, so no memory issues should arise.

Removed all indexes apart from two:
emptyCore – empty – used for routing
core1 – holds the stored data – has ~750,000 docs and size of 400 MB

Again, this is a single machine that holds both indexes.

The query
http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json
QTime takes ~3 seconds, and the direct query
http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json
QTime takes ~15 ms - orders of magnitude difference. I ran the long query several times and got an improvement of about a sec (33%), but that's it.

I need to better understand why this is happening. I tried looking at Solr code and debugging the issue, but with no success. The one thing I did notice is that the getFirstMatch method, which receives the doc id, searches the term dict, and returns the internal id, takes most of the time for some reason.

I am pretty stuck and would appreciate any ideas. My only solution for the moment is to bypass the distributed query and implement code in my own app that directly queries the relevant cores and handles the sorting etc..

Thanks

On Sat, Nov 16, 2013 at 2:39 PM, Michael Sokolov msoko...@safaribooksonline.com wrote:

Did you say what the memory profile of your machine is? How much memory, and how large are the shards? This is just a random guess, but it might be that if you are memory-constrained, there is a lot of thrashing caused by paging (swapping?) in and out the sharded indexes, while a single index can be scanned linearly, even if it does need to be paged in.

-Mike

On 11/14/2013 8:10 AM, Elran Dvir wrote:

Hi,

We tried returning just the id field and got exactly the same performance. Our system is distributed, but all shards are in a single machine, so network issues are not a factor.
The code we found where Solr is spending its time is on the shard and not on the routing core; again, all shards are local. We investigated the getFirstMatch() method and noticed that MultiTermEnum.reset (inside MultiTerm.iterator) and MultiTerm.seekExact take 99% of the time. Inside these methods, the call to BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock takes most of the time. Out of the 7-second run, these methods take ~5 seconds and BinaryResponseWriter.write takes the rest (~2 seconds). We tried increasing cache sizes and got hits, but it only improved the query time by a second (~6), so no major effect. We are not indexing during our tests; the performance is similar. (How do we measure doc size? Is it important, given that the performance is the same when returning only the id field?) We still don't completely understand why the query takes this much longer although the cores are on the same machine. Is there a way to improve the performance (code, configuration, query)?

-----Original Message-----
From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel Le Normand
Sent: Thursday, November 14, 2013 1:30 AM
To: solr-user@lucene.apache.org
Subject: Re: distributed search is significantly slower than direct search

It's surprising such a query takes a long time. I would assume that after trying consistently q=*:* you should be getting cache hits and times should be faster. Try to see in the admin UI how your query/doc caches perform. Moreover, the query in itself is just asking for the first 5000 docs that were indexed (returning the first [docid]), so it seems all this time is wasted on transfer. Out of these 7 secs, how much is spent on the above method? What do you return by default? How big is every doc you display in your results? Might it be the matter that both collections work on the same resources? Try elaborating your use-case.
Anyway, it seems like you just made a test to see what the performance hit would be in a distributed environment, so I'll try to explain some things we encountered in our benchmarks, with a case that at least fetches a similar number of docs. We reclaim 2000 docs every query, running over 40 shards. This means every shard is actually transferring to our frontend 2000 docs for every document-match request (the first you were referring to). Even if lazily loaded, reading 2000 ids (on 40 servers) and lazy-loading the fields is a tough job. Waiting for the slowest shard to respond, then sorting the docs and reloading (lazy or not) the top 2000 docs might take a long time. Our times are 4-8 secs, but it's not possible to compare cases directly. We've done a few steps that improved it along the way, steps that led to others. These were our starters:

1. Profile these queries from different servers and solr instances, try putting your finger on which collection is working hard and why. Check if you're
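Yuval's fallback a couple of messages up - querying each core directly and handling the sorting in his own app - is essentially a k-way merge over per-core result lists that each arrive already sorted. A minimal stdlib sketch of that merge step (types simplified to bare scores; real code would carry doc ids and fields alongside):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Sketch of client-side merging for the "bypass distributed search" idea:
 * each core returns its own descending-sorted results, and the application
 * merges them and keeps the global top N.
 */
public class ResultMerger {

    /** Merge n descending-sorted score lists, keeping the top 'rows'. */
    public static List<Double> mergeTop(List<List<Double>> perCore, int rows) {
        // Heap of {coreIndex, position}, ordered by the current score, descending.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Double.compare(perCore.get(b[0]).get(b[1]),
                                     perCore.get(a[0]).get(a[1])));
        for (int i = 0; i < perCore.size(); i++) {
            if (!perCore.get(i).isEmpty()) heads.add(new int[]{i, 0});
        }
        List<Double> out = new ArrayList<>();
        while (!heads.isEmpty() && out.size() < rows) {
            int[] top = heads.poll();
            out.add(perCore.get(top[0]).get(top[1]));
            if (top[1] + 1 < perCore.get(top[0]).size()) {
                heads.add(new int[]{top[0], top[1] + 1}); // advance that core
            }
        }
        return out;
    }
}
```

This avoids the second distributed retrieval phase entirely, at the cost of the application issuing one request per core and doing the merge itself.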
Re: distributed search is significantly slower than direct search
Hi Tomás,

This is just a test environment meant only to reproduce the issue I am currently investigating. The number of documents should grow substantially (billions of docs).

On Sun, Nov 17, 2013 at 7:12 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

Hi Yuval, quick question. You say that your core has 750k docs and around 400 MB? Is this some kind of test dataset and you expect it to grow significantly? For an index of this size, I wouldn't use distributed search; a single shard should be fine.

Tomás

On Sun, Nov 17, 2013 at 6:50 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi,

I isolated the case. Installed on a new machine (2 x Xeon E5410 2.33GHz). I have an environment with 12 GB of memory. I assigned 6 GB of memory to Solr, and I'm not running any other memory-consuming process, so no memory issues should arise. Removed all indexes apart from two: emptyCore – empty – used for routing; core1 – holds the stored data – has ~750,000 docs and size of 400 MB. Again, this is a single machine that holds both indexes. The query http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json QTime takes ~3 seconds, and the direct query http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json QTime takes ~15 ms - orders of magnitude difference. I ran the long query several times and got an improvement of about a sec (33%), but that's it. I need to better understand why this is happening. I tried looking at Solr code and debugging the issue, but with no success. The one thing I did notice is that the getFirstMatch method, which receives the doc id, searches the term dict and returns the internal id, takes most of the time for some reason. I am pretty stuck and would appreciate any ideas. My only solution for the moment is to bypass the distributed query and implement code in my own app that directly queries the relevant cores and handles the sorting etc..
Thanks
Re: Performance improvement for solr faceting on large index
You could always try the fc facet method, and maybe increase the filterCache size.

On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal pravin_agra...@persistent.co.in wrote:

Hi All,

We are using solr 3.4 with the following schema fields.

schema.xml---

<fieldType name="autosuggest_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$" replacement="" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="id" type="string" stored="true" indexed="true"/>
<field name="autoSuggestContent" type="autosuggest_text" stored="true" indexed="true" multiValued="true"/>
<copyField source="content" dest="autoSuggestContent"/>
<copyField source="original_title" dest="autoSuggestContent"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="original_title" type="text" stored="true" indexed="true"/>
<field name="site" type="site" stored="false" indexed="true"/>

/schema.xml---

The index on the above schema is distributed on two solr shards, each with an index of about 1.2 million documents and about 195 GB on disk per shard.

We want to retrieve (site, autoSuggestContent term, frequency of the term) information from our main solr index. The site is a field in the document and contains the name of the site to which that document belongs. The terms are retrieved from the multivalued field autoSuggestContent, which is created using shingles from the content and title of the web page. As of now, we are using a facet query to retrieve (term, frequency of term) for each site.
Below is a sample query (you may ignore the initial part of the query):

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index

The problem is that with the increase in index size, this method has started taking a huge amount of time. It used to take 7 minutes per site with an index size of 0.4 million docs, but takes around 60-90 minutes for an index size of 2.5 million. At this speed, it will take around 5-6 days to process all 1500 sites. Also, we are expecting the index to grow with more documents and more sites, and as such the time to get the above information will increase further.

Please let us know if there is any better way to extract (site, term, frequency) information compared to the current method.

Thanks,
Pravin Agrawal

DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
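For context on what that facet is counting: the ShingleFilterFactory above emits every word n-gram up to maxShingleSize alongside the unigrams, so the autoSuggestContent field holds far more unique terms than the raw text. A toy stdlib simulation of shingle generation and per-term counting (not the Lucene filter itself, and ignoring the rest of the analysis chain):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy illustration of what the autosuggest_text field stores: all word
 * n-grams up to maxShingleSize, unigrams included, which the facet query
 * then counts per site. Not Lucene's ShingleFilter; whitespace split only.
 */
public class Shingles {

    public static Map<String, Integer> frequencies(String text, int maxShingleSize) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            StringBuilder shingle = new StringBuilder();
            // Grow the shingle one token at a time, counting each size.
            for (int n = 0; n < maxShingleSize && i + n < tokens.length; n++) {
                if (n > 0) shingle.append(' ');
                shingle.append(tokens[i + n]);
                freq.merge(shingle.toString(), 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

The combinatorial growth in distinct shingles (rather than document count alone) helps explain why facet.method=enum, which walks the term dictionary, degrades so sharply as this index grows.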
Re: Questions about query times
OK, so I solved the question about the query that returns no results and still takes time - I needed to add the facet.mincount=1 parameter, and this reduced the time to 200-300 ms instead of seconds.

I still couldn't figure out why a query that returns very few results (like query number 2) still takes seconds to return, even with the facet.mincount=1 parameter. I couldn't understand why the facet pivot takes so much time on 299 docs. Does anyone have any idea?

Example Query: (2)
q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 299
Times(ms): Qtime: 2,756 Query: 307 Facet: 2,449

On Thu, Sep 20, 2012 at 5:24 PM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi,

We have a system that inserts logs continuously (real-time). We have been using the Solr facet pivot feature for querying and have been experiencing slow query times, and we were hoping to gain some insights with your help. The schema and solrconfig are attached.

Here are our questions (data below):
1. Why is facet time so long in (3) and (5) - in cases where there are 0 or very few results?
2. We ran two queries that only differ in the time limit (for the second query the time range is very small) - we got the same time for both queries although the second one returned very few results - again, why is that?
3. Is there a way to improve pivot facet time?
System Data:
Index size: 63 GB
RAM: 4 GB
CPU: 2 x Xeon E5410 2.33GHz
Num of Documents: 109,278,476

query examples:
-
(1) Query:
q=*:*&fq=(trimTime:[2012-09-04T14:29:24Z TO *])&fq=(trimTime:[2012-09-04T14:29:24Z TO *])&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 11,407,889
Times (ms): Qtime: 3,239 Query: 353 Facet: 2,885
-
(2) Query:
q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 299
Times(ms): Qtime: 2,756 Query: 307 Facet: 2,449
-
(3) Query:
q=*:*&fq=(trimTime:[2012-09-11T12:55:00Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 7
Times(ms): Qtime: 2,798 Query: 312 Facet: 2,485
-
(4) Query:
q=*:*&fq=(trimTime:[2012-09-04T15:43:16Z TO *])&fq=(trimTime:[2012-09-04T15:43:16Z TO *])&fq=(product:(Application Control)) OR (product:(URL Filtering))&f.appi_name.facet.sort=index&f.appi_name.facet.limit=-1&f.app_risk.facet.sort=index&f.app_risk.facet.limit=-1&f.matched_category.facet.sort=index&f.matched_category.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.appi_name.facet.method=enum&facet.pivot=appi_name,app_risk,matched_category,trimTime
NumFound: more than 30M
Times(ms): Qtime: 23,288
-
(5) Query:
q=*:*&fq=(trimTime:[2012-09-05T06:03:55Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-05T06:03:55Z TO *])&fq=(product:(IPS)) OR (product:(SmartDefense))&fq=(action:(Detect)) OR (action:(mixed))&fq=(Confidence_Level:(Medium
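Query strings this long are easy to get wrong when assembled by hand. A small hypothetical stdlib helper that percent-encodes each value and joins parameters with '&' (the parameter names are the standard Solr ones used in the examples above):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Hypothetical helper for assembling Solr query strings like the pivot-facet
 * examples above. Keys are assumed to already be URL-safe; values get
 * percent-encoded. A LinkedHashMap keeps parameter order stable.
 */
public class SolrParams {

    public static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> e.getKey() + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }
}
```

Note that repeated parameters such as multiple fq clauses would need a multimap or list-of-pairs variant; this sketch covers the single-value case only.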
Re: Partition Question
Thanks Lance.

There is already a clear partition - as you assumed, by date. My requirement is for the best setup for:
1. A *single machine*
2. A quickly changing index - so I need the option to load and unload partitions dynamically

Do you think that the sharding model that Solr offers is the most suitable for this setup? What about the Solr multi-core model?

On Wed, May 9, 2012 at 12:23 AM, Lance Norskog goks...@gmail.com wrote:

Lucene does not support more than 2^31 unique documents in one index, so you need to partition. In Solr this is done with Distributed Search:
http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch

First, you have to decide a policy for which documents go to which 'shard'. It is common to make a hash code as the unique id, then distribute the documents modulo this value. This gives a roughly equal distribution of documents. If there is already a clear partition, like the date of the document (like newspaper articles), you could use that also.

You have new documents and existing documents. For new documents you need code for this policy to get all new documents to the right index. This could be one master program that passes them out, or each indexer could know which documents it gets.

If you want to split up your current index, that's different. I have done this: for each shard, make a copy of the full index, delete-by-query all of the documents that are NOT in that shard, and optimize. We had to do this in sequence, so it took a few days :) You don't need a full optimize. Use 'maxSegments=50' or '100' to suppress that last final giant merge.

On Tue, May 8, 2012 at 12:02 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi,

Can someone please guide me to the right way to partition the solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi All,

Jan, thanks for the reply - answers to your questions are located below. Please update me if you have ideas that can solve my problems.
First, some corrections to my previous mail:

Hi All

We have an index of ~2,000,000,000 documents and the query and facet times are too slow for us - our index in fact will be much larger.

Most of our queries will be limited by time, hence we want to partition the data by date/time - even when unlimited – which is mostly what will happen, we have results in the recent records, and querying the whole dataset is redundant.

We want to partition the data because the index size is too big and doesn't fit into memory (80 GB) - our data actually continuously grows over time; it will never fit into memory, but has to be available for queries in case results are found in older records or a full facet is required.

1. Is multi-core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use these actions when managing my cores? If so, how can I LOAD a core that I have unloaded? For example: I have 7 partitions / cores - one for each day of the week - we might have 2000 per day. In most cases I will search for documents only on the last day's core. Once every 1 queries I need documents from all cores. Question: Do I need to unload all of the old cores and then load them on demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that only cores that are loaded into memory are the ones I want?

Thanks
Yuval

*Answers to Jan:*

Hi,

First you need to investigate WHY faceting and querying is too slow. What exactly do you mean by slow? Can you please tell us more about your setup?

* How large documents and how many fields?
Small records, ~200 bytes, 20 fields avg, most of them not stored - attached schema and config file.

* What kind of queries? How many hits? How many facets? Have you studied debugQuery=true output?
The problem is not with queries being slow per se; it is with getting 50 matches out of billions of matching docs.

* Do you use filter queries (fq) extensively?
User-generated queries; fq would not reduce the dataset for some of our use cases.

* What data do you facet on? Many unique values per field? Text or ranges? What facet.method?
The problem is not just faceting, it's with queries – let's start there.

* What kind of hardware? RAM/CPU
HP DL180G6, 2 x E5645 (12 cores), 48 GB RAM

* How have you configured your JVM? How much memory? GC?
java -Xms512M -Xmx40960M -jar start.jar

As you see, you will have to provide a lot more information on your use case and setup in order for us to judge the correct action to take. You might need to adjust your config, or to optimize your queries or caches, slim your schema, buy some more RAM, or an SSD :) Normally, going multi-core on one box will not necessarily help in itself, as there is overhead in sharding
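Lance's hash-then-modulo routing policy is small enough to state exactly. A sketch (hypothetical helper; String.hashCode is used purely for illustration - a production router would prefer a stable cross-language hash such as MurmurHash), with the date-based alternative from this thread alongside:

```java
import java.time.LocalDate;

/** Two shard-routing policies from this thread, as tiny pure functions. */
public class ShardRouter {

    /**
     * Hash-modulo policy: roughly even spread across shards.
     * floorMod keeps the result non-negative even when hashCode() is negative.
     */
    public static int byHash(String uniqueId, int numShards) {
        return Math.floorMod(uniqueId.hashCode(), numShards);
    }

    /** Date policy, as in the 7-cores-one-per-weekday setup discussed here. */
    public static String byDay(LocalDate date) {
        return "core-" + date.getDayOfWeek().name().toLowerCase();
    }
}
```

The key property either way is determinism: the same document must always route to the same shard, or updates and deletes will miss the copy indexed earlier.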
Re: Partition Question
Hi,

Can someone please guide me to the right way to partition the solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi All,

Jan, thanks for the reply - answers to your questions are located below. Please update me if you have ideas that can solve my problems.

First, some corrections to my previous mail:

Hi All

We have an index of ~2,000,000,000 documents and the query and facet times are too slow for us - our index in fact will be much larger.

Most of our queries will be limited by time, hence we want to partition the data by date/time - even when unlimited – which is mostly what will happen, we have results in the recent records, and querying the whole dataset is redundant.

We want to partition the data because the index size is too big and doesn't fit into memory (80 GB) - our data actually continuously grows over time; it will never fit into memory, but has to be available for queries in case results are found in older records or a full facet is required.

1. Is multi-core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use these actions when managing my cores? If so, how can I LOAD a core that I have unloaded? For example: I have 7 partitions / cores - one for each day of the week - we might have 2000 per day. In most cases I will search for documents only on the last day's core. Once every 1 queries I need documents from all cores. Question: Do I need to unload all of the old cores and then load them on demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that only cores that are loaded into memory are the ones I want?

Thanks
Yuval

*Answers to Jan:*

Hi,

First you need to investigate WHY faceting and querying is too slow. What exactly do you mean by slow? Can you please tell us more about your setup?

* How large documents and how many fields?
Small records, ~200 bytes, 20 fields avg, most of them not stored - attached schema and config file.

* What kind of queries? How many hits? How many facets? Have you studied debugQuery=true output?
The problem is not with queries being slow per se; it is with getting 50 matches out of billions of matching docs.

* Do you use filter queries (fq) extensively?
User-generated queries; fq would not reduce the dataset for some of our use cases.

* What data do you facet on? Many unique values per field? Text or ranges? What facet.method?
The problem is not just faceting, it's with queries – let's start there.

* What kind of hardware? RAM/CPU
HP DL180G6, 2 x E5645 (12 cores), 48 GB RAM

* How have you configured your JVM? How much memory? GC?
java -Xms512M -Xmx40960M -jar start.jar

As you see, you will have to provide a lot more information on your use case and setup in order for us to judge the correct action to take. You might need to adjust your config, or to optimize your queries or caches, slim your schema, buy some more RAM, or an SSD :) Normally, going multi-core on one box will not necessarily help in itself, as there is overhead in sharding multi cores as well. However, it COULD be a solution since you say that most of the time you only need to consider 1/7 of your data. I would perhaps consider one hot core for the last 24h, and one archive core for older data. You could then tune these differently regarding caches etc.

Can you get back with some more details?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
Partition Question
Hi All,

We have an index of ~2,000,000,000 documents and the query and facet times are too slow for us. Before using the shards solution for improving performance, we thought about using the multi-core feature (our goal is to maximize performance for a single machine). Most of our queries will be limited by time, hence we want to partition the data by date/time. We want to partition the data because the index size is too big and doesn't fit into memory (80 GB).

1. Is multi-core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use these actions when managing my cores? If so, how can I LOAD a core that I have unloaded? For example: I have 7 partitions / cores - one for each day of the week. In most cases I will search for documents only on the last day's core. Once every 1 queries I need documents from all cores. Question: Do I need to unload all of the old cores and then load them on demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that only cores that are loaded into memory are the ones I want?

Thanks
Yuval
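On question 2: as far as I know, the CoreAdmin API of this Solr generation has an UNLOAD action but no LOAD; a previously unloaded core is brought back by issuing CREATE again with the same name and its existing instanceDir. A sketch of the URLs involved (host, port, and directory layout are assumptions for illustration):

```java
/**
 * Sketch of driving a day-per-core scheme through the CoreAdmin HTTP API.
 * The base URL and instanceDir layout are hypothetical; the action and
 * parameter names (UNLOAD, CREATE, core, name, instanceDir) are CoreAdmin's.
 */
public class CoreAdminUrls {
    static final String BASE = "http://localhost:8983/solr/admin/cores";

    /** Drop a core from memory; its index files stay on disk. */
    public static String unload(String core) {
        return BASE + "?action=UNLOAD&core=" + core;
    }

    /** Re-register an unloaded core by pointing CREATE at its existing dir. */
    public static String create(String core, String instanceDir) {
        return BASE + "?action=CREATE&name=" + core
                    + "&instanceDir=" + instanceDir;
    }
}
```

An application could issue the CREATE call lazily, just before the rare query that needs all seven day cores, and UNLOAD the old cores again afterwards.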
Re: Java out of memory - with fieldcache faceting
Thanks for the fast answer.

One more question: is there a way to know (some formula) what size of memory I need for these actions?

Thanks
Yuval

On Mon, Apr 30, 2012 at 11:50, Dan Tuffery dan.tuff...@gmail.com wrote:

You need to add more memory to the JVM that is running Solr:
http://wiki.apache.org/solr/SolrPerformanceFactors#OutOfMemoryErrors

Dan

On Mon, Apr 30, 2012 at 9:43 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

Hi Guys,

I have a problem and I need your assistance. I get an exception when doing field cache faceting (the enum method works perfectly):

/solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10

<lst name="error">
<str name="msg">java.lang.OutOfMemoryError: Java heap space</str>
<str name="trace">
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:449)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:277)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
  at org.eclipse.jetty.server.Server.handle(Server.java:351)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
  at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
  at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.OutOfMemoryError: Java heap space
  at org.apache.lucene.util.packed.Direct16.<init>(Direct16.java:38)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:267)
  at org.apache.lucene.util.packed.GrowableWriter.set(GrowableWriter.java:81)
  at org.apache.lucene.search.FieldCacheImpl$DocTermsIndexCache.createValue(FieldCacheImpl.java:1178)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:248)
  at org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1081)
  at org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1077)
  at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:459)
  at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:310)
  at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:396)
  at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:205)
  at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:81)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1541)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119
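On the "some formula" question at the top of this thread: there is no exact published formula, but the structure being built in the trace (FieldCacheImpl's terms index, backed by PackedInts) can be bounded roughly as one packed ordinal per document plus the term bytes themselves. The sketch below is a deliberately crude back-of-envelope estimate under those assumptions, not the real FieldCache accounting (which adds object headers, block rounding, and offset arrays on top):

```java
/**
 * Rough lower-bound estimate for a string FieldCache entry like the one in
 * the OOM trace: numDocs ordinals packed at ceil(log2(uniqueTerms)) bits
 * each, plus the raw term bytes. Real memory use is higher.
 */
public class FieldCacheEstimate {

    public static long approxBytes(long numDocs, long uniqueTerms, long avgTermBytes) {
        // Bits needed to address uniqueTerms distinct ordinals (min 1 bit).
        long bitsPerOrd = 64 - Long.numberOfLeadingZeros(Math.max(1, uniqueTerms - 1));
        long ordArrayBytes = numDocs * bitsPerOrd / 8;  // packed ord per doc
        long termDataBytes = uniqueTerms * avgTermBytes; // the terms themselves
        return ordArrayBytes + termDataBytes;
    }
}
```

For example, 100M docs with 1M unique ~16-byte IP strings comes out around 266 MB for this single field, before any JVM overhead; several such fields cached at once can easily exhaust a modest heap, which matches the error seen here.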