Re: Avoid re-indexing
You do not want to add a new shard. First, you want your docs evenly spread; second, they are spread using hash ranges, and to add more capacity you spread out those hash ranges using shard splitting. Adding a new shard doesn't really make any sense here, unless you go for implicit routing, where you decide for yourself which shard a doc goes into, but it seems too late to make that decision in your case.

Upayavira

On Sun, Aug 2, 2015, at 12:40 AM, Nagasharath wrote:

Yes, shard splitting will only help in managing large clusters and improving query performance. In my case the index is fully grown (no capacity left in the existing shards across the collection), so adding a new shard will help, and for that I have to re-index.

On 01-Aug-2015, at 6:34 pm, Upayavira u...@odoko.co.uk wrote:

Erm, that doesn't seem to make sense. Seems like you are talking about *merging* shards. Say you had two shards, 3m docs each:

shard1: 3m docs
shard2: 3m docs

If you split shard1, you would have:

shard1_0: 1.5m docs
shard1_1: 1.5m docs
shard2: 3m docs

You could, of course, then split shard2. You could also split shard1 into three parts instead, if you preferred:

shard1_0: 1m docs
shard1_1: 1m docs
shard1_2: 1m docs
shard2: 3m docs

Upayavira

On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:

If my current shard is holding 3 million documents, will the new subshard after splitting also be able to hold 3 million documents? If that is the case, then after shard splitting the subshards should together hold 6 million documents when a shard is split in two. Am I right?

On 01-Aug-2015, at 5:43 pm, Upayavira u...@odoko.co.uk wrote:

On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:

I am using SolrJ to index documents. I agree with you regarding the index update, but I should not see any deleted documents as it is a fresh index. Can we actually identify what those deleted documents are?

If you post doc 1234, then you post doc 1234 a second time, you will see a deletion in your index.
If you don't want deletions to show in your index, be sure NEVER to update a document; only add new ones with absolutely distinct document IDs. You cannot see (via Solr) which docs are deleted. You could, I suppose, introspect the Lucene index, but that would most definitely be an expert task.

If there is no option of adding shards to an existing collection, I do not like the idea of re-indexing the whole data set (worth hours). We had gone with a good number of shards, but there has been a rapid increase in data size over the past few days. Do you think it is worth logging a ticket?

You can split a shard. See the collections API:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3

What would you want to log a ticket for? I'm not sure that there's anything that would require that.

Upayavira
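The hash-range routing described above can be sketched in a few lines. This is a simplified illustration, not Solr's actual implementation (SolrCloud's compositeId router hashes ids with MurmurHash3 over the 32-bit hash space; a generic hash stands in here), but the splitting idea is the same: each shard owns a contiguous hash range, and SPLITSHARD divides one shard's range in two, leaving every other shard untouched.

```java
public class HashRangeSketch {
    // A shard owns a contiguous slice of the 32-bit hash space.
    static class Shard {
        final String name;
        final long min, max; // inclusive range within [0, 2^32)
        Shard(String name, long min, long max) { this.name = name; this.min = min; this.max = max; }
        boolean owns(long hash) { return hash >= min && hash <= max; }
        // Splitting replaces one range with its two halves; other shards are untouched.
        Shard[] split() {
            long mid = min + (max - min) / 2;
            return new Shard[] { new Shard(name + "_0", min, mid), new Shard(name + "_1", mid + 1, max) };
        }
    }

    // Stand-in for Solr's MurmurHash3-based router: any stable hash works for the sketch.
    static long hashOf(String docId) {
        return docId.hashCode() & 0xffffffffL; // unsigned 32-bit view
    }

    static String route(String docId, Shard[] shards) {
        long h = hashOf(docId);
        for (Shard s : shards) {
            if (s.owns(h)) return s.name;
        }
        throw new IllegalStateException("uncovered hash: " + h);
    }

    public static void main(String[] args) {
        long top = (1L << 32) - 1;
        Shard shard1 = new Shard("shard1", 0, top / 2);
        Shard shard2 = new Shard("shard2", top / 2 + 1, top);
        Shard[] before = { shard1, shard2 };

        Shard[] halves = shard1.split();
        Shard[] after = { halves[0], halves[1], shard2 };

        // Invariant: a doc routed to shard1 lands in a shard1_* subshard; shard2 docs stay put.
        for (int i = 0; i < 1000; i++) {
            String id = "doc-" + i;
            String b = route(id, before);
            String a = route(id, after);
            if (b.equals("shard1") && !a.startsWith("shard1_")) throw new AssertionError(id);
            if (b.equals("shard2") && !a.equals("shard2")) throw new AssertionError(id);
        }
        System.out.println("every doc stays inside its original hash range after the split");
    }
}
```

This is also why "adding a new shard" makes no sense for hash routing: there is no free hash range for it to own, so only splitting an existing range adds capacity.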
Re: solr multicore vs sharding vs 1 big collection
The document contains around 30 fields, and about 15 of them have stored set to true. These stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that percentage has stayed there and has not come down. I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if that helps with deleted documents or not; it is currently set to 5.

The server has 24 GB of memory, of which consumption is normally around 23 GB, and the JVM is set to 6 GB. I have noticed that the available memory on the server drops to 100 MB at times during the day. All the updates are run through DIH.

Every day, at least once, I see the following error, which results in search errors on the front end of the site:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts; my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking that the high memory usage sometimes leads to bad performance/errors.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy. That is why I was thinking maybe there is an issue with the way things are set up and it needs to be revisited.

Thanks

On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey apa...@elyograg.org wrote:

On 8/1/2015 6:49 PM, Jay Potharaju wrote:

I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing. The data in the collection is an amalgamation of more than 1000 customer records. The number of documents per customer is around 100,000 records on average. That being said, I'm trying to get a handle on the growing deleted-document size.
Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size. I have been thinking of splitting the data into multiple cores, one for each customer. This would allow me to manage the smaller collections easily and to create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem? What are my other alternatives to moving to multiple collections?

Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million
Replication: 4
4 servers in SolrCloud.

Creating 1000+ collections in SolrCloud is definitely problematic. If you need to choose between a lot of shards and a lot of collections, I would definitely go with a lot of shards. I would also want a lot of servers for an index with that many pieces.

https://issues.apache.org/jira/browse/SOLR-7191

I don't think it would matter how many collections or shards you have when it comes to how many deleted documents are in your index. If you want to clean up a large number of deletes in an index, the best option is an optimize. An optimize requires a large amount of disk I/O, so it can be extremely disruptive if the query volume is high. It should be done when the query volume is at its lowest. For the index you describe, a nightly or weekly optimize seems like a good option.

Aside from having a lot of deleted documents in your index, what kind of problems are you trying to solve?

Thanks,
Shawn

--
Thanks
Jay Potharaju
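The deleted-document ratio discussed in this thread can be computed from two numbers Solr already reports per index: maxDoc (live plus deleted documents) and numDocs (live documents only). A quick sketch of the arithmetic, using the figures quoted above:

```java
public class DeletedDocs {
    // maxDoc counts live + deleted documents; numDocs counts live documents only.
    static double deletedPercent(long maxDoc, long numDocs) {
        return 100.0 * (maxDoc - numDocs) / maxDoc;
    }

    public static void main(String[] args) {
        // Thread figures: Max doc 40 million, Doc count 29 million.
        double pct = deletedPercent(40_000_000L, 29_000_000L);
        System.out.printf("%.1f%% deleted%n", pct); // prints "27.5% deleted"
    }
}
```

27.5% is consistent with the "almost 30%" observation earlier in the thread; an optimize rewrites the index with only the 29 million live documents.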
Are Solr releases predictable? Every 2 months?
When is 5.3 coming out? When is SOLR-6273 https://issues.apache.org/jira/browse/SOLR-6273 (Cross Data Center Replication) going to be released? Any way to tell?
Re: solr multicore vs sharding vs 1 big collection
On 8/2/2015 8:29 AM, Jay Potharaju wrote:

The document contains around 30 fields, and about 15 of them have stored set to true. These stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that percentage has stayed there and has not come down. I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if that helps with deleted documents or not; it is currently set to 5. The server has 24 GB of memory, of which consumption is normally around 23 GB, and the JVM is set to 6 GB. I have noticed that the available memory on the server drops to 100 MB at times during the day. All the updates are run through DIH.

Using all available memory is completely normal operation for ANY operating system. If you hold up Windows as an example of one that doesn't ... it lies to you about available memory. All modern operating systems will utilize memory that is not explicitly allocated as the OS disk cache. The disk cache will instantly give up any of the memory it is using to programs that request it. Linux doesn't try to hide the disk cache from you, but older versions of Windows do. In the newer versions of Windows that have the Resource Monitor, you can go there to see the actual memory usage, including the cache.

Every day at least once I see the following error, which results in search errors on the front end of the site:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts; my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking that high memory usage sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific meaning.
It means that the client disconnected before Solr responded to the request, so when Solr tried to respond (through Jetty), it found a closed TCP connection. Client timeouts need to either be completely removed, or set to a value much longer than any request will take. Five minutes is a good starting value.

If your client timeout is set to 30 seconds and you are seeing EofExceptions, that means that your requests are taking longer than 30 seconds, and you likely have some performance issues. It's also possible that some of your client timeouts are set a lot shorter than 30 seconds.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy. That is why I was thinking maybe there is an issue with the way things are set up and it needs to be revisited.

You're right that adding more memory to the servers is not a good scaling strategy for the general case ... but in this situation, I think it might be prudent. For your index and heap sizes, I would want the company to pay for at least 32GB of RAM. Having said that ... I've seen Solr installs work well with a LOT less memory than the ideal. I don't know that adding more memory is necessary, unless your system (CPU, storage, and memory speeds) is particularly slow. Based on your document count and index size, your documents are quite small, so I think your memory size is probably good -- if the CPU, memory bus, and storage are very fast. If one or more of those subsystems aren't fast, then make up the difference with lots of memory.

Some light reading, where you will learn why I think 32GB is an ideal memory size for your system:
https://wiki.apache.org/solr/SolrPerformanceProblems

It is possible that your 6GB heap is not quite big enough for good performance, or that your GC is not well-tuned. These topics are also discussed on that wiki page.
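The 32GB recommendation reduces to back-of-the-envelope arithmetic. A sketch of the rule of thumb from that wiki page (an ideal target, not a hard requirement): total RAM should cover the JVM heap plus enough OS disk cache to hold the whole index.

```java
public class MemorySizing {
    // Rule of thumb: RAM ~ JVM heap + enough OS disk cache to hold the index.
    static double idealRamGb(double heapGb, double indexGb) {
        return heapGb + indexGb;
    }

    public static void main(String[] args) {
        // Thread figures: 6 GB heap, 25 GB index -> 31 GB, hence "at least 32GB of RAM".
        System.out.println(idealRamGb(6, 25) + " GB");
    }
}
```

Solr can run well below this ideal when CPU, memory bus, and storage are fast, as noted above; the rule describes the point where the index is fully cacheable and disk reads mostly disappear.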
If you increase your heap size, then the likelihood of needing more memory in the system becomes greater, because there will be less memory available for the disk cache.

Thanks,
Shawn
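The client-timeout advice above looks like this with plain java.net (a sketch; the endpoint URL is hypothetical, and SolrJ clients have their own timeout configuration):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr query endpoint; openConnection() performs no network I/O yet.
        URL url = new URL("http://localhost:8983/solr/collection1/select?q=*:*");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        conn.setConnectTimeout(15_000);  // fail fast if the server is unreachable
        conn.setReadTimeout(300_000);    // five minutes, per the advice above

        System.out.println(conn.getReadTimeout()); // prints 300000
    }
}
```

With a read timeout well above any realistic query time, an EofException in the Solr log then points at a genuinely dropped connection rather than an impatient client.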
Re: Are Solr releases predictable? Every 2 months?
They are not that predictable. Somebody has to volunteer to be a release manager, and then there is a flurry of cleanups, release candidates, etc. You can see all that on the Lucene dev mailing list. For example, a 5.3 release was proposed (as an idea) on July 30th, but not much has happened since. Still, it must be fairly close. The specific JIRA is so far trunk-only and not backported to 5_x, so it is very unlikely to make 5.3, in my mind.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 2 August 2015 at 06:37, Gili Nachum gilinac...@gmail.com wrote:

When is 5.3 coming out? When is SOLR-6273 https://issues.apache.org/jira/browse/SOLR-6273 (Cross Data Center Replication) going to be released? Any way to tell?
How to use BitDocSet within a PostFilter
Hi everyone,

I'm trying to write a PostFilter for Solr 5.1.0 that crawls through grandchild documents during a search over the parents and filters out documents based on statistics gathered by aggregating the grandchildren together. I've been successful in getting the logic correct, but it does not perform well: I'm grabbing too many documents from the index along the way. I'm trying to filter out grandchild documents which are not relevant to the statistics I'm collecting, in order to reduce the number of document objects pulled from the IndexReader. I've implemented the following in my DelegatingCollector.collect:

    if (inStockSkusBitSet == null) {
        SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // cast from IndexSearcher to expose getDocSet
        inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
        inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // cast from DocSet to expose getBits
        inStockSkusBitSet = inStockSkusBitDocSet.getBits();
    }

My BitDocSet reports a size which matches a standard query for the more limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also reports this same cardinality. Based on that, the getDocSet call itself must be working properly and returning the right number of documents. However, when I try to filter out grandchild documents using either BitDocSet.exists or BitSet.get (skipping any grandchild document which doesn't exist in the BitDocSet or return true from the bit set), I get about 1/3 fewer results than I'm supposed to. Many documents that should match the filter are being excluded, and documents which should not match the filter are being included.
I'm trying to use it in either of these ways:

    if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
    if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

The currentChildDocNumber is simply the doc number passed to DelegatingCollector.collect, decremented until I hit a document that doesn't belong to the parent document. I can't seem to figure out a way to actually use the BitDocSet (or its derivatives) to quickly eliminate document IDs, yet it seems like this is how it's supposed to be used. What am I getting wrong?

Sorry if this is a newbie question; I've never written a PostFilter before, and frankly, the documentation out there is a little sketchy (mostly for version 4). So many classes have changed names, and so many of the better-documented techniques are deprecated or removed now, that it's tough to follow what the current best practice actually is. I'm using the block join functionality heavily, so I'm trying to stay more current than that. I would be happy to send along the full source privately if it would help figure this out, and I plan to write up some more elaborate instructions (updated for Solr 5) for the next person who decides to write a PostFilter and work with block joins, if I ever manage to get this performing well enough.

Thanks for any pointers! Totally open to doing this an entirely different way. I read DocValues might be a more elegant approach, but currently that would require reindexing, so I'm trying to avoid that.

Also, I've been wondering whether the query above would read from the filter cache or not. The query is constructed like this:

    private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
    private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
    ...
    inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
    objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
    inStockSkusQuery = new BooleanQuery();
    inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
    inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);

--
Steve
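One thing worth checking here (an assumption on my part, not something established in the thread): a DocSet obtained from SolrIndexSearcher.getDocSet holds top-level (whole-index) document ids, while a collector's collect(int doc) receives segment-relative ids, so a lookup against the bit set needs the current leaf's docBase added first. Mixing the two id spaces produces exactly this kind of partially-wrong filtering. A self-contained sketch of the offset arithmetic with java.util.BitSet standing in for the FixedBitSet:

```java
import java.util.BitSet;

public class DocBaseSketch {
    public static void main(String[] args) {
        // Pretend index: two segments of 100 docs each; global ids run 0..199.
        int[] docBase = { 0, 100 };
        BitSet globalMatches = new BitSet(200);
        globalMatches.set(150); // a match living in segment 1, local id 50

        int segment = 1, localDoc = 50;

        // Wrong: testing the segment-local id against a global bitset misses the match.
        System.out.println(globalMatches.get(localDoc));                    // prints false
        // Right: offset by the segment's docBase before the lookup.
        System.out.println(globalMatches.get(docBase[segment] + localDoc)); // prints true
    }
}
```

In a real PostFilter the docBase would come from the LeafReaderContext handed to the delegating collector when each segment starts; the hypothetical numbers above only exist to show the offset.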
Collection APIs to create collection and custom cores naming
How do I use 'property.name=value' in the API example [1] to set the core.properties value of 'name'?

While creating the collection with the query below [2], the core names become 'aggregator_shard1_replica1' and 'aggregator_shard2_replica1'. I wanted a specific/custom name for each of these cores. I tried passing the params as property.name=name&name=aggregator_s1, but it did not work. Editing the core.properties key/value pair name=aggregator_s1 after the collection is created works, but I was looking for a way to set this property with the create request itself.

[2] http://example.com:8983/solr/admin/collections?action=CREATE&name=aggregator&numShards=1&replicationFactor=2&maxShardsPerNode=1&collection.configName=aggregator_config&property.name=name&name=aggregator_s1

[1] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
Re: solr multicore vs sharding vs 1 big collection
Shawn,

Thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the timeout is the detrimental effect it would have on the user experience, so I can't increase it. I have looked at the queries that threw errors; the next time I try them, everything seems to work fine, so I am not sure how to reproduce the error. My concern with increasing the memory to 32GB is what happens when the index size grows over the next few months.

One of the other solutions I have been thinking about is to rebuild the index weekly, create a new collection, and use that. Are there any good references for doing that?

Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote:

<snip>

--
Thanks
Jay Potharaju
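The rebuild-then-swap pattern Jay asks about is usually done with collection aliases: queries always go to a fixed alias, a fresh collection is built each week, and the Collections API's CREATEALIAS action atomically repoints the alias once indexing finishes. A sketch of the requests involved (host and collection names below are hypothetical):

```java
public class AliasSwapSketch {
    static String collectionsApi(String host, String params) {
        return host + "/solr/admin/collections?" + params;
    }

    public static void main(String[] args) {
        String host = "http://localhost:8983"; // hypothetical SolrCloud node

        // 1. Build this week's collection alongside the live one.
        String create = collectionsApi(host,
            "action=CREATE&name=products_20150809&numShards=2&replicationFactor=2"
            + "&collection.configName=products_config");
        // 2. Index into products_20150809, then atomically repoint the query alias.
        String swap = collectionsApi(host,
            "action=CREATEALIAS&name=products&collections=products_20150809");
        // 3. Once traffic on the alias is confirmed healthy, drop last week's collection.
        String cleanup = collectionsApi(host, "action=DELETE&name=products_20150802");

        System.out.println(create + "\n" + swap + "\n" + cleanup);
    }
}
```

Because clients only ever see the alias, the swap needs no client changes, and a bad rebuild can be rolled back by pointing the alias at the previous collection before deleting it.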