Re: Avoid re indexing
You do not want to add a new shard. First, you want your docs evenly spread; secondly, they are spread using hash ranges, and to add more capacity you spread out those hash ranges using shard splitting. "Adding" a new shard doesn't really make any sense here, unless you go for implicit routing, where you decide for yourself which shard a doc goes into, but it seems too late to make that decision in your case.

Upayavira

On Sun, Aug 2, 2015, at 12:40 AM, Nagasharath wrote:
> Yes, shard splitting will only help in managing large clusters and to
> improve query performance. In my case as index size is fully grown (no
> capacity to hold in the existing shards) across the collection adding a
> new shard will help and for which I have to re index.
>
> > On 01-Aug-2015, at 6:34 pm, Upayavira wrote:
> >
> > Erm, that doesn't seem to make sense. Seems like you are talking about
> > *merging* shards.
> >
> > Say you had two shards, 3m docs each:
> >
> > shard1: 3m docs
> > shard2: 3m docs
> >
> > If you split shard1, you would have:
> >
> > shard1_0: 1.5m docs
> > shard1_1: 1.5m docs
> > shard2: 3m docs
> >
> > You could, of course, then split shard2. You could also split shard1
> > into three parts instead, if you preferred:
> >
> > shard1_0: 1m docs
> > shard1_1: 1m docs
> > shard1_2: 1m docs
> > shard2: 3m docs
> >
> > Upayavira
> >
> > > On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
> > > If my current shard is holding 3 million documents will the new subshard
> > > after splitting also be able to hold 3 million documents?
> > > If that is the case After shard splitting the sub shards should hold 6
> > > million documents if a shard is split in to two. Am I right?
> > >
> > > > On 01-Aug-2015, at 5:43 pm, Upayavira wrote:
> > > >
> > > > > On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> > > > > I am using solrj to index documents
> > > > >
> > > > > i agree with you regarding the index update but i should not see any
> > > > > deleted documents as it is a fresh index. Can we actually identify
> > > > > what are those deleted documents?
> > > >
> > > > If you post doc 1234, then you post doc 1234 a second time, you will
> > > > see a deletion in your index. If you don't want deletions to show in
> > > > your index, be sure NEVER to update a document, only add new ones
> > > > with absolutely distinct document IDs.
> > > >
> > > > You cannot see (via Solr) which docs are deleted. You could, I
> > > > suppose, introspect the Lucene index, but that would most definitely
> > > > be an expert task.
> > > >
> > > > > if there is no option of adding shards to existing collection i do
> > > > > not like the idea of re indexing the whole data (worth hours) and
> > > > > we have gone with good number of shards but there is a rapid
> > > > > increase of size in data over the past few days, do you think is it
> > > > > worth logging a ticket?
> > > >
> > > > You can split a shard. See the collections API:
> > > >
> > > > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> > > >
> > > > What would you want to log a ticket for? I'm not sure that there's
> > > > anything that would require that.
> > > >
> > > > Upayavira
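The hash-range mechanics described in the reply above can be sketched as follows. This is a toy model, not Solr code: Solr's compositeId router actually hashes ids with MurmurHash3 into the signed 32-bit space, and the function names here are illustrative stand-ins.

```python
# Toy model of hash-range routing and shard splitting.
# Solr uses MurmurHash3; CRC32 is a stand-in that also yields a
# stable 32-bit hash, which is all the sketch needs.
import zlib

def doc_hash(doc_id: str) -> int:
    # Stable hash of the document id into 0 .. 2**32 - 1.
    return zlib.crc32(doc_id.encode("utf-8"))

def route(doc_id: str, ranges: dict) -> str:
    # Every doc lands in whichever shard owns its hash.
    h = doc_hash(doc_id)
    for shard, (lo, hi) in ranges.items():
        if lo <= h <= hi:
            return shard
    raise ValueError("hash ranges must cover the whole space")

def split(ranges: dict, shard: str) -> dict:
    # Splitting a shard divides its hash range in two. The full hash
    # space stays covered, so no document needs re-indexing; this is
    # why you split rather than "add" a shard.
    lo, hi = ranges.pop(shard)
    mid = (lo + hi) // 2
    ranges[shard + "_0"] = (lo, mid)
    ranges[shard + "_1"] = (mid + 1, hi)
    return ranges

# Two shards covering the full 32-bit hash space.
ranges = {"shard1": (0, 2**31 - 1), "shard2": (2**31, 2**32 - 1)}
ranges = split(ranges, "shard1")
```

A doc that routed to shard1 before the split now routes to shard1_0 or shard1_1; a freshly "added" shard, by contrast, would own no hash range, so nothing would ever route to it.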
Re: Avoid re indexing
Yes, shard splitting will only help in managing large clusters and in improving query performance. In my case the index has grown to full size (there is no capacity left in the existing shards) across the collection, so adding a new shard would help, and for that I have to re-index.

> On 01-Aug-2015, at 6:34 pm, Upayavira wrote:
>
> Erm, that doesn't seem to make sense. Seems like you are talking about
> *merging* shards.
>
> Say you had two shards, 3m docs each:
>
> shard1: 3m docs
> shard2: 3m docs
>
> If you split shard1, you would have:
>
> shard1_0: 1.5m docs
> shard1_1: 1.5m docs
> shard2: 3m docs
>
> You could, of course, then split shard2. You could also split shard1
> into three parts instead, if you preferred:
>
> shard1_0: 1m docs
> shard1_1: 1m docs
> shard1_2: 1m docs
> shard2: 3m docs
>
> Upayavira
>
>> On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
>> If my current shard is holding 3 million documents will the new subshard
>> after splitting also be able to hold 3 million documents?
>> If that is the case After shard splitting the sub shards should hold 6
>> million documents if a shard is split in to two. Am I right?
>>
>>> On 01-Aug-2015, at 5:43 pm, Upayavira wrote:
>>>
>>>> On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
>>>> I am using solrj to index documents
>>>>
>>>> i agree with you regarding the index update but i should not see any
>>>> deleted documents as it is a fresh index. Can we actually identify what
>>>> are those deleted documents?
>>>
>>> If you post doc 1234, then you post doc 1234 a second time, you will see
>>> a deletion in your index. If you don't want deletions to show in your
>>> index, be sure NEVER to update a document, only add new ones with
>>> absolutely distinct document IDs.
>>>
>>> You cannot see (via Solr) which docs are deleted. You could, I suppose,
>>> introspect the Lucene index, but that would most definitely be an expert
>>> task.
>>>
>>>> if there is no option of adding shards to existing collection i do not
>>>> like the idea of re indexing the whole data (worth hours) and we have
>>>> gone with good number of shards but there is a rapid increase of size
>>>> in data over the past few days, do you think is it worth logging a
>>>> ticket?
>>>
>>> You can split a shard. See the collections API:
>>>
>>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
>>>
>>> What would you want to log a ticket for? I'm not sure that there's
>>> anything that would require that.
>>>
>>> Upayavira
Re: Avoid re indexing
Erm, that doesn't seem to make sense. Seems like you are talking about *merging* shards.

Say you had two shards, 3m docs each:

shard1: 3m docs
shard2: 3m docs

If you split shard1, you would have:

shard1_0: 1.5m docs
shard1_1: 1.5m docs
shard2: 3m docs

You could, of course, then split shard2. You could also split shard1 into three parts instead, if you preferred:

shard1_0: 1m docs
shard1_1: 1m docs
shard1_2: 1m docs
shard2: 3m docs

Upayavira

On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
> If my current shard is holding 3 million documents will the new subshard
> after splitting also be able to hold 3 million documents?
> If that is the case After shard splitting the sub shards should hold 6
> million documents if a shard is split in to two. Am I right?
>
> > On 01-Aug-2015, at 5:43 pm, Upayavira wrote:
> >
> > > On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> > > I am using solrj to index documents
> > >
> > > i agree with you regarding the index update but i should not see any
> > > deleted documents as it is a fresh index. Can we actually identify what
> > > are those deleted documents?
> >
> > If you post doc 1234, then you post doc 1234 a second time, you will see
> > a deletion in your index. If you don't want deletions to show in your
> > index, be sure NEVER to update a document, only add new ones with
> > absolutely distinct document IDs.
> >
> > You cannot see (via Solr) which docs are deleted. You could, I suppose,
> > introspect the Lucene index, but that would most definitely be an expert
> > task.
> >
> > > if there is no option of adding shards to existing collection i do not
> > > like the idea of re indexing the whole data (worth hours) and we have
> > > gone with good number of shards but there is a rapid increase of size
> > > in data over the past few days, do you think is it worth logging a
> > > ticket?
> >
> > You can split a shard. See the collections API:
> >
> > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> >
> > What would you want to log a ticket for? I'm not sure that there's
> > anything that would require that.
> >
> > Upayavira
Re: Avoid re indexing
If my current shard is holding 3 million documents, will the new sub-shards after splitting also be able to hold 3 million documents each? If that is the case, after shard splitting the sub-shards should hold 6 million documents when a shard is split in two. Am I right?

> On 01-Aug-2015, at 5:43 pm, Upayavira wrote:
>
>> On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
>> I am using solrj to index documents
>>
>> i agree with you regarding the index update but i should not see any
>> deleted documents as it is a fresh index. Can we actually identify what
>> are those deleted documents?
>
> If you post doc 1234, then you post doc 1234 a second time, you will see
> a deletion in your index. If you don't want deletions to show in your
> index, be sure NEVER to update a document, only add new ones with
> absolutely distinct document IDs.
>
> You cannot see (via Solr) which docs are deleted. You could, I suppose,
> introspect the Lucene index, but that would most definitely be an expert
> task.
>
>> if there is no option of adding shards to existing collection i do not
>> like the idea of re indexing the whole data (worth hours) and we have
>> gone with good number of shards but there is a rapid increase of size in
>> data over the past few days, do you think is it worth logging a ticket?
>
> You can split a shard. See the collections API:
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
>
> What would you want to log a ticket for? I'm not sure that there's
> anything that would require that.
>
> Upayavira
Re: Avoid re indexing
On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> I am using solrj to index documents
>
> i agree with you regarding the index update but i should not see any
> deleted documents as it is a fresh index. Can we actually identify what
> are those deleted documents?

If you post doc 1234, then you post doc 1234 a second time, you will see a deletion in your index. If you don't want deletions to show in your index, be sure NEVER to update a document, only add new ones with absolutely distinct document IDs.

You cannot see (via Solr) which docs are deleted. You could, I suppose, introspect the Lucene index, but that would most definitely be an expert task.

> if there is no option of adding shards to existing collection i do not
> like the idea of re indexing the whole data (worth hours) and we have
> gone with good number of shards but there is a rapid increase of size in
> data over the past few days, do you think is it worth logging a ticket?

You can split a shard. See the collections API:

https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3

What would you want to log a ticket for? I'm not sure that there's anything that would require that.

Upayavira
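The "post doc 1234 twice and you will see a deletion" behaviour above can be illustrated with a toy model. This is not Solr or Lucene code; the class and field names are invented for the sketch. It mirrors the real accounting, where an update leaves a tombstone behind until segments are merged, and the deleted count is the gap between maxDoc and numDocs.

```python
# Toy model of why re-adding an existing id shows up as a deleted doc:
# the index never updates in place, so an update is a delete plus an
# insert, and the old version lingers as a tombstone until a merge.
class ToyIndex:
    def __init__(self):
        self.live = {}      # id -> current version of the doc
        self.deleted = 0    # tombstones not yet merged away

    def add(self, doc_id, doc):
        if doc_id in self.live:
            self.deleted += 1   # old version becomes a tombstone
        self.live[doc_id] = doc

    @property
    def num_docs(self):
        # Live (searchable) documents, like Solr's numDocs.
        return len(self.live)

    @property
    def max_doc(self):
        # Live plus deleted-but-unmerged, like Solr's maxDoc.
        return len(self.live) + self.deleted

idx = ToyIndex()
idx.add("1234", {"title": "first version"})
idx.add("1234", {"title": "second version"})  # same id: delete + insert
idx.add("5678", {"title": "another doc"})
```

Here num_docs is 2 while max_doc is 3: the difference of 1 is the deleted count you see in the admin stats, even on a "fresh" index, whenever an id was posted twice.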
Re: Avoid re indexing
I am using SolrJ to index documents.

I agree with you regarding the index update, but I should not see any deleted documents, as it is a fresh index. Can we actually identify which those deleted documents are?

If there is no option of adding shards to an existing collection, I do not like the idea of re-indexing the whole data (worth hours). We went with a good number of shards, but there has been a rapid increase in data size over the past few days. Do you think it is worth logging a ticket?

On Sat, Aug 1, 2015 at 5:04 PM, Upayavira wrote:
>
> On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote:
> > I have an exception with one of the document after indexing 6 mil
> > documents out of 10 mil, is there any way i can avoid re indexing the 6
> > mil documents?
>
> How are you indexing your documents? Are you using the DIH? Personally,
> I'd recommend you write your own app to push your content to Solr, then
> you will be able to control exceptions more precisely and have the
> behaviour you expect.
>
> > I also see that there are few documents that are deleted (based on the
> > count) while indexing, is there a way to identify what are those
> > documents?
>
> If you see deleted documents but are not actually deleting any, this
> will be because you have updated documents with an existing ID. An
> update is actually a delete followed by an insert.
>
> > can i add shard to a collection without re indexing?
>
> You cannot just add a new shard to an existing collection (at least, one
> that is using the compositeId router, the default). If a shard is too
> large, you will need to split an existing shard, which you can do with
> the collections API.
>
> It is much better though, to start with the right number of shards if at
> all possible.
>
> Upayavira
Re: Avoid re indexing
On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote:
> I have an exception with one of the document after indexing 6 mil
> documents out of 10 mil, is there any way i can avoid re indexing the 6
> mil documents?

How are you indexing your documents? Are you using the DIH? Personally, I'd recommend you write your own app to push your content to Solr; then you will be able to control exceptions more precisely and have the behaviour you expect.

> I also see that there are few documents that are deleted (based on the
> count) while indexing, is there a way to identify what are those
> documents?

If you see deleted documents but are not actually deleting any, this will be because you have updated documents with an existing ID. An update is actually a delete followed by an insert.

> can i add shard to a collection without re indexing?

You cannot just add a new shard to an existing collection (at least, not one that is using the compositeId router, the default). If a shard is too large, you will need to split an existing shard, which you can do with the collections API.

It is much better, though, to start with the right number of shards if at all possible.

Upayavira
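The shard split mentioned above is a single HTTP call to the Collections API (SPLITSHARD, per the page linked elsewhere in this thread). The sketch below only builds the request URL rather than sending it; the host, collection, and shard names are placeholders for your own cluster.

```python
# Build the Collections API SPLITSHARD request URL.
# Host, collection, and shard names are placeholders; the call itself
# runs asynchronously on the cluster and needs no re-indexing.
from urllib.parse import urlencode

def splitshard_url(host: str, collection: str, shard: str) -> str:
    params = urlencode({
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
    })
    return f"http://{host}/solr/admin/collections?{params}"

url = splitshard_url("localhost:8983", "mycollection", "shard1")
```

Once the split completes, the sub-shards (shard1_0 and shard1_1) take over shard1's hash range and the parent shard is marked inactive, which is why splitting, rather than adding, is how a compositeId collection grows.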
Avoid re indexing
I have an exception with one of the documents after indexing 6 million documents out of 10 million. Is there any way I can avoid re-indexing those 6 million documents?

I also see that there are a few documents that are deleted (based on the count) while indexing. Is there a way to identify which documents those are?

Can I add a shard to a collection without re-indexing?