Re: Avoid re indexing

2015-08-02 Thread Upayavira
You do not want to add a new shard. First, you want your docs evenly
spread; secondly, they are spread using hash ranges, so to add more
capacity you spread out those hash ranges using shard splitting.
"Adding" a new shard doesn't really make any sense here, unless you go
for implicit routing, where you decide for yourself which shard a doc
goes into, but it seems too late to make that decision in your case.
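
If you do go the shard-splitting route, it is a single Collections API
request. Here is a minimal sketch in plain Java (HttpURLConnection, no
SolrJ) against an assumed localhost node; "mycollection" and "shard1"
are placeholder names. SolrJ's CollectionAdminRequest exposes the same
operation if you prefer a typed client.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SplitShardExample {
    public static void main(String[] args) throws Exception {
        // SPLITSHARD divides the shard's hash range in two; the parent
        // shard is marked inactive once the sub-shards are active.
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=SPLITSHARD"
                + "&collection=mycollection"   // placeholder collection name
                + "&shard=shard1"              // shard to split
                + "&wt=json";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");

        // Print the response; on success the new sub-shards appear in the
        // cluster state as shard1_0 and shard1_1.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}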

Upayavira

On Sun, Aug 2, 2015, at 12:40 AM, Nagasharath wrote:
> Yes, shard splitting will only help in managing large clusters and in
> improving query performance. In my case the index has fully grown
> across the collection (there is no capacity left in the existing
> shards), so adding a new shard would help, and for that I would have
> to re-index.
> 
> 
> > On 01-Aug-2015, at 6:34 pm, Upayavira  wrote:
> > 
> > Erm, that doesn't seem to make sense. Seems like you are talking about
> > *merging* shards.
> > 
> > Say you had two shards, 3m docs each:
> > 
> > shard1: 3m docs
> > shard2: 3m docs
> > 
> > If you split shard1, you would have:
> > 
> > shard1_0: 1.5m docs
> > shard1_1: 1.5m docs
> > shard2: 3m docs
> > 
> > You could, of course, then split shard2. You could also split shard1
> > into three parts instead, if you preferred:
> > 
> > shard1_0: 1m docs
> > shard1_1: 1m docs
> > shard1_2: 1m docs
> > shard2: 3m docs
> > 
> > Upayavira
> > 
> >> On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
> >> If my current shard is holding 3 million documents, will the new
> >> subshards after splitting also be able to hold 3 million documents
> >> each? If that is the case, after shard splitting the subshards
> >> should hold 6 million documents between them if a shard is split
> >> into two. Am I right?
> >> 
> >>> On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
> >>> 
> >>> 
> >>> 
>  On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
>  I am using SolrJ to index documents.
>  
>  I agree with you regarding the index update, but I should not see
>  any deleted documents as this is a fresh index. Can we actually
>  identify which documents were deleted?
> >>> 
> >>> If you post doc 1234 and then post doc 1234 a second time, you will see
> >>> a deletion in your index. If you don't want deletions to show in your
> >>> index, be sure NEVER to update a document, only add new ones with
> >>> absolutely distinct document IDs.
> >>> 
> >>> You cannot see (via Solr) which docs are deleted. You could, I suppose,
> >>> introspect the Lucene index, but that would most definitely be an expert
> >>> task.
> >>> 
>  If there is no option of adding shards to an existing collection, I
>  do not like the idea of re-indexing the whole data (worth hours of
>  work). We have gone with a good number of shards, but there has been
>  a rapid increase in data size over the past few days. Do you think it
>  is worth logging a ticket?
> >>> 
> >>> You can split a shard. See the collections API:
> >>> 
> >>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> >>> 
> >>> What would you want to log a ticket for? I'm not sure that there's
> >>> anything that would require that.
> >>> 
> >>> Upayavira


Re: Avoid re indexing

2015-08-01 Thread Nagasharath
Yes, shard splitting will only help in managing large clusters and in improving
query performance. In my case the index has fully grown across the collection
(there is no capacity left in the existing shards), so adding a new shard would
help, and for that I would have to re-index.


> On 01-Aug-2015, at 6:34 pm, Upayavira  wrote:
> 
> Erm, that doesn't seem to make sense. Seems like you are talking about
> *merging* shards.
> 
> Say you had two shards, 3m docs each:
> 
> shard1: 3m docs
> shard2: 3m docs
> 
> If you split shard1, you would have:
> 
> shard1_0: 1.5m docs
> shard1_1: 1.5m docs
> shard2: 3m docs
> 
> You could, of course, then split shard2. You could also split shard1
> into three parts instead, if you preferred:
> 
> shard1_0: 1m docs
> shard1_1: 1m docs
> shard1_2: 1m docs
> shard2: 3m docs
> 
> Upayavira
> 
>> On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
>> If my current shard is holding 3 million documents, will the new
>> subshards after splitting also be able to hold 3 million documents
>> each? If that is the case, after shard splitting the subshards should
>> hold 6 million documents between them if a shard is split into two.
>> Am I right?
>> 
>>> On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
>>> 
>>> 
>>> 
 On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
 I am using SolrJ to index documents.
 
 I agree with you regarding the index update, but I should not see any
 deleted documents as this is a fresh index. Can we actually identify
 which documents were deleted?
>>> 
>>> If you post doc 1234 and then post doc 1234 a second time, you will see
>>> a deletion in your index. If you don't want deletions to show in your
>>> index, be sure NEVER to update a document, only add new ones with
>>> absolutely distinct document IDs.
>>> 
>>> You cannot see (via Solr) which docs are deleted. You could, I suppose,
>>> introspect the Lucene index, but that would most definitely be an expert
>>> task.
>>> 
 If there is no option of adding shards to an existing collection, I do
 not like the idea of re-indexing the whole data (worth hours of work).
 We have gone with a good number of shards, but there has been a rapid
 increase in data size over the past few days. Do you think it is worth
 logging a ticket?
>>> 
>>> You can split a shard. See the collections API:
>>> 
>>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
>>> 
>>> What would you want to log a ticket for? I'm not sure that there's
>>> anything that would require that.
>>> 
>>> Upayavira


Re: Avoid re indexing

2015-08-01 Thread Upayavira
Erm, that doesn't seem to make sense. Seems like you are talking about
*merging* shards.

Say you had two shards, 3m docs each:

shard1: 3m docs
shard2: 3m docs

If you split shard1, you would have:

shard1_0: 1.5m docs
shard1_1: 1.5m docs
shard2: 3m docs

You could, of course, then split shard2. You could also split shard1
into three parts instead, if you preferred:

shard1_0: 1m docs
shard1_1: 1m docs
shard1_2: 1m docs
shard2: 3m docs
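
To get the three-way split, SPLITSHARD accepts an optional "ranges"
parameter listing the hash sub-ranges you want. A tiny sketch (the hex
values and collection/shard names below are placeholders; the real
ranges must exactly partition the parent shard's hash range, which you
can read from the clusterstate):

public class ThreeWaySplitUrl {
    public static void main(String[] args) {
        // Three sub-ranges => three sub-shards (shard1_0, shard1_1, shard1_2).
        // Without "ranges", SPLITSHARD simply halves the shard.
        String ranges = "0-2aaa,2aab-5554,5555-7fff";   // placeholder values

        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=SPLITSHARD&collection=mycollection"
                + "&shard=shard1&ranges=" + ranges;

        // Send this request to any node in the cluster (browser, curl, SolrJ...).
        System.out.println(url);
    }
}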

Upayavira

On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
> If my current shard is holding 3 million documents, will the new
> subshards after splitting also be able to hold 3 million documents
> each? If that is the case, after shard splitting the subshards should
> hold 6 million documents between them if a shard is split into two. Am
> I right?
> 
> > On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
> > 
> > 
> > 
> >> On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> >> I am using SolrJ to index documents.
> >> 
> >> I agree with you regarding the index update, but I should not see
> >> any deleted documents as this is a fresh index. Can we actually
> >> identify which documents were deleted?
> > 
> > If you post doc 1234 and then post doc 1234 a second time, you will see
> > a deletion in your index. If you don't want deletions to show in your
> > index, be sure NEVER to update a document, only add new ones with
> > absolutely distinct document IDs.
> > 
> > You cannot see (via Solr) which docs are deleted. You could, I suppose,
> > introspect the Lucene index, but that would most definitely be an expert
> > task.
> > 
> >> If there is no option of adding shards to an existing collection, I
> >> do not like the idea of re-indexing the whole data (worth hours of
> >> work). We have gone with a good number of shards, but there has been
> >> a rapid increase in data size over the past few days. Do you think
> >> it is worth logging a ticket?
> > 
> > You can split a shard. See the collections API:
> > 
> > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> > 
> > What would you want to log a ticket for? I'm not sure that there's
> > anything that would require that.
> > 
> > Upayavira


Re: Avoid re indexing

2015-08-01 Thread Nagasharath
If my current shard is holding 3 million documents, will the new subshards after
splitting also be able to hold 3 million documents each?
If that is the case, after shard splitting the subshards should hold 6 million
documents between them if a shard is split into two. Am I right?

> On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
> 
> 
> 
>> On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
>> I am using SolrJ to index documents.
>> 
>> I agree with you regarding the index update, but I should not see any
>> deleted documents as this is a fresh index. Can we actually identify
>> which documents were deleted?
> 
> If you post doc 1234 and then post doc 1234 a second time, you will see
> a deletion in your index. If you don't want deletions to show in your
> index, be sure NEVER to update a document, only add new ones with
> absolutely distinct document IDs.
> 
> You cannot see (via Solr) which docs are deleted. You could, I suppose,
> introspect the Lucene index, but that would most definitely be an expert
> task.
> 
>> If there is no option of adding shards to an existing collection, I do
>> not like the idea of re-indexing the whole data (worth hours of work).
>> We have gone with a good number of shards, but there has been a rapid
>> increase in data size over the past few days. Do you think it is worth
>> logging a ticket?
> 
> You can split a shard. See the collections API:
> 
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> 
> What would you want to log a ticket for? I'm not sure that there's
> anything that would require that.
> 
> Upayavira


Re: Avoid re indexing

2015-08-01 Thread Upayavira


On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> I am using SolrJ to index documents.
> 
> I agree with you regarding the index update, but I should not see any
> deleted documents as this is a fresh index. Can we actually identify
> which documents were deleted?

If you post doc 1234 and then post doc 1234 a second time, you will see
a deletion in your index. If you don't want deletions to show in your
index, be sure NEVER to update a document, only add new ones with
absolutely distinct document IDs.

You cannot see (via Solr) which docs are deleted. You could, I suppose,
introspect the Lucene index, but that would most definitely be an expert
task.
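
To make that concrete, here is a small SolrJ sketch (using the Solr
5.x-style HttpSolrClient constructor, a hypothetical
http://localhost:8983/solr/mycollection core, and an assumed *_s
dynamic field) showing how re-posting the same id produces a deleted
document in the index stats while the search results still contain a
single copy:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UpdateCausesDelete {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/mycollection");

        // First add of doc 1234.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1234");
        doc.addField("title_s", "first version");
        client.add(doc);
        client.commit();

        // Second add with the same id: Lucene marks the old copy deleted and
        // writes a new one, so "Deleted Docs" (maxDoc - numDocs) in the core
        // stats goes up even though only one doc 1234 is ever searchable.
        doc.setField("title_s", "second version");
        client.add(doc);
        client.commit();

        long found = client.query(new SolrQuery("id:1234"))
                .getResults().getNumFound();
        System.out.println("numFound for id:1234 = " + found);   // prints 1

        client.close();
    }
}

The deleted copy only disappears once the segment it lives in gets
merged away, which is why the deleted-docs count fluctuates over time.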

> If there is no option of adding shards to an existing collection, I do
> not like the idea of re-indexing the whole data (worth hours of work).
> We have gone with a good number of shards, but there has been a rapid
> increase in data size over the past few days. Do you think it is worth
> logging a ticket?

You can split a shard. See the collections API:

https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3

What would you want to log a ticket for? I'm not sure that there's
anything that would require that.

Upayavira


Re: Avoid re indexing

2015-08-01 Thread naga sharathrayapati
I am using SolrJ to index documents.

I agree with you regarding the index update, but I should not see any
deleted documents as this is a fresh index. Can we actually identify
which documents were deleted?

If there is no option of adding shards to an existing collection, I do not like
the idea of re-indexing the whole data (worth hours of work). We have gone with
a good number of shards, but there has been a rapid increase in data size over
the past few days. Do you think it is worth logging a ticket?

On Sat, Aug 1, 2015 at 5:04 PM, Upayavira  wrote:

>
>
> On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote:
> > I have an exception with one of the documents after indexing 6
> > million documents out of 10 million. Is there any way I can avoid
> > re-indexing those 6 million documents?
>
> How are you indexing your documents? Are you using the DIH? Personally,
> I'd recommend you write your own app to push your content to Solr, then
> you will be able to control exceptions more precisely and have the
> behaviour you expect.
>
> > I also see that there are a few documents that are deleted (based on
> > the count) while indexing. Is there a way to identify which documents
> > those are?
>
> If you see deleted documents but are not actually deleting any, this
> will be because you have updated documents with an existing ID. An
> update is actually a delete followed by an insert.
>
> > Can I add a shard to a collection without re-indexing?
>
> > You cannot just add a new shard to an existing collection (at least,
> > not one that is using the compositeId router, the default). If a
> > shard is too large, you will need to split an existing shard, which
> > you can do with the collections API.
>
> > It is much better, though, to start with the right number of shards
> > if at all possible.
>
> Upayavira
>


Re: Avoid re indexing

2015-08-01 Thread Upayavira


On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote:
> I have an exception with one of the documents after indexing 6 million
> documents out of 10 million. Is there any way I can avoid re-indexing
> those 6 million documents?

How are you indexing your documents? Are you using the DIH? Personally,
I'd recommend you write your own app to push your content to Solr, then
you will be able to control exceptions more precisely and have the
behaviour you expect.
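
As a rough illustration of that approach, here is a sketch of a batched
SolrJ indexer (names like DocSource, the batch size, and the collection
URL are all made up for the example). The point is that a failure only
costs you the current batch, which you can log and retry, rather than
forcing a re-index of everything sent so far:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {

    // Stand-in for wherever your documents actually come from.
    interface DocSource {
        SolrInputDocument next();   // null when exhausted
    }

    public static void index(DocSource source) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/mycollection");
        List<SolrInputDocument> batch = new ArrayList<>();
        SolrInputDocument doc;
        while ((doc = source.next()) != null) {
            batch.add(doc);
            if (batch.size() == 1000) {
                sendBatch(client, batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sendBatch(client, batch);
        }
        client.commit();
        client.close();
    }

    private static void sendBatch(HttpSolrClient client,
                                  List<SolrInputDocument> batch) {
        try {
            client.add(batch);
        } catch (Exception e) {
            // One bad document fails only its own batch; log it (or fall back
            // to adding the docs one by one) instead of aborting the run.
            System.err.println("Batch failed, skipping: " + e.getMessage());
        }
    }
}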

> I also see that there are a few documents that are deleted (based on
> the count) while indexing. Is there a way to identify which documents
> those are?

If you see deleted documents but are not actually deleting any, this
will be because you have updated documents with an existing ID. An
update is actually a delete followed by an insert.

> Can I add a shard to a collection without re-indexing?

You cannot just add a new shard to an existing collection (at least, not
one that is using the compositeId router, the default). If a shard is
too large, you will need to split an existing shard, which you can do
with the collections API.

It is much better, though, to start with the right number of shards if
at all possible.
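
For reference, the shard count is fixed when the collection is created,
e.g. with a Collections API CREATE call along these lines (the names,
counts and config set below are placeholders):

public class CreateCollectionUrl {
    public static void main(String[] args) {
        // numShards is the knob you want to get right up front when using
        // the default compositeId router.
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=mycollection"
                + "&numShards=4&replicationFactor=2"
                + "&collection.configName=myconfig";

        System.out.println(url);   // issue against any node in the cluster
    }
}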

Upayavira


Avoid re indexing

2015-08-01 Thread naga sharathrayapati
I have an exception with one of the documents after indexing 6 million
documents out of 10 million. Is there any way I can avoid re-indexing
those 6 million documents?

I also see that there are a few documents that are deleted (based on the
count) while indexing. Is there a way to identify which documents those are?

Can I add a shard to a collection without re-indexing?