Solr5.X document loss in splitting shards

2015-12-26 Thread Luca Quarello
Hi,
I have a SOLR 5.3.1 CLOUD with two nodes and 8 shards per node.

Each shard has about 35 million documents (35025882) and is about 16 GB in size.


   - I launch the SPLIT command on a shard (shard13) asynchronously:

curl "http://x-perf-jvm5:8983/solr/admin/collections?action=SPLITSHARD&collection=sepa&shard=shard13&async=1006"



   - After some time I obtain:

curl "http://x-perf-jvm5:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1006"


sepa_shard13_1_replica1 EMPTY_BUFFER http://192.168.72.55:8983/solr/sepa_shard13_replica1/

completed TaskId: 1006264805140687740 webapp=null path=/admin/cores
params={shard=shard13_0&collection.configName=flsFragments&name=sepa_shard13_0_replica1&action=CREATE&collection=sepa&wt=javabin&qt=/admin/cores&async=1006264805140687740&version=2}
status=0 QTime=2

completed TaskId: 1006264808287598167 webapp=null path=/admin/cores
params={shard=shard13_1&collection.configName=flsFragments&name=sepa_shard13_1_replica1&action=CREATE&collection=sepa&wt=javabin&qt=/admin/cores&async=1006264808287598167&version=2}
status=0 QTime=0

completed TaskId: 1006264810307413066 webapp=null path=/admin/cores
params={coreNodeName=core_node18&state=active&nodeName=192.168.72.55:8983_solr&action=PREPRECOVERY&checkLive=true&core=sepa_shard13_1_replica1&wt=javabin&qt=/admin/cores&onlyIfLeader=true&async=1006264810307413066&version=2}
status=0 QTime=0

completed TaskId: 1006264810317508052 webapp=null path=/admin/cores
params={targetCore=sepa_shard13_0_replica1&targetCore=sepa_shard13_1_replica1&action=SPLIT&core=sepa_shard13_replica1&wt=javabin&qt=/admin/cores&async=1006264810317508052&version=2}
status=0 QTime=0

completed TaskId: 1006266054432757899 webapp=null path=/admin/cores
params={name=sepa_shard13_1_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&async=1006266054432757899&version=2}
status=0 QTime=5

completed: found 1006 in completed tasks
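The status check above can be sketched as a small shell helper (a sketch: status_of is a hypothetical name, and it matches the state words in the response text rather than parsing the XML properly; the host and request id 1006 are the ones used in this thread):

```shell
#!/bin/sh
# status_of: classify a REQUESTSTATUS response body by the state word
# it contains (completed / failed / running).
status_of() {
  case "$1" in
    *completed*) echo completed ;;
    *failed*)    echo failed ;;
    *running*)   echo running ;;
    *)           echo unknown ;;
  esac
}

# In a real run you would poll until the split finishes, e.g.:
#   url='http://x-perf-jvm5:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1006'
#   while [ "$(status_of "$(curl -s "$url")")" = running ]; do sleep 30; done

status_of "found 1006 in completed tasks"   # -> completed
```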




   - I launch the commit command:

curl http://x-perf-jvm5:8983/solr/sepa/update --data-binary '<commit/>' -H 'Content-type:application/xml'



status=0 QTime=162




The newly created shards have:
13430316 documents (5.6 GB) and 13425924 documents (5.59 GB).

What is the problem? Where am I going wrong?

Thanks,
Luca


Re: Solr5.X document loss in splitting shards

2015-12-27 Thread Shawn Heisey
On 12/26/2015 11:21 AM, Luca Quarello wrote:
> I have a SOLR 5.3.1 CLOUD with two nodes and 8 shards per node.
> 
> Each shard has about 35 million documents (35025882) and is about 16 GB in size.
> 
> 
>- I launch the SPLIT command on a shard (shard 13) in the ASYNC way:



> The newly created shards have:
> 13430316 documents (5.6 GB) and 13425924 documents (5.59 GB).

Where are you looking that shows you the source shard has 35 million
documents?  Be extremely specific.

The following screenshot shows one place you might be looking for this
information -- the core overview page:

https://www.dropbox.com/s/311n49wkp9kw7xa/admin-ui-core-overview.png?dl=0

Is the core overview page where you are looking, or is it somewhere else?

I'm asking because "Max Doc" and "Num Docs" on the core overview page
mean very different things.  The difference between them is the number
of deleted docs, and the split shards are probably missing those deleted
docs.

This is the only idea that I have.  If it's not that, then I'm as
clueless as you are.

Thanks,
Shawn



Re: Solr5.X document loss in splitting shards

2015-12-28 Thread GW
I don't use curl, but a couple of things come to mind:

1: Maybe use document routing with the shards: put an "!" in your unique
ID. (I'm using Gmail to read this and it is poor for searching content, so if
you have already done this, please ignore this point.) Example: if you were
storing documents per domain, your unique field values would look like
www.domain1.com!123, www.domain1.com!124, www.domain2.com!35, etc.

This should create a two-segment hash for routing documents to shards. I do
this on faith as a best practice, since it is mentioned in the docs.
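A quick way to see the two parts of such a composite id is to split it on the "!" (a sketch; route_key is a hypothetical helper name, and the ids are the example values above):

```shell
#!/bin/sh
# route_key: print the routing prefix of a composite document id.
# Ids that share a prefix are routed to the same shard.
route_key() {
  printf '%s\n' "${1%%!*}"   # strip everything from the first "!" onward
}

route_key "www.domain1.com!123"   # -> www.domain1.com
route_key "www.domain2.com!35"    # -> www.domain2.com
```

With the default compositeId router, the target shard is chosen from a hash built from both parts, with the prefix bits dominating, which is what keeps one domain's documents together.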

2: curl works best when request parameters are URL-encoded. When I was using
curl, I noticed some strange results without URL encoding.
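Recent curl can encode parameters itself with -G/--data-urlencode; the helper below (urlencode, a hypothetical name) percent-encodes a value by hand so the effect is visible without a running Solr:

```shell
#!/bin/sh
# urlencode: percent-encode a query-parameter value, leaving RFC 3986
# unreserved characters (letters, digits, "-", ".", "_", "~") untouched.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    c=${s%"${s#?}"}   # first character of the remaining string
    s=${s#?}          # drop that character
    case "$c" in
      [a-zA-Z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;   # "'x" yields the char code
    esac
  done
  printf '%s\n' "$out"
}

# With a recent curl you could instead write, e.g.:
#   curl -G "http://x-perf-jvm5:8983/solr/sepa/select" --data-urlencode "q=id:www.domain1.com!123"
urlencode 'id:www.domain1.com!123'   # -> id%3Awww.domain1.com%21123
```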

What are you using to write your client?

Best,

GW



On 27 December 2015 at 19:35, Shawn Heisey  wrote:

> On 12/26/2015 11:21 AM, Luca Quarello wrote:
> > I have a SOLR 5.3.1 CLOUD with two nodes and 8 shards per node.
> >
> > Each shard has about 35 million documents (35025882) and is about 16 GB in size.
> >
> >
> >- I launch the SPLIT command on a shard (shard 13) in the ASYNC way:
>
> 
>
> > The newly created shards have:
> > 13430316 documents (5.6 GB) and 13425924 documents (5.59 GB).
>
> Where are you looking that shows you the source shard has 35 million
> documents?  Be extremely specific.
>
> The following screenshot shows one place you might be looking for this
> information -- the core overview page:
>
> https://www.dropbox.com/s/311n49wkp9kw7xa/admin-ui-core-overview.png?dl=0
>
> Is the core overview page where you are looking, or is it somewhere else?
>
> I'm asking because "Max Doc" and "Num Docs" on the core overview page
> mean very different things.  The difference between them is the number
> of deleted docs, and the split shards are probably missing those deleted
> docs.
>
> This is the only idea that I have.  If it's not that, then I'm as
> clueless as you are.
>
> Thanks,
> Shawn
>
>


Re: Solr5.X document loss in splitting shards

2015-12-29 Thread Luca Quarello
Hi Shawn,
I'm looking at the doc counts on the core overview page, and the situation
is:

Num Docs: 35031923
Max Doc: 35156879

The difference between them (124956 deleted documents) doesn't explain the
strange behavior.
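Spelling the arithmetic out with the numbers from this thread (a sanity-check sketch, shell arithmetic only):

```shell
#!/bin/sh
# Doc counts reported in this thread.
src_numdocs=35031923   # Num Docs on the source shard
src_maxdoc=35156879    # Max Doc on the source shard
sub1=13430316          # docs in the first sub-shard
sub2=13425924          # docs in the second sub-shard

deleted=$((src_maxdoc - src_numdocs))   # deleted-but-not-merged docs
total=$((sub1 + sub2))                  # docs in both sub-shards combined
missing=$((src_numdocs - total))        # live docs unaccounted for

echo "deleted=$deleted total=$total missing=$missing"
# -> deleted=124956 total=26856240 missing=8175683
```

So even if all 124956 deleted documents legitimately disappeared in the split, roughly 8.2 million live documents are still unaccounted for.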





On Mon, Dec 28, 2015 at 1:35 AM, Shawn Heisey  wrote:

> On 12/26/2015 11:21 AM, Luca Quarello wrote:
> > I have a SOLR 5.3.1 CLOUD with two nodes and 8 shards per node.
> >
> > Each shard has about 35 million documents (35025882) and is about 16 GB in size.
> >
> >
> >- I launch the SPLIT command on a shard (shard 13) in the ASYNC way:
>
> 
>
> > The newly created shards have:
> > 13430316 documents (5.6 GB) and 13425924 documents (5.59 GB).
>
> Where are you looking that shows you the source shard has 35 million
> documents?  Be extremely specific.
>
> The following screenshot shows one place you might be looking for this
> information -- the core overview page:
>
> https://www.dropbox.com/s/311n49wkp9kw7xa/admin-ui-core-overview.png?dl=0
>
> Is the core overview page where you are looking, or is it somewhere else?
>
> I'm asking because "Max Doc" and "Num Docs" on the core overview page
> mean very different things.  The difference between them is the number
> of deleted docs, and the split shards are probably missing those deleted
> docs.
>
> This is the only idea that I have.  If it's not that, then I'm as
> clueless as you are.
>
> Thanks,
> Shawn
>
>


Re: Solr5.X document loss in splitting shards

2015-12-29 Thread Luca Quarello
Hi,
the only way I have found to work around the problem is to do the split using a
Solr instance configured in standalone mode:

curl "http://localhost:8983/solr/admin/cores?action=SPLIT&core=sepa&path=/nas_perf_2/FRAGMENTS/17MINDEXES/1/index&path=/nas_perf/FRAGMENTS/17MINDEXES/2/index"

Does shard splitting work properly for large shards in SolrCloud mode?

Thanks!



Luca Quarello

M:+39 347 018 3855

luca.quare...@xeffe.it



XEFFE s.r.l.

C.so Giovanni Lanza 72, 10131 Torino

T: +39 011 660 5039

F: +39 011 198 26822

www.xeffe.it

On Mon, Dec 28, 2015 at 2:58 PM, GW  wrote:

> I don't use Curl but there are a couple of things that come to mind
>
> 1: Maybe use document routing with the shards. Use an "!" in your unique
> ID. I'm using gmail to read this and it sucks for searching content so if
> you have done this please ignore this point. Example: If you were storing
> documents per domain you unique field values would look like
> www.domain1.com!123,  www.domain1.com!124,
>www.domain2.com!35, etc.
>
> This should create a two segment hash for searching shards. I do this in
> blind faith as a best practice as it is mentioned in the docs.
>
> 2: Curl works best with URL encoding. I was using Curl at one time and I
> noticed some strange results w/o url encoding
>
> What are you using to write your client?
>
> Best,
>
> GW
>
>
>
> On 27 December 2015 at 19:35, Shawn Heisey  wrote:
>
> > On 12/26/2015 11:21 AM, Luca Quarello wrote:
> > > I have a SOLR 5.3.1 CLOUD with two nodes and 8 shards per node.
> > >
> > > Each shard has about 35 million documents (35025882) and is about 16 GB in size.
> > >
> > >
> > >- I launch the SPLIT command on a shard (shard 13) in the ASYNC way:
> >
> > 
> >
> > > The newly created shards have:
> > > 13430316 documents (5.6 GB) and 13425924 documents (5.59 GB).
> >
> > Where are you looking that shows you the source shard has 35 million
> > documents?  Be extremely specific.
> >
> > The following screenshot shows one place you might be looking for this
> > information -- the core overview page:
> >
> >
> https://www.dropbox.com/s/311n49wkp9kw7xa/admin-ui-core-overview.png?dl=0
> >
> > Is the core overview page where you are looking, or is it somewhere else?
> >
> > I'm asking because "Max Doc" and "Num Docs" on the core overview page
> > mean very different things.  The difference between them is the number
> > of deleted docs, and the split shards are probably missing those deleted
> > docs.
> >
> > This is the only idea that I have.  If it's not that, then I'm as
> > clueless as you are.
> >
> > Thanks,
> > Shawn
> >
> >
>

