Re: How to re-index SOLR data

2016-08-10 Thread John Bickerstaff
Right...  SOLR doesn't work quite that way...

Keep in mind the value of the data import jar if you have the data from
MySQL stored in a text file, although that would require a little
programming to get the data into the proper format...

But once you get everything into a text file or similar, you don't have to
tax your MySQL database when you want to reindex... Unless your data
changes frequently, in which case you'll probably have to hit MySQL every
time.

Good luck!

On Aug 10, 2016 6:24 PM, "Bharath Kumar"  wrote:


Re: How to re-index SOLR data

2016-08-10 Thread Bharath Kumar
Hi All,

Thanks so much for your inputs. We have a MySQL data source and I think we
will try to re-index using the MySQL data.

I wanted something where I can export all my current data, say to an Excel
file or some other data store, and then import it on another node into the
same collection, starting empty.

On Tue, Aug 9, 2016 at 8:44 PM, Erick Erickson 
wrote:


Re: How to re-index SOLR data

2016-08-09 Thread Erick Erickson
Assuming you can re-index

Consider "collection aliasing". Say your current collection is C1.
Create C2 (using the same cluster, Zookeeper and the like). Go
ahead and index to C2 (however you do that). NOTE: the physical
machines may be _different_ than C1, or not. That's up to you. The
critical bit is that you use the same Zookeeper.

Now, when you are done you use the Collections API CREATEALIAS
command to point a "pseudo collection" to C1 (call it "prod"). This is
seamless to the users.

The flaw in my plan so far is that you probably go at Collection C1
directly. So what you might do is create the "prod" alias and point it at
C1. Now change your LB (or client or whatever) to use the "prod" collection,
then when indexing is complete use CREATEALIAS to point "prod" at C2
instead.

This is actually a quite well-tested process, often used when you want to
change "atomically", e.g. when you reindex the same data nightly but want
all the new data available in its entirety only after it has been QA'd or such.
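The alias switch itself is a single Collections API call. A minimal sketch of
building that CREATEALIAS request (host, alias, and collection names here are
illustrative placeholders, not taken from the thread):

```python
from urllib.parse import urlencode

def createalias_url(base, alias, collection):
    # Build the Collections API CREATEALIAS URL that repoints `alias`
    # at `collection` in one atomic step.
    params = urlencode({"action": "CREATEALIAS",
                        "name": alias,
                        "collections": collection})
    return f"{base}/admin/collections?{params}"

url = createalias_url("http://localhost:8983/solr", "prod", "C2")
print(url)
# Once C2 is fully indexed and QA'd, issuing a GET on this URL
# (e.g. urllib.request.urlopen(url)) flips "prod" from C1 to C2.
```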

Best,
Erick

On Tue, Aug 9, 2016 at 2:43 PM, John Bickerstaff
 wrote:

Re: How to re-index SOLR data

2016-08-09 Thread John Bickerstaff
In my case, I've done two things  neither of them involved taking the
data from SOLR to SOLR...  although in my reading, I've seen that this is
theoretically possible (I.E. sending data from one SOLR server to another
SOLR server and  having the second SOLR instance re-index...)

I haven't used the python script...  that was news to me, but it sounds
interesting...

What I've done is one of the following:

a. Get the data from the original source (database, whatever) and massage
it again so that it's ready for SOLR, and then submit it to my new SolrCloud
for indexing.

b. Keep a separate store of EVERY Solr document as it comes out of my code
(in xml) and store it in Kafka or a text file.  Then it's easy to push back
into another SOLR instance any time - multiple times if necessary.
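Option "b" can be sketched roughly as follows. This is a minimal illustration
only: it uses JSON lines rather than the XML mentioned above, and the path and
field names are hypothetical.

```python
import json

ARCHIVE = "/tmp/solr_docs.jsonl"  # hypothetical archive location

def archive(doc, fh):
    # Append one Solr document to the archive as it leaves your indexing code.
    fh.write(json.dumps(doc) + "\n")

def replay(path):
    # Read every archived document back, ready to re-submit to any collection.
    with open(path) as fh:
        return [json.loads(line) for line in fh]

with open(ARCHIVE, "w") as fh:
    archive({"id": "1", "title": "first doc"}, fh)
    archive({"id": "2", "title": "second doc"}, fh)

docs = replay(ARCHIVE)
print(len(docs))  # each doc can now be POSTed to /update on a new collection
```

The same idea works with Kafka in place of the file: the key point is keeping a
replayable copy of every document in its final, ready-to-index form.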

I'm guessing you don't have the data stored away as in "b"...  And if you
don't have a way of getting the data from some central source, then "a"
won't work either...  Which leaves you with the concept of sending data
from SOLR "A" to SOLR "B" and having "B" reindex...

This might serve as a starting point in that case...
https://wiki.apache.org/solr/HowToReindex

You'll note that there are limitations and a strong caveat against doing
this with SOLR, but if you have no other option, then it's the best you can
do.

Do you have the ability to get all the data again from an authoritative
source?  (Relational Database or something similar?)

On Tue, Aug 9, 2016 at 3:21 PM, Bharath Kumar 
wrote:

Re: How to re-index SOLR data

2016-08-09 Thread Bharath Kumar
Hi John,

Thanks so much for your inputs. We have time to build another system. So
how did you index the same data on the main SOLR node to the new SOLR node?
Did you use the re-index python script? The new data will be indexed
correctly with the new rules, but what about the old data?

Our SOLR data is around 30GB with around 60 million documents. We use
SolrCloud with 3 Solr nodes and 3 ZooKeeper nodes.

On Tue, Aug 9, 2016 at 2:13 PM, John Bickerstaff 
wrote:




-- 
Thanks & Regards,
Bharath MV Kumar

"Life is short, enjoy every moment of it"


Re: How to re-index SOLR data

2016-08-09 Thread John Bickerstaff
In case this helps...

Assuming you have the resources to build a copy of your production
environment and assuming you have the time, you don't need to take your
production down - or even affect its processing...

What I've done (with admittedly smaller data sets) is build a separate
environment (usually on VM's) and once it's set up, I do the new indexing
according to the new "rules"  (Like your change of long to string)

Then, in a sense, I don't care how long it takes because it is not
affecting Prod.

When it's done, I simply switch my load balancer to point to the new
environment and shut down the old one.

To users, this could be seamless if you handle the load balancer correctly
and have it refuse new connections to the old servers while routing all new
connections to the new Solr servers...

On Tue, Aug 9, 2016 at 3:04 PM, Bharath Kumar 
wrote:



Re: How to re-index SOLR data

2016-08-09 Thread Bharath Kumar
Hi Nick and Shawn,

Thanks so much for the pointers. I will try that out. Thank you again!

On Tue, Aug 9, 2016 at 9:40 AM, Nick Vasilyev 
wrote:




-- 
Thanks & Regards,
Bharath MV Kumar

"Life is short, enjoy every moment of it"


Re: How to re-index SOLR data

2016-08-09 Thread Nick Vasilyev
Hi, I work on a Python Solr Client library
(https://github.com/moonlitesolutions/SolrClient), and there is a
reindexing helper module that you can use if you are on Solr 4.9+. I use it
all the time and I think it works pretty well. You can re-index all
documents from a collection into another collection or dump them to the
filesystem as JSON. It also supports parallel execution and can run
independently on each shard. There is also a way to resume if your job
craps out half way through if your existing schema is set up with a good
date field and unique id.

You can read the documentation here:
http://solrclient.readthedocs.io/en/latest/Reindexer.html

Code is pretty short and is here:
https://github.com/moonlitesolutions/SolrClient/blob/master/SolrClient/helpers/reindexer.py

Here is a sample:

from SolrClient import SolrClient
from SolrClient.helpers import Reindexer

r = Reindexer(SolrClient('http://source_solr:8983/solr'),
              SolrClient('http://destination_solr:8983/solr'),
              source_coll='source_collection',
              dest_coll='destination-collection')
r.reindex()

On Tue, Aug 9, 2016 at 9:56 AM, Shawn Heisey  wrote:



Re: How to re-index SOLR data

2016-08-09 Thread Shawn Heisey
On 8/9/2016 1:48 AM, bharath.mvkumar wrote:
> What would be the best way to re-index the data in the SOLR cloud? We
> have around 65 million data and we are planning to change the schema
> by changing the unique key type from long to string. How long does it
> take to re-index 65 million documents in SOLR and can you please
> suggest how to do that?

There is no magic bullet.  And there's no way for anybody but you to
determine how long it's going to take.  There are people who have
achieved over 50K inserts per second, and others who have difficulty
reaching 1000 per second.  Many factors affect indexing speed, including
the size of your documents, the complexity of your analysis, the
capabilities of your hardware, and how many threads/processes you are
using at the same time when you index.
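Those throughput figures bracket the likely wall-clock time. A quick
back-of-envelope for the 65 million documents in question (the rates are
illustrative examples from the range above, not measurements):

```python
DOCS = 65_000_000

# Wall-clock hours at a few assumed sustained indexing rates (docs/sec).
for rate in (1_000, 5_000, 50_000):
    hours = DOCS / rate / 3600
    print(f"{rate:>6} docs/sec -> {hours:.1f} hours")
```

So depending on documents, analysis, hardware, and client parallelism, the
same reindex could take anywhere from under an hour to most of a day.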

Here's some more detailed info about reindexing, but it's probably not
what you wanted to hear:

https://wiki.apache.org/solr/HowToReindex

Thanks,
Shawn



How to re-index SOLR data

2016-08-09 Thread bharath.mvkumar
Hi All,

What would be the best way to re-index the data in the SOLR cloud? We have
around 65 million data and we are planning to change the schema by changing
the unique key type from long to string.

How long does it take to re-index 65 million documents in SOLR and can you
please suggest how to do that?

Thanks,
Bharath Kumar



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-re-index-SOLR-data-tp4290893.html
Sent from the Solr - User mailing list archive at Nabble.com.