Assuming you can re-index....

Consider "collection aliasing". Say your current collection is C1.
Create C2 (using the same cluster, Zookeeper and the like). Go
ahead and index to C2 (however you do that). NOTE: the physical
machines may be _different_ than C1, or not. That's up to you. The
critical bit is that you use the same Zookeeper.

Now, when the indexing into C2 is done, you use the Collections API
CREATEALIAS command to point a "pseudo collection" (call it "prod") at C2.
This is seamless to the users.

The flaw in my plan so far is that your clients probably go at collection
C1 directly, not through an alias. So what you might do is create the
"prod" alias and point it at C1 first. Now change your LB (or client or
whatever) to use the "prod" collection; then, when indexing is complete,
use CREATEALIAS to point "prod" at C2 instead.
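
As a concrete sketch (the host, port, and the helper function are
illustrative assumptions; CREATEALIAS and its name/collections parameters
are the real Collections API), the two alias moves might look like this
in Python:

import requests

SOLR = "http://localhost:8983/solr/admin/collections"

def point_alias(alias, collection):
    # CREATEALIAS creates the alias if it is new and repoints it if it
    # already exists, so the same call serves both steps.
    requests.get(SOLR, params={
        "action": "CREATEALIAS",
        "name": alias,
        "collections": collection,
    }).raise_for_status()

point_alias("prod", "C1")   # step 1: alias over the live collection
# ... switch LB/clients to query "prod", reindex everything into C2 ...
point_alias("prod", "C2")   # step 2: cut over to the new index

Since the swap is just an alias update in ZooKeeper, in-flight queries
finish against C1 and new ones hit C2.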

This is actually quite a well-tested process, often used when you want to
switch "atomically", e.g. when you reindex the same data nightly but want
all the new data available in its entirety only after it has been QA'd or
the like.

Best,
Erick

On Tue, Aug 9, 2016 at 2:43 PM, John Bickerstaff
<j...@johnbickerstaff.com> wrote:
> In my case, I've done two things...  neither of them involved taking the
> data from SOLR to SOLR...  although in my reading, I've seen that this is
> theoretically possible (i.e. sending data from one SOLR server to another
> SOLR server and having the second SOLR instance re-index...)
>
> I haven't used the python script...  that was news to me, but it sounds
> interesting...
>
> What I've done is one of the following:
>
> a. Get the data from the original source (database, whatever) and massage
> it again so that it's ready for SOLR, then submit it to my new SolrCloud
> for indexing.
>
> b. Keep a separate store of EVERY Solr document as it comes out of my code
> (in XML) and store it in Kafka or a text file.  Then it's easy to push it
> back into another SOLR instance any time - multiple times if necessary.
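>
> As a rough sketch of "b" (illustrative only: the port, collection name,
> and one-document-per-line file layout are assumptions), replaying the
> stored XML is just POSTs to the standard /update handler:
>
> import requests
>
> SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"
>
> # Each stored line is assumed to be one complete <doc>...</doc> element;
> # wrap it in <add> and send it back to Solr.
> with open("solr_docs.xml") as f:
>     for doc_xml in f:
>         requests.post(
>             SOLR_UPDATE,
>             data="<add>" + doc_xml.strip() + "</add>",
>             headers={"Content-Type": "application/xml"},
>         ).raise_for_status()
>
> # Commit once at the end rather than per document.
> requests.get(SOLR_UPDATE, params={"commit": "true"}).raise_for_status()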
>
> I'm guessing you don't have the data stored away as in "b"...  And if you
> don't have a way of getting the data from some central source, then "a"
> won't work either...  Which leaves you with the concept of sending data
> from SOLR "A" to SOLR "B" and having "B" reindex...
>
> This might serve as a starting point in that case...
> https://wiki.apache.org/solr/HowToReindex
>
> You'll note that there are limitations and a strong caveat against doing
> this with SOLR, but if you have no other option, then it's the best you can
> do.
>
> Do you have the ability to get all the data again from an authoritative
> source?  (Relational Database or something similar?)
>
> On Tue, Aug 9, 2016 at 3:21 PM, Bharath Kumar <bharath.mvku...@gmail.com>
> wrote:
>
>> Hi John,
>>
>> Thanks so much for your inputs. We have time to build another system. So
>> how did you re-index the data from the main SOLR node into the new SOLR
>> node? Did you use the re-index python script? The new data will be indexed
>> correctly with the new rules, but what about the old data?
>>
>> Our SOLR data is around 30GB with around 60 million documents. We use SOLR
>> cloud with 3 solr nodes and 3 zookeepers.
>>
>> On Tue, Aug 9, 2016 at 2:13 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:
>>
>> > In case this helps...
>> >
>> > Assuming you have the resources to build a copy of your production
>> > environment and assuming you have the time, you don't need to take your
>> > production down - or even affect its processing...
>> >
>> > What I've done (with admittedly smaller data sets) is build a separate
>> > environment (usually on VMs) and once it's set up, I do the new indexing
>> > according to the new "rules" (like your change of long to string).
>> >
>> > Then, in a sense, I don't care how long it takes because it is not
>> > affecting Prod.
>> >
>> > When it's done, I simply switch my load balancer to point to the new
>> > environment and shut down the old one.
>> >
>> > To users, this could be seamless if you handle the load balancer
>> > correctly and have it refuse new connections to the old servers while
>> > routing all new connections to the new Solr servers...
>> >
>> > On Tue, Aug 9, 2016 at 3:04 PM, Bharath Kumar <bharath.mvku...@gmail.com> wrote:
>> >
>> > > Hi Nick and Shawn,
>> > >
>> > > Thanks so much for the pointers. I will try that out. Thank you again!
>> > >
>> > > On Tue, Aug 9, 2016 at 9:40 AM, Nick Vasilyev <nick.vasily...@gmail.com> wrote:
>> > >
>> > > > Hi, I work on a python Solr Client
>> > > > <http://solrclient.readthedocs.io/en/latest/> library and there is a
>> > > > reindexing helper module that you can use if you are on Solr 4.9+. I
>> > > > use it all the time and I think it works pretty well. You can
>> > > > re-index all documents from a collection into another collection or
>> > > > dump them to the filesystem as JSON. It also supports parallel
>> > > > execution and can run independently on each shard. There is also a
>> > > > way to resume if your job craps out halfway through, provided your
>> > > > existing schema is set up with a good date field and unique id.
>> > > >
>> > > > You can read the documentation here:
>> > > > http://solrclient.readthedocs.io/en/latest/Reindexer.html
>> > > >
>> > > > Code is pretty short and is here:
>> > > > https://github.com/moonlitesolutions/SolrClient/blob/master/SolrClient/helpers/reindexer.py
>> > > >
>> > > > Here is a sample:
>> > > > from SolrClient import SolrClient
>> > > > from SolrClient.helpers import Reindexer
>> > > >
>> > > > r = Reindexer(SolrClient('http://source_solr:8983/solr'),
>> > > >               SolrClient('http://destination_solr:8983/solr'),
>> > > >               source_coll='source_collection',
>> > > >               dest_coll='destination-collection')
>> > > > r.reindex()
>> > > >
>> > > > On Tue, Aug 9, 2016 at 9:56 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > > >
>> > > > > On 8/9/2016 1:48 AM, bharath.mvkumar wrote:
>> > > > > > What would be the best way to re-index the data in the SOLR
>> > > > > > cloud? We have around 65 million documents and we are planning
>> > > > > > to change the schema by changing the unique key type from long
>> > > > > > to string. How long does it take to re-index 65 million
>> > > > > > documents in SOLR and can you please suggest how to do that?
>> > > > >
>> > > > > There is no magic bullet.  And there's no way for anybody but you
>> > > > > to determine how long it's going to take.  There are people who
>> > > > > have achieved over 50K inserts per second, and others who have
>> > > > > difficulty reaching 1000 per second.  Many factors affect indexing
>> > > > > speed, including the size of your documents, the complexity of
>> > > > > your analysis, the capabilities of your hardware, and how many
>> > > > > threads/processes you are using at the same time when you index.
>> > > > >
>> > > > > Here's some more detailed info about reindexing, but it's probably
>> > > > > not what you wanted to hear:
>> > > > >
>> > > > > https://wiki.apache.org/solr/HowToReindex
>> > > > >
>> > > > > Thanks,
>> > > > > Shawn
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks & Regards,
>> > > Bharath MV Kumar
>> > >
>> > > "Life is short, enjoy every moment of it"
>> > >
>> >
>>
>>
>>
>> --
>> Thanks & Regards,
>> Bharath MV Kumar
>>
>> "Life is short, enjoy every moment of it"
>>
