Re: Multiple collections for a write-alias
We are actually very close to doing what Shawn has suggested. Emir has a good point about a new collection failing on deletes/updates of older documents that are not present in it. But even if this feature can only be implemented for an append-only log, it would still make a good feature IMO.

The usual use-case for re-indexing everything is an attribute change, like enabling "indexed" or "docValues" on a field, or adding a new field to the schema. While the reading client code sits behind a flag before it starts using the new attribute/field, we have to re-index all the data without stopping older-format reads. Currently we have to either do dual writes to the new collection or play catch-up-after-a-bootstrap. Note that catch-up-after-a-bootstrap is not easy either (it is very similar to the approach Shawn described). If that special place is Kafka or some table in the DB, then we have to do dual writes to the regular source-of-truth and to this special place. Dual writes to the DB and Kafka are transaction-less (and thus lack consistency), while a dual write to the DB increases the load on the DB. Having created_date / modified_date fields and querying the DB to find live-traffic documents has its own problems and again taxes the DB.

Dual writes directly to multiple Solr collections are the simplest thing for a client to implement, and that is exactly what this new feature could be. With a dual-write collection alias, the client does not have to implement any of the above, provided the alias does the following:

- Deletes of documents missing from the new collection are simply ignored.
- Incremental (atomic) updates throw an error, as unsupported on a multi-write collection alias.
- Regular updates (i.e. delete-then-insert) work just fine, because they treat the document as a brand new one and versioning strategies can take care of out-of-order updates.
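The three rules above could be simulated client-side today. Here is a minimal, hypothetical Python sketch of the proposed semantics -- collections are plain dicts rather than real Solr collections, and the `_version_` handling is only a stand-in for Solr's optimistic concurrency:

```python
# Hypothetical sketch of the proposed dual-write alias semantics.
# Collections are modeled as dicts keyed by document id; no real Solr calls.

class DualWriteAlias:
    def __init__(self, collections):
        self.collections = collections  # list of dicts: {doc_id: doc}

    def add(self, doc):
        # Regular update (delete-then-insert): overwrite in every collection,
        # letting a _version_ field resolve out-of-order updates.
        for coll in self.collections:
            existing = coll.get(doc["id"])
            if existing and existing.get("_version_", 0) > doc.get("_version_", 0):
                continue  # stale update arrived late; keep the newer document
            coll[doc["id"]] = doc

    def delete(self, doc_id):
        # Rule 1: a delete of a document missing from a collection is ignored.
        for coll in self.collections:
            coll.pop(doc_id, None)

    def atomic_update(self, doc_id, fields):
        # Rule 2: incremental (atomic) updates are rejected outright.
        raise NotImplementedError("atomic updates unsupported on a multi-write alias")

old, new = {}, {}
alias = DualWriteAlias([old, new])
alias.add({"id": "1", "_version_": 2, "title": "current"})
alias.add({"id": "1", "_version_": 1, "title": "stale"})  # out-of-order, ignored
alias.delete("missing-doc")                               # absent everywhere: no error
```

This is only a sketch of the desired behavior, not an implementation proposal for Solr itself.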
SG

On Fri, Nov 10, 2017 at 6:33 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> This approach could work only if it is an append-only index. In case you
> have updates/deletes, you have to process them in order, otherwise you
> will get incorrect results. I am thinking that is one of the reasons why
> it might not be supported, since it is not too useful.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 9 Nov 2017, at 19:09, S G wrote:
> >
> > Hi,
> >
> > We have a use-case to re-create a solr-collection by re-ingesting
> > everything, but not tolerate any downtime while that is happening.
> >
> > We are using the collection alias feature to point to the new collection
> > when it has been re-ingested fully.
> >
> > However, re-ingestion takes several hours to complete, and during that
> > time the customer has to write to both collections - the previous
> > collection and the one being bootstrapped.
> > This dual-write is harder to do from the client side (because the client
> > needs retry logic to ensure an update does not succeed in one collection
> > and fail in the other - a consistency problem), and it would be a real
> > welcome addition if collection aliasing could support this.
> >
> > Proposal:
> > If we can enhance the write alias to point to multiple collections such
> > that any update to the alias is written to all the collections it points
> > to, it would help the client avoid dual writes and also issue just a
> > single http call instead of multiple. It would also reduce the retry
> > logic inside the client code used to keep the collections consistent.
> >
> > Thanks
> > SG
Re: Multiple collections for a write-alias
This approach could work only if it is an append-only index. In case you have updates/deletes, you have to process them in order, otherwise you will get incorrect results. I am thinking that is one of the reasons why it might not be supported, since it is not too useful.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
Re: Multiple collections for a write-alias
On 11/9/2017 11:09 AM, S G wrote:
> However, re-ingestion takes several hours to complete and during that time,
> the customer has to write to both the collections - previous collection and
> the one being bootstrapped.
> This dual-write is harder to do from the client side (because client needs
> to have a retry logic to ensure any update does not succeed in one
> collection and fails in another - consistency problem) and it would be a
> real welcome addition if collection aliasing can support this.

Let me explain how I handle this situation. I'm not running in cloud mode, but I use the "swap" feature of CoreAdmin to do much the same thing you're describing with collection aliases.

My source data (a MySQL database) has a way to track the last new document that was added, which deletes have been applied, and which documents need to be reinserted. I use these pointers to decide what data to retrieve on each indexing cycle, and then I advance them to new positions when the indexing cycle completes successfully.

When I do a full rebuild, I grab the current positions for new docs, deletes, and reinserts, and store that information in a special place. Then I start building indexes in the "build" cores. In the meantime, I continue to update all the "live" cores, so users are unaware that anything special is happening.

When the rebuild finishes (which can take a day or more), I go back to that special place where I stored the position information and run a "catchup" indexing process on the build cores -- applying all the deletes, new documents, and reinserts that happened since the full rebuild started. When that completes, I swap the build cores with the live cores and resume normal operation.

Doing it this way, I do not need to worry about the normal indexing cycle writing to both the old index and the new index -- the ongoing cycle just updates the current live cores.
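The pointer-based catch-up above can be sketched abstractly. In this minimal Python simulation, an append-only change log stands in for the MySQL source, dicts stand in for cores, and a tuple swap mimics CoreAdmin's swap -- all names here are hypothetical stand-ins, not Solr APIs:

```python
# Hypothetical simulation of the position-pointer rebuild described above.
# changelog is an append-only list of ("add"/"delete", doc) events.

changelog = []

def apply_events(core, events):
    # Replay a slice of the change log onto a core, in order.
    for op, doc in events:
        if op == "add":
            core[doc["id"]] = doc
        elif op == "delete":
            core.pop(doc["id"], None)

live, build = {}, {}

# Normal operation: events flow into the live core as they arrive.
changelog.append(("add", {"id": "a"}))
apply_events(live, changelog)

# Full rebuild starts: remember the current position in the change log.
rebuild_start = len(changelog)

# Rebuild the build core from scratch (slow; live keeps serving meanwhile).
apply_events(build, changelog[:rebuild_start])

# While the rebuild runs, live traffic continues into the live core.
changelog.append(("add", {"id": "b"}))
apply_events(live, changelog[rebuild_start:])

# Catch-up: replay everything since the stored position onto the build core.
apply_events(build, changelog[rebuild_start:])

# Swap: the freshly built core becomes live.
live, build = build, live
```

The key design point is that only the stored position marker is needed to make the catch-up pass correct, regardless of how long the rebuild takes.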
> Proposal:
> If can enhance the write alias to point to multiple collections such that
> any update to the alias is written to all the collections it points to, it
> would help the client to avoid dual writes and also issue just a single
> http call from the client instead of multiple. It would also reduce the
> retry logic inside the client code used to keep the collections consistent.

Imagine an index with time-series data, where there is an alias called "today" that includes up to 24 hourly collections. If you were to write to that alias with the idea you've proposed, the data would end up in the wrong places and would in fact get incorrectly duplicated many times ... but the way it currently works, the writes only go to the FIRST collection in the alias, which can be arranged to always be the "current" collection.

Your proposal is an interesting idea, but it would require some development work. Errors during indexing could be a major source of headaches, especially those errors that don't affect all collections in the alias equally. So as not to change how users expect Solr to work currently, aliases would need a special flag to indicate that writes *should* be duplicated to all collections in the alias, or maybe there would need to be two different kinds of aliases.

Since such a feature is probably not going to happen quickly even if it is something that we agree to work on, would you be able to use something like the method that I outlined above?

Thanks,
Shawn
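Shawn's time-series objection can be illustrated with a toy alias resolver. Collection names and layout here are made up; the only grounded fact (stated above) is that a write through a multi-collection alias currently lands in the first collection listed:

```python
# Toy model of an alias spanning hourly collections, contrasting the
# current write behavior (first collection only) with the proposed
# write-to-all behavior. Names are hypothetical.

hourly = {"logs_h00": [], "logs_h01": [], "logs_h02": []}
alias = ["logs_h02", "logs_h01", "logs_h00"]  # newest ("current") listed first

def write_current_behavior(doc):
    # Today: an update through the alias lands only in the FIRST collection.
    hourly[alias[0]].append(doc)

def write_proposed_behavior(doc):
    # Proposed multi-write: the same doc lands in EVERY collection --
    # what a reindex alias wants, but wrong for time-partitioned data,
    # where each document belongs in exactly one hourly collection.
    for name in alias:
        hourly[name].append(doc)

write_current_behavior({"id": "x"})   # goes to logs_h02 only
write_proposed_behavior({"id": "y"})  # duplicated into all three
```

This is why a special flag (or a second alias kind) would be needed: the two behaviors are each correct for a different use-case, and neither can be the universal default.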
Re: Multiple collections for a write-alias
Aliases can already point to multiple collections -- have you just tried that? I'm not totally sure what the behavior would be, but nothing you've written indicates you tried it, so I thought I'd point it out. It's not clear to me how useful this is, though: what failure messages are returned? How do you figure out which collection failed? How would you take remedial action?

Best,
Erick
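Erick's questions about failure reporting are the crux of client-side dual writes. Here is a hedged sketch of the per-collection retry bookkeeping a client would need today -- `send` is a hypothetical stand-in for an HTTP update call to Solr, not a real client API:

```python
# Sketch of the per-collection retry bookkeeping a dual-writing client must
# keep, since a write can succeed in one collection and fail in the other.

def dual_write(doc, collections, send, max_retries=3):
    """Try to write doc to every collection; return those that still failed."""
    pending = list(collections)
    for _ in range(max_retries):
        still_failing = []
        for name in pending:
            try:
                send(name, doc)
            except IOError:
                still_failing.append(name)  # retry only this collection
        pending = still_failing
        if not pending:
            break
    return pending  # non-empty means the collections are now inconsistent

# Simulated transport: "new_coll" fails once, then recovers.
failures = {"new_coll": 1}
store = {"old_coll": [], "new_coll": []}

def flaky_send(name, doc):
    if failures.get(name, 0) > 0:
        failures[name] -= 1
        raise IOError("connection reset")
    store[name].append(doc)

failed = dual_write({"id": "1"}, ["old_coll", "new_coll"], flaky_send)
```

Note that the successful collection is not rewritten on retry, so the retry path itself must avoid re-introducing the inconsistency it is trying to fix -- exactly the logic a server-side multi-write alias would have to own.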