[ 
https://issues.apache.org/jira/browse/SOLR-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262483#comment-15262483
 ] 

Hrishikesh Gadre commented on SOLR-9038:
----------------------------------------

Hi [~dsmiley] thanks for the comments :)

>>I presume by "snapshot", we're talking about named (or numbered) Lucene 
>>IndexCommit objects across all replicas of a Solr Collection? And then, in 
>>SOLR-5750 or future patch, the "backup" capability might optionally make 
>>reference to a named snapshot instead of just taking the last IndexCommit?

Yes that is correct.

>>And in some separate issue, a rollback ability, I presume.

I am thinking of using the "restore" capability for this (SOLR-5750). The idea 
here is that if a "snapshot" needs to be restored, it should first be exported 
to a separate location (an exported snapshot is equivalent to a backup). Since 
"rollback" would be less frequent than snapshot "creation", it should be 
acceptable to use the "restore" workflow, even if it is less efficient, for 
the sake of simplicity and uniformity. But we can always revisit this if 
use cases call for it.

>>Perhaps another way to view this feature proposed here is to have a commit 
>>optionally include a persistent name (or variable name-value metadata for 
>>that matter) that will be included with the IndexCommit that is persisted. 
>>That would be a somewhat simple way to think of this feature, and needn't 
>>involve any SolrCloud related stuff. Of course this data would need to 
>>flow-through in all the places commit boolean does, which is a lot of places, 
>>but I don't think it would be hard/complicated.

I am thinking of defining new APIs at the collection and core levels 
(CREATESNAPSHOT/DELETESNAPSHOT/LISTSNAPSHOTS). The collection-level 
"CREATESNAPSHOT" operation would be implemented in the Overseer (just like 
BACKUP/RESTORE). The only difference is that it would invoke the core-level 
"CREATESNAPSHOT" API on each shard leader replica (instead of the BACKUP 
API). It would also copy the ZK configuration to the specified location.
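
To make the fan-out above concrete, here is a minimal Java sketch. It is only 
an illustration under stated assumptions, not the proposed implementation: the 
Replica and CoreSnapshotClient types and the createCollectionSnapshot method 
are hypothetical names made up for this example and do not exist in Solr. The 
ZK configuration copy is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Solr classes.
final class SnapshotFanOutSketch {

    /** Illustrative stand-in for one replica of a shard. */
    static final class Replica {
        final String shard;
        final String coreName;
        final boolean leader;

        Replica(String shard, String coreName, boolean leader) {
            this.shard = shard;
            this.coreName = coreName;
            this.leader = leader;
        }
    }

    /** Illustrative stand-in for the proposed core-level CREATESNAPSHOT call. */
    interface CoreSnapshotClient {
        void createSnapshot(String coreName, String snapshotName);
    }

    /**
     * Calls the core-level operation once per shard leader and skips all
     * other replicas. Returns a shard -> leader core map of what was
     * snapshotted, in the order the leaders were visited.
     */
    static Map<String, String> createCollectionSnapshot(
            List<Replica> replicas, String snapshotName, CoreSnapshotClient client) {
        Map<String, String> snapshotted = new LinkedHashMap<>();
        for (Replica replica : replicas) {
            if (!replica.leader) {
                continue; // only shard leaders take part
            }
            client.createSnapshot(replica.coreName, snapshotName);
            snapshotted.put(replica.shard, replica.coreName);
        }
        return snapshotted;
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
                new Replica("shard1", "coll_shard1_replica1", true),
                new Replica("shard1", "coll_shard1_replica2", false),
                new Replica("shard2", "coll_shard2_replica1", false),
                new Replica("shard2", "coll_shard2_replica2", true));

        // Record the calls instead of talking to a real core.
        List<String> calls = new ArrayList<>();
        Map<String, String> result = createCollectionSnapshot(
                replicas, "before-experiment", (core, name) -> calls.add(core + "@" + name));

        System.out.println(result);
        System.out.println(calls);
    }
}
```

In a real implementation the client would presumably issue a core admin 
request to the node hosting each leader; that part is deliberately stubbed out.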

Once the snapshot is created for an index commit, the corresponding files will 
be available for download. This download can be implemented without going 
through the Overseer, e.g.:
-> If Solr is running on a Hadoop/HDFS cluster, we can use the distcp tool to 
copy the files.
-> We can use the replication handler functionality to copy the files (this 
can be wrapped as a Solr API or a command-line tool).
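
As a rough sketch of what "download" means here: once a commit is protected by 
a snapshot, exporting it amounts to copying exactly the files that commit 
references, which any external tool can do. The Java below is a local 
stand-in for that idea using only java.nio; it is not distcp and not the 
replication handler. The class and method names are hypothetical, and the 
list of file names is assumed to be supplied by the caller (in Lucene it 
would come from IndexCommit.getFileNames()).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collection;

// Hypothetical sketch: a local copy standing in for an external copy tool.
final class SnapshotExportSketch {

    /**
     * Copies only the files that belong to the snapshotted commit from the
     * index directory to the export directory. Other files in the index
     * directory (e.g. segments from later commits) are not touched.
     * Returns the number of files copied.
     */
    static int exportSnapshot(Path indexDir, Collection<String> commitFiles, Path exportDir)
            throws IOException {
        Files.createDirectories(exportDir);
        int copied = 0;
        for (String name : commitFiles) {
            Files.copy(indexDir.resolve(name), exportDir.resolve(name),
                    StandardCopyOption.REPLACE_EXISTING);
            copied++;
        }
        return copied;
    }
}
```

The point of the sketch is that nothing in the copy step needs the Overseer: 
it only needs the index location and the file list for the named commit.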

I am not quite sure how we would capture the collection metadata if we used 
the "commit" workflow for snapshot creation.

> Ability to create/delete/list snapshots for a solr collection
> -------------------------------------------------------------
>
>                 Key: SOLR-9038
>                 URL: https://issues.apache.org/jira/browse/SOLR-9038
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Hrishikesh Gadre
>
> Work is currently under way to implement a backup/restore API for SolrCloud 
> (SOLR-5750). SOLR-5750 is about providing the ability to "copy" index files 
> and collection metadata to a configurable location. 
> In addition to this, we should also provide a facility to create "named" 
> snapshots for a Solr collection. Here by "snapshot" I mean configuring the 
> underlying Lucene IndexDeletionPolicy to not delete a specific commit point 
> (e.g. using PersistentSnapshotDeletionPolicy). This should not be 
> confused with SOLR-5340, which implements core-level "backup" functionality.
> The primary motivation for this feature is to decouple recording/preserving a 
> known consistent state of a collection from actually "copying" the relevant 
> files to a physically separate location. This decoupling has a number of 
> advantages:
> - We can use specialized data-copying tools for transferring Solr index 
> files. For example, in a Hadoop environment, the 
> [distcp|https://hadoop.apache.org/docs/r1.2.1/distcp2.html] tool is typically 
> used to copy files from one location to another. This tool provides various 
> options to configure the degree of parallelism and bandwidth usage, as well as 
> integration with different types and versions of file systems (e.g. AWS S3, 
> Azure Blob store etc.)
> - This separation of concerns would also help Solr focus on its key 
> functionality (i.e. querying and indexing) while delegating the copy 
> operation to tools built for that purpose.
> - Users can decide if/when to copy the data files, independently of creating 
> a snapshot. For example, a user may want to create a snapshot of a collection 
> before making an experimental change (e.g. updating/deleting docs, a schema 
> change etc.). If the experiment is successful, they can delete the snapshot 
> (without having to copy the files). If the experiment fails, they can copy 
> the files associated with the snapshot and restore.
> Note that the Apache Blur project also provides a similar feature: 
> [BLUR-132|https://issues.apache.org/jira/browse/BLUR-132]


