Hi there,
This thread is to address John's comments about missing error handling for S3 as
secondary storage in the object_store branch implementation. From the previous merge
email thread, I realize that we may not have explained clearly in the FS how S3 should
work in the new object_store branch, which has caused several confusions. Let's make it
clear here.

1. The goal of the object_store branch is to make S3 serve as NATIVE secondary
storage, not just a backup device behind NFS secondary storage as in the master branch.
We want users to be able to trust that their data (templates, snapshots, volumes)
is stored in the S3 object store if they choose S3 as their CloudStack secondary
storage. When users register a template to S3, we issue S3 API calls to download the
template directly into the S3 object store, instead of downloading it to NFS
secondary storage and then syncing it to S3 on a schedule as the master branch does
(a rough sketch of what "issuing the S3 API directly" means follows this paragraph).
When we tell users that their data is READY on their S3 secondary storage, it
really means that it is ready to use from S3. Master cannot make this guarantee:
with S3 as a backup device, a snapshot may only be ready on NFS secondary storage
and not in S3 because of a network connection issue, yet we still mislead users
into thinking their snapshot is ready on S3.
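
To make point 1 concrete, here is a minimal sketch of what issuing the S3 API
directly looks like with the AWS SDK for Java. This is not the actual object_store
code; the class, bucket, and key names are made up for illustration. The idea is
simply that the template stream goes straight into the S3 object store, and READY
means this call succeeded rather than that a later background sync will run.

    import java.io.InputStream;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    public class TemplateToS3Sketch {
        // Push a registered template straight into the S3 object store.
        // No NFS secondary storage copy is kept as the source of truth.
        public static void registerTemplate(String accessKey, String secretKey,
                String bucket, String templateKey,
                InputStream templateStream, long contentLength) {
            AmazonS3 s3 = new AmazonS3Client(
                    new BasicAWSCredentials(accessKey, secretKey));
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(contentLength);
            // READY means this put succeeded, not that a sync will happen later.
            s3.putObject(bucket, templateKey, templateStream, meta);
        }
    }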

2. The NFS cache only comes into the picture when users choose S3 as their native
secondary storage. The data stored in the NFS cache is truly temporary and serves
as an intermediate transfer stage for CloudStack to manipulate data stored in S3;
our design has no requirement that this intermediate data persist in the NFS cache
forever for CloudStack to remain functional. This is quite different from the role
of NFS secondary storage for S3 in the master branch, where we have to keep data on
NFS secondary storage because we cannot guarantee that data is READY on S3, due to
the background sync issue I will mention in a minute. Theoretically speaking, we
should be able to implement a simple LRU or FIFO cache algorithm (assuming the 4.2
feature freeze extension vote passes) to age out old cache data without impacting
any CloudStack functionality that uses S3; a sketch follows this paragraph. I am
not sure the same is true for the NFS secondary storage data for S3 in the master
branch; from my reading of the code it is not, but maybe I am just too new to this
part of the master code.
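
For illustration only, here is a rough sketch of the kind of LRU aging I have in
mind for the NFS cache. The class and method names are hypothetical, not existing
object_store code. Because every entry is just an intermediate copy whose source of
truth is S3, the eldest entry can simply be deleted once a size budget is exceeded.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class NfsCacheLruSketch {
        private final long capacityBytes;
        private long usedBytes = 0;
        // access-order LinkedHashMap keeps least recently used entries first
        private final LinkedHashMap<String, Long> entries =
                new LinkedHashMap<String, Long>(16, 0.75f, true);

        public NfsCacheLruSketch(long capacityBytes) {
            this.capacityBytes = capacityBytes;
        }

        // Record that a cache entry (a staged template/snapshot/volume file)
        // was created or used, then age out the oldest entries if over budget.
        public synchronized void recordUse(String cachePath, long sizeBytes) {
            Long previous = entries.put(cachePath, sizeBytes);
            usedBytes += sizeBytes - (previous == null ? 0 : previous);
            while (usedBytes > capacityBytes && entries.size() > 1) {
                Map.Entry<String, Long> eldest =
                        entries.entrySet().iterator().next();
                String path = eldest.getKey();
                long size = eldest.getValue();
                entries.remove(path);
                usedBytes -= size;
                deleteFromNfsCache(path); // safe: S3 still holds the real copy
            }
        }

        private void deleteFromNfsCache(String cachePath) {
            // placeholder: remove the staged file from the NFS cache mount
        }
    }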

3. We have to admit that in the current object_store implementation, we only try
each S3 operation (put, get, etc.) once; if it fails, we just report an error and
the user has to retry manually. On this aspect we can definitely do better by
adding a retry mechanism based on a globally configured retry parameter (sketched
below). However, in my past experience, infinite retries against these external
devices are always a bad idea. Also, we disagree with John's comment that dropping
the previous background sync process is "a step back from the current Swift and S3
implementations present in 4.1.0". We agree that the master background sync process
relieves the admin from manual retries after some S3 errors (BTW, some errors will
never recover even with the background process, for example, capacity full), but it
also causes another severe drawback: it gives users the misconception that their
data is READY in S3 when it actually is not. Here is a simple example: users take a
snapshot in one zone and back it up to S3; given S3's region-wide nature, it is
natural for them to expect that they can immediately restore this snapshot in
another zone. With the current master implementation, this may fail: due to an S3
network connection issue at backup time, the snapshot may only be stored on the
zone-wide NFS secondary storage, not on S3, and the background sync process has not
kicked in yet. If users now try to restore, the operation is doomed to fail because
the proper snapshot cannot be found. In our opinion, enhancing the current
object_store implementation with some configurable retry logic is a good
compromise.
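
Here is a rough sketch of the bounded retry we are proposing. The helper and the
idea of wiring maxRetries to a global configuration value (something like a
hypothetical "s3.operation.max.retries" setting) are illustrative, not code that
exists today. The point is that the retry budget is finite, and once it is
exhausted the error is surfaced to the user exactly as it is now.

    import java.util.concurrent.Callable;

    public class S3RetrySketch {
        // Run an S3 operation at most (maxRetries + 1) times, then give up and
        // surface the last error. Assumes maxRetries >= 0, taken from a global
        // configuration parameter.
        public static <T> T withRetries(Callable<T> s3Op, int maxRetries,
                long backoffMillis) throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    return s3Op.call();
                } catch (Exception e) {
                    last = e; // e.g. a transient network error
                    if (attempt < maxRetries) {
                        Thread.sleep(backoffMillis * (attempt + 1));
                    }
                }
            }
            throw last; // bounded: report the error instead of retrying forever
        }
    }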

Thanks.
-min
