A night's sleep helped, and I have worked out what happened:

- data and data-backup were both mounted, so CIFS sharing was active on both
- after the shutdown on Fri 15 and the reboot on Sat 16, the Windows machines mounted the data-backup shares instead of the usual data shares
- the Mac OS X machines mounting the stile CIFS share correctly mounted the data one
- so the Windows users were silently working on data-backup instead of data
- on Tue 19, re-running the zfs send/recv from data to data-backup destroyed everything the Windows users had done since Monday
- this is why the data snapshots show no changes since Sat 16, apart from the Mac work on stile

What I have to do (this will in any case lose any data changed from Monday up to the resync):

- halt all work from the Windows users
- find every file on data-backup modified after 15 Nov 2013 and copy it back into data (a rough command sketch follows below)
- export data-backup
- import data-backup with -N (no mount, no sharing)
- reboot the storage and check that only the correct shares are exposed
- let people work on the shares again and verify they are really working on data

I was lucky: the Windows users have only a few files there (around 30 GB).

Gabriele.
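A minimal, untested sketch of those steps. The mountpoints /data and /data-backup, the /tmp/ref reference file, the exact reference timestamp and the use of cpio for the copy are my assumptions; adjust to the real layout before running anything:

    # confirm the suspected root cause: which datasets are mounted and shared
    zfs list -r -o name,mounted,mountpoint,sharesmb data data-backup

    # reference timestamp: midnight at the end of 15 Nov 2013
    touch -t 201311160000 /tmp/ref

    # list files on the backup pool modified after that date
    cd /data-backup/windows
    find . -type f -newer /tmp/ref

    # copy those files back onto the live pool, preserving paths and mtimes
    find . -type f -newer /tmp/ref | cpio -pdmu /data/windows

    # take the backup pool out of the way: export it, then re-import it
    # without mounting (and therefore without sharing) anything
    zpool export data-backup
    zpool import -N data-backup

The export/import -N pair matches the plan above: after the -N import nothing from data-backup is mounted or shared until it is mounted again explicitly.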
From: Gabriele Bulfon
To: [email protected]
Cc: Raffaele Fullone
Date: 20 November 2013 21:01:38 CET
Subject: [discuss] Serious ZFS problem

Hi,

I'm in a very serious situation with an illumos-based ZFS storage system, one that may make the whole solution look insane... I really don't know what happened, so I'll try to describe the history of the last week.

First, the problem: I found part of the ZFS filesystem back at a past date (15 Nov), yet still carrying snapshots from the following days. Those later snapshots show 0 used, as if nothing had changed, even newly created snapshots show 0 used, as if nothing were changing, and they all contain the situation at that past date.

The system has a zpool with a "data" filesystem, divided into:
- data/stile
- data/windows
- data/windows/* (several different datasets)
These are all shared to the AD domain via CIFS.

The system takes periodic recursive snapshots of the data pool every two hours during the day, retained for a week, plus one every Saturday, retained for a month. Because it's a recent installation, we have not yet automated the zfs send/recv to a backup system, so I run the send/recv manually every 2-3 days.

On that day, 15 Nov at 17:xx, I created a "backup10" snapshot (knowing I already had backup9), sent the 9-to-10 increment to the receiving system, and then deleted the old snapshot (backup9) on the origin:

zfs snapshot -r data@backup10
zfs send -Rv -i data@backup9 data@backup10 | zfs receive -Fd data-backup
zfs destroy -r data@backup9

Everything went fine until yesterday, 19 Nov, when I wanted to resync the remote system:

zfs snapshot -r data@backup11
zfs send -Rv -i data@backup10 data@backup11 | zfs receive -Fd data-backup
zfs destroy -r data@backup10

Many more snapshots were present in between backup10 and backup11.

A few minutes later I was called because something was missing, and after some analysis I found that some live filesystems were back at 15 Nov... as if the system had been rolled back to the snapshot right after backup10 (the one I destroyed). But not all of them: just data/windows/*, not data/stile.

More: all the snapshots after that date, and even the new ones being created right now, show 0 used, as if nothing were changing. But people are working; luckily that filesystem is not used that much, so only a handful of files had to be redone by hand. ZFS just doesn't show it.

Look at one of them:

data/windows/commerciale 27.5G 1.20T 27.4G /data/windows/commerciale
data/windows/commerciale@SATURDAY_2013-11-09_20:00:03 4.73M - 27.4G -
data/windows/commerciale@DAILY_2013-11-13_09:00:00 669K - 27.4G -
data/windows/commerciale@DAILY_2013-11-13_11:00:01 218K - 27.4G -
data/windows/commerciale@DAILY_2013-11-13_13:00:01 212K - 27.4G -
data/windows/commerciale@DAILY_2013-11-13_15:00:01 190K - 27.4G -
data/windows/commerciale@DAILY_2013-11-13_17:00:01 192K - 27.4G -
data/windows/commerciale@DAILY_2013-11-13_19:00:01 43K - 27.4G -
data/windows/commerciale@DAILY_2013-11-14_09:00:01 45K - 27.4G -
data/windows/commerciale@DAILY_2013-11-14_11:00:01 184K - 27.4G -
data/windows/commerciale@DAILY_2013-11-14_13:00:01 175K - 27.4G -
data/windows/commerciale@DAILY_2013-11-14_15:00:01 204K - 27.4G -
data/windows/commerciale@DAILY_2013-11-14_17:00:01 202K - 27.4G -
data/windows/commerciale@DAILY_2013-11-14_19:00:02 72K - 27.4G -
data/windows/commerciale@DAILY_2013-11-15_09:00:01 214K - 27.4G -
data/windows/commerciale@DAILY_2013-11-15_11:00:01 194K - 27.4G -
data/windows/commerciale@DAILY_2013-11-15_13:00:01 202K - 27.4G -
data/windows/commerciale@DAILY_2013-11-15_15:00:01 178K - 27.4G -
data/windows/commerciale@DAILY_2013-11-15_17:00:02 376K - 27.4G -
**here was my backup10 that I destroyed**
data/windows/commerciale@DAILY_2013-11-15_19:00:01 53K - 27.4G -
data/windows/commerciale@SATURDAY_2013-11-16_20:00:03 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-18_09:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-18_11:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-18_13:00:02 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-18_15:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-18_17:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-18_19:00:00 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-19_09:00:02 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-19_11:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-19_13:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-19_15:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-19_17:00:01 0 - 27.4G -
data/windows/commerciale@backup11 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-19_19:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-20_09:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-20_13:00:02 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-20_15:00:01 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-20_17:00:02 0 - 27.4G -
data/windows/commerciale@DAILY_2013-11-20_19:00:01 0 - 27.4G -

Look at the healthy one (stile):

data/stile 1.42T 595G 1.41T /data/stile
data/stile@SATURDAY_2013-11-09_20:00:03 143M - 1.41T -
data/stile@DAILY_2013-11-13_09:00:00 41.1M - 1.41T -
data/stile@DAILY_2013-11-13_11:00:01 9.44M - 1.41T -
data/stile@DAILY_2013-11-13_13:00:01 12.1M - 1.41T -
data/stile@DAILY_2013-11-13_15:00:01 19.2M - 1.41T -
data/stile@DAILY_2013-11-13_17:00:01 12.1M - 1.41T -
data/stile@DAILY_2013-11-13_19:00:01 6.07M - 1.41T -
data/stile@DAILY_2013-11-14_09:00:01 708K - 1.41T -
data/stile@DAILY_2013-11-14_11:00:01 3.10M - 1.41T -
data/stile@DAILY_2013-11-14_13:00:01 24.4M - 1.41T -
data/stile@DAILY_2013-11-14_15:00:01 34.9M - 1.41T -
data/stile@DAILY_2013-11-14_17:00:01 704M - 1.41T -
data/stile@DAILY_2013-11-14_19:00:02 906K - 1.41T -
data/stile@DAILY_2013-11-15_09:00:01 778K - 1.41T -
data/stile@DAILY_2013-11-15_11:00:01 14.9M - 1.41T -
data/stile@DAILY_2013-11-15_13:00:01 1.81M - 1.41T -
data/stile@DAILY_2013-11-15_15:00:01 1.74M - 1.41T -
data/stile@DAILY_2013-11-15_17:00:02 37.9M - 1.41T -
**here was my backup10 that I destroyed**
data/stile@DAILY_2013-11-15_19:00:01 510M - 1.41T -
data/stile@SATURDAY_2013-11-16_20:00:03 0 - 1.41T -
data/stile@DAILY_2013-11-18_09:00:01 0 - 1.41T -
data/stile@DAILY_2013-11-18_11:00:01 180M - 1.41T -
data/stile@DAILY_2013-11-18_13:00:02 19.4M - 1.41T -
data/stile@DAILY_2013-11-18_15:00:01 11.6M - 1.41T -
data/stile@DAILY_2013-11-18_17:00:01 3.64M - 1.41T -
data/stile@DAILY_2013-11-18_19:00:00 42K - 1.41T -
data/stile@DAILY_2013-11-19_09:00:02 42K - 1.41T -
data/stile@DAILY_2013-11-19_11:00:01 41.1M - 1.41T -
data/stile@DAILY_2013-11-19_13:00:01 31.1M - 1.41T -
data/stile@DAILY_2013-11-19_15:00:01 5.66M - 1.41T -
data/stile@DAILY_2013-11-19_17:00:01 35.2M - 1.41T -
data/stile@backup11 36.7M - 1.41T -
data/stile@DAILY_2013-11-19_19:00:01 136K - 1.41T -
data/stile@DAILY_2013-11-20_09:00:01 119K - 1.41T -
data/stile@DAILY_2013-11-20_13:00:02 16.0M - 1.41T -
data/stile@DAILY_2013-11-20_15:00:01 134M - 1.41T -
data/stile@DAILY_2013-11-20_17:00:02 7.15M - 1.41T -
data/stile@DAILY_2013-11-20_19:00:01 113K - 1.41T -

As you can see, on the healthy one the data changes on every snapshot. On the failing one, the data is back at the last snapshot without a "0".

People need an explanation, and I can't give one. More to the point, how do I get it back to normal? I don't even have new valid snapshots...

Help!

Gabriele.
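One way to double-check which datasets really diverged after the 15 Nov snapshots, sketched under the assumption that this illumos build has zfs diff (the dataset and snapshot names are the ones from the listings above):

    # the healthy dataset should list plenty of added/modified files
    zfs diff data/stile@DAILY_2013-11-15_19:00:01 data/stile

    # the rolled-back dataset should print nothing at all
    zfs diff data/windows/commerciale@DAILY_2013-11-15_19:00:01 data/windows/commerciale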
