2011-06-10 18:00, Steve Gonczi writes:
Hi Jim,

I wonder what OS version you are running.

There was a problem similar to what you are describing in earlier
versions, in the 13x kernel series.

It should not be present in the 14x kernels.

It is OpenIndiana oi_148a, and unlike many other details -
this one was in my email post today ;)

I missed the system parameters in your earlier emails.

Other config info: one dual-core P4 @2.8GHz or so, 8GB RAM
(max for the motherboard), 6*2TB Seagate ST2000DL003 disks
in raidz2, plus an old 80GB disk for the OS and swap.

This system turned overnight from a test box into a backup
of some of my old-but-needed files, and then into their only
storage after the original server was cleaned. So I do want
this box to work reliably and not lose the data which is on
it already, and without the roughly 1.5x dedup ratio I am
running close to not fitting into these 8TB ;)

The usual recommended solution is
"oh, just do not use dedup, it is not production-ready".

Well, alas, I am coming to the same conclusion so far...
I have never seen any remotely similar issues on any other
servers I maintain, but this one is the first guinea-pig
box for dedup. I guess for the next few years (until 128GB
RAM or so becomes the norm for a cheap home enthusiast
NAS) it may be the last one, too ;)

Using dedup is more or less hopeless with less than 16GB of memory
(more is better).

Well... 8GB RAM is not as low-end a configuration as some
others discussed in home-NAS-with-ZFS blogs, whose authors
claimed to have used dedup when it first came out. :)

Since there is little real information about DDT appetites
so far (this list is still buzzing with calculations and
tests), I assumed that 8GB RAM is a reasonable amount for
starters. At least it would have been for, say, Linux or
other open-source enthusiast communities, which have to
make do with whatever crappy hardware they got second-hand ;)
I was in that camp, and I know that personal budgets do
not often allow more than $1-2k per box. Until recently
that bought very little RAM. And as I said, this is the
maximum that can be put into my motherboard anyway.
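
For the record, the one gauge I know of is zdb, which can print
a pool's DDT histogram and let one estimate the in-core appetite
(how trustworthy its numbers are while the pool is this busy is
another question):

    # print dedup statistics / DDT histogram for an existing pool
    zdb -DD pool
    # simulate what dedup would cost on a pool without dedup enabled
    zdb -S pool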

Anyhow, this box has two pools at the moment:
* "pool" is the physical raidz2 on 6 disks with ashift=12
  Some datasets on pool are deduped and compressed (lzjb).
* "dcpool" is built in a compressed volume inside "pool",
  which is loopback-mounted over iSCSI and in the resulting
  disk I made another pool with deduped datasets.

This "dcpool" was used to test the idea about separating
compression and deduplication (so that dedup decisions
are made about raw source data, and after that whatever
has to be written is compressed - once).

The POC worked somewhat - except that I used small block
sizes in the "pool/dcpool" volume, so the ZFS metadata needed
to address the volume blocks takes up about as much space as
the user data. A rough sketch of the construction follows.
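
The construction was roughly along the lines of the sketch
below (from memory; the size, block size, compression setting
and device names are illustrative placeholders, not the exact
values I used):

    # a compressed, (too-)small-blocked volume inside the main pool
    zfs create -V 4T -o compression=lzjb -o volblocksize=8k pool/dcpool
    # export it over COMSTAR iSCSI and loop it back to the same host
    sbdadm create-lu /dev/zvol/rdsk/pool/dcpool
    stmfadm add-view <lu-guid-from-sbdadm-output>
    itadm create-target
    iscsiadm add discovery-address 127.0.0.1:3260
    iscsiadm modify discovery -t enable
    devfsadm -i iscsi
    # build the deduped (but not compressed) pool on the new LUN
    zpool create -O dedup=on -O compression=off dcpool <new-iscsi-disk>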

Performance was abysmal - around 1MB/s writing into "dcpool"
lately, and not much faster reading it, so for the past month
I have been trying to evacuate my data back from "dcpool" into
"pool", which performs faster - about 5-10MB/s during these
copies between pools. According to iostat (watched roughly as
shown below), the physical harddisks are quite busy (over 60%)
while "dcpool" is often stuck at 100% busy with several
seconds(!) of wait times and zero IO operations. The physical
"pool" datasets without dedup performed at a "wirespeed" of
40-50MB/s when I was copying files over CIFS from another
computer over a home gigabit LAN. Since this box is primarily
archive storage with maybe a few datasets dedicated to
somewhat active data (updates to a photo archive), slow
but infrequent IO to deduped data is okay with me --
as long as the system does not crash as it does now.
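
The iostat invocation in question is nothing fancy, something like:

    # extended per-device stats, skipping idle devices, every 5 seconds
    iostat -xnz 5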

This is partially why I bumped the TXG sync interval up
to 30 seconds - so that after slow IO, ZFS would have more
data in its buffers to coalesce, hopefully minimizing the
fragmentation and mechanical IOPS involved.
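
In case anyone wants to reproduce the setting: I believe the
usual tunable for this is zfs_txg_timeout, set in /etc/system
(assuming the name has not changed in recent builds):

    * raise the TXG sync interval to 30 seconds
    set zfs:zfs_txg_timeout = 30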


The issue was an incorrectly sized buffer that caused ZFS to wait
too long for a buffer allocation. I can dig up the bug number and
the fix description if you are running something 130-ish.

Unless this fix was not integrated into OI for some reason,
I am afraid digging it up would be of limited help. Still,
I would be interested to read the summary, postmortem and
workarounds.

Maybe this was broken again in OI by newer "improvements"? ;)

The thing I would want to check is the sync times and frequencies.
You can dtrace (and timestamp) this.

Umm... could you please suggest a script, preferably one
that I can leave running on the console, printing stats
every second or so?

Off the top of my head I can only think of "time sync",
and I suspect the result would be measured in seconds or more.
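
The closest I could get to a proper script myself is an untested
sketch along these lines (assuming the fbt provider exposes
spa_sync on this kernel; run as root):

    #!/usr/sbin/dtrace -s
    #pragma D option quiet

    /* log how long every TXG sync (spa_sync call) takes */
    fbt::spa_sync:entry
    {
            self->ts = timestamp;
    }

    fbt::spa_sync:return
    /self->ts/
    {
            printf("%Y  sync took %d ms\n", walltimestamp,
                (timestamp - self->ts) / 1000000);
            self->ts = 0;
    }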


I would suspect that when the bad state occurs, your sync is
taking a _very_ long time.

When the VERY BAD state occurs, I can no longer use the
system or test anything ;)

When it nearly occurs, I have only a few seconds of uptime
left, and since each boot-to-crash run takes roughly 2-3
hours now, I am unlikely to be at the console during those
few critical seconds.

And the "sync" would likely never return in this case, either.

Deletes, and dataset/snapshot destroys, are not managed
correctly in a deduped environment in ZFS. This is a known
problem, although it should not be anywhere near as bad as
what you are describing in the current tip.

Well, it is, and on hardware that is not the lowest-end
(at least in terms of what OpenSolaris developers can expect
from the general enthusiast community which is supposed to
help by testing, deploying and co-developing the best OS).

The part where such deletes are slow is understandable
and explainable - I do not have any big performance
expectations for the box, and 10MB/s is quite fine
with me here. The part where it leads to crashes and
hung system programs (zfs, zpool, etc.) is unacceptable.


The startup delay you are seeing is another "feature" of ZFS: if
you reboot in the middle of a large file delete or dataset destroy,
ZFS (and the OS) will not come up until it finishes that delete or
destroy first.

Why can't it be an intensive, but background, operation?
Import the pool, let it be used, and go on deleting...
like it was supposed to happen back in that lifetime when
the box first began deleting these blocks ;)

Well, it took me a worrisome while to figure this out
the first time, a couple of months ago. Now I am just
rather annoyed by the lack of access to my box and
data, but I hope that it will come around after several
retries.

Apparently, this unpredictability (and slowness and
crashes) is a show-stopper for any enterprise use.

I have made workarounds for the OS to come up okay,
though. Since the root pool is separate, I removed
"pool" and "dcpool" from the zpool.cache file, and now the
OS milestones do not depend on them being available.

Instead, importing "pool" (with cachefile=none),
starting the iSCSI target and initiator, creating and
removing the LUN with sbdadm, and importing "dcpool"
are all wrapped in several SMF services, so I can
relatively easily control the presence of these pools
(I can disable them from autostart by touching a file
in the /etc directory). A sketch of such a start method
is below.
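
The start method of the main wrapper service boils down to
something like this sketch (from memory; the guard-file path
is a placeholder, not the literal one I use):

    #!/sbin/sh
    # sketch of an SMF start method wrapping the data pools
    . /lib/svc/share/smf_include.sh

    # touch this file to keep the pools from importing on boot
    [ -f /etc/zfs-noimport-datapools ] && exit $SMF_EXIT_OK

    # import the physical pool without updating zpool.cache
    zpool import -o cachefile=none pool || exit $SMF_EXIT_ERR_FATAL

    # bring up the COMSTAR target and re-create the LUN over the volume
    svcadm enable -rs svc:/network/iscsi/target:default
    sbdadm create-lu /dev/zvol/rdsk/pool/dcpool

    # let the initiator (re)discover the LUN, then import the inner pool
    devfsadm -i iscsi
    zpool import -o cachefile=none dcpool || exit $SMF_EXIT_ERR_FATAL

    exit $SMF_EXIT_OK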

Steve



----- "Jim Klimov" <jimkli...@cos.ru> wrote:

    I've captured an illustration for this today, with my watchdog as
    well as vmstat, top and other tools. Half a gigabyte in under one
    second - the watchdog never saw it coming :(




--


+============================================================+
|                                                            |
| Климов Евгений,                                 Jim Klimov |
| технический директор                                   CTO |
| ЗАО "ЦОС и ВТ"                                  JSC COS&HT |
|                                                            |
| +7-903-7705859 (cellular)          mailto:jimkli...@cos.ru |
|                          CC:ad...@cos.ru,jimkli...@mail.ru |
+============================================================+
| ()  ascii ribbon campaign - against html mail              |
| /\                        - against microsoft attachments  |
+============================================================+


