Since this topic refuses to die, I would like to have a thread where the pros and cons of memory usage limits are explained in detail. I can then tell people to read this thread first.
People have strong and conflicting opinions, and it's not possible to make everyone happy. I'm not proposing any changes, but I might consider some tweaks depending on what people think.

The default memory usage limits were removed in 5.0.0. The problem with the limits was that xz could refuse to decompress some files on a low-memory system even if there was enough free swap space to get the job done. There are situations where it's useful to have a limiter for decompression, but it's clear that single-threaded decompression must not have any limit by default.

Unsurprisingly, the removal of the memory usage limits had the opposite effect when compressing. With a default limit, xz used to adjust the compression settings down on a low-memory system. Now it can fail on the same system, requiring the user to adjust the settings manually or to set a memory usage limit for compression.

Some people don't like that xz doesn't "just work" on low-end systems. They think it's better that xz runs even if it means worse compression, because xz failing in the middle of a big script isn't fun. This mirrors the old decompression situation: with a default memory usage limit, it wasn't fun when xz failed in the middle of a big script either.

It has also been suggested that xz should check RLIMIT_DATA and RLIMIT_AS when compressing (but not when decompressing). They would be used as an implicit memory usage limit to prevent xz from trying to allocate more memory than the system will allow. Since xz will fail and exit if malloc() doesn't succeed, checking the resource limits would keep xz working as long as they are reasonably high.

There are reasons why it can be good that xz fails instead of "just working" when there's not enough RAM or the resource limits are too low.

xz prints a notice when it adjusts the compression settings. Since compression will still succeed, that notice can easily get lost when xz is run non-interactively. It may take a while until users figure out that the same script gives different output on different systems with identical software, and why it does that. Many people don't read logs unless something has already gone wrong.

If xz fails when there isn't enough RAM+swap or the resource limits are too low, it will be obvious to the user that xz cannot do exactly what it was asked to do. If the user has set low resource limits just in case, even when he has lots of RAM, getting an error will make him increase the limits. The notice that xz prints when it adjusts the settings should do that too when running xz interactively, but in the non-interactive case from the previous paragraph it's not so obvious.

Sometimes the adjusted settings will make the compression ratio much worse than it would be without adjusting. Sometimes that is fine because it keeps the compression speed sane. Sometimes it would be better if xz used the original settings even if that makes things very slow due to heavy swapping.

Sometimes repeatable compression is needed, that is, getting the same compressed output from the same input on multiple computers. Since the output of xz may vary between versions, this requires using the same xz version on every system. I think this is a somewhat unusual situation, although, if I remember correctly, it is needed e.g. by DeltaRPM to get the package signatures to match. It can be argued that signing compressed data isn't the best approach, but that's off-topic in this thread.
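Before going on, for concreteness, here is how the limits discussed above look on the command line. This is only a sketch; the option names are the ones documented in xz --long-help and the xz(1) man page, and "big.tar" is a placeholder:

    # Let xz scale the -9 settings down so that compression fits in 100 MiB:
    xz -9 --memlimit-compress=100MiB big.tar

    # A decompression limit can make xz refuse files that need more memory:
    xz --decompress --memlimit-decompress=100MiB big.tar.xz

    # Show how much RAM xz thinks the system has and the limits in effect:
    xz --info-memory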
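The RLIMIT suggestion can be illustrated the same way. Today a low resource limit just makes malloc() fail, so xz exits with an error. A rough demonstration using the shell's ulimit builtin (ulimit -v sets RLIMIT_AS in KiB in most shells; the exact error message depends on the xz version):

    # Run in a subshell so the limit doesn't stick to the login shell.
    # About 100 MiB of address space is far below what xz -9 needs,
    # so an allocation fails and xz exits with an error:
    ( ulimit -v 102400; xz -9 big.tar )

The suggestion above would have xz query such limits with getrlimit() and treat them like an implicit --memlimit-compress value, so the settings would be adjusted instead of xz failing.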
Back to repeatable compression: when it is needed, it's essential that xz doesn't change the compression settings, even if that means heavy swapping or the compression failing completely due to a failed malloc(). Having a limiter enabled by default is a bit risky here, because the limiter would get hit only on low-end systems, and thus developers could easily forget to add an option to explicitly disable the limiter for this use case. Documenting it doesn't help much, because many people don't read docs unless something has already gone wrong.

Some think that a UNIX program should blindly try to do whatever the user told it to do and fail only if the task simply cannot be done. Self-aware programs that adapt to their environment (especially when it affects the output) belong on Windows. ;-)

It can be argued that most of the above cases are not so common, and that the common case is that people want xz to just work even if it means suboptimal compression. It depends on what a person finds most important, and so there are conflicting wishes.

When the default limit was removed, the XZ_DEFAULTS environment variable was added to let people set default limits. It was very simple to add because there was already support for the XZ_OPT variable. Using the XZ_DEFAULTS environment variable to set default memory usage limits isn't liked by everyone who wants to enable limits:

- If you log in interactively with ssh, the shell startup scripts are executed and XZ_DEFAULTS will be set. But if ssh is used to run a remote command (e.g. "ssh myhost myprogram"), the startup scripts aren't read and XZ_DEFAULTS won't be there. (See the sketch at the end of this message.)

- /etc/profile or equivalent usually isn't executed by initscripts when starting daemons. Some daemons use xz.

- People don't want to pollute the environment with variables that affect only one program.

Having a configuration file would fix the above problems, but XZ Utils is already an over-engineered pig, so I'm not so eager to add config file support. I have thought about adding configure options that would allow setting default limits for compression and decompression. Someone may think that this can confuse things even more, but on the other hand, some people already patch xz to have a default limit for compression.

I haven't thought much about memory usage limits with threading, but below are some preliminary thoughts.

With compression, -T0 in 5.1.1alpha sets the number of threads to match the number of CPU cores. If no memory usage limit has been set, xz may end up using more memory than there is RAM. Pushing the system to swap with threading is silly, because the point of threading in xz is to make it faster. So it might make sense to have some kind of default soft limit that is used to limit the number of threads when an automatic number of threads is requested.

With threaded decompression (not implemented yet) and no memory usage limit, the worst case is that xz will try to read the whole input file into memory, which is silly. So it will probably need some sort of soft default limit to keep the maximum memory usage sane. The definition of sane is unclear, though; it's not necessarily the same as for compression.
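To make the ssh item in the list above concrete, here is a sketch (assuming XZ_DEFAULTS is exported from ~/.profile or an equivalent startup file):

    # In ~/.profile on myhost:
    export XZ_DEFAULTS="--memlimit-compress=50%"

    # Interactive login: the startup scripts run, so every xz started
    # from that shell sees the limit.
    ssh myhost

    # Remote command: the startup scripts aren't read, so xz runs
    # with no limit at all.
    ssh myhost 'xz big.tar'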
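And a sketch of the -T0 idea. The first command shows the current behavior; the second shows how a limit (given explicitly here; a default soft limit would work similarly) could cap the number of threads instead of pushing the system into swap:

    # One thread per CPU core; with -9 each thread needs hundreds of
    # MiB, so on a many-core box this can exceed the amount of RAM:
    xz -T0 -9 big.tar

    # With a limit, xz could start fewer threads so that the total
    # memory usage stays under the limit:
    xz -T0 -9 --memlimit-compress=80% big.tar

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode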