Regarding block size versus apparent size: like Pádraig, I think it's okay to let --size simply filter on whichever size the user has chosen. It can be combined with --apparent-size.
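
To make the two notions of size concrete, here is a minimal sketch in Python rather than du's own C code. The helper names `sizes` and `passes_filter` are hypothetical; the sketch assumes a POSIX system where `st_blocks` counts 512-byte units, and it treats the threshold as "greater or equal", as in the quoted examples.

```python
import os

def sizes(path):
    """Return (apparent_size, allocated_size) in bytes for path."""
    st = os.lstat(path)
    # Assumption: st_blocks is in 512-byte units, as on POSIX systems.
    return st.st_size, st.st_blocks * 512

def passes_filter(path, threshold, apparent=False):
    """Hypothetical stand-in for a --size threshold check."""
    apparent_size, allocated_size = sizes(path)
    chosen = apparent_size if apparent else allocated_size
    return chosen >= threshold
```

A 6-byte file typically occupies one whole file system block, so on many file systems it would pass a 4000-byte threshold on allocated size but not on apparent size.
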
I think having special characters in --size to denote max and min sizes is confusing. Why not have separate --max-size and --min-size arguments? This way you can filter by a range, and it's obvious what the flags mean. It's also more consistent in style with the --max-depth flag. If you absolutely want just one arg, what about --size=[minsize]-[maxsize]? E.g. --size=4K- filters output to entries greater than 4K, --size=-8K filters output to those less than 8K, and --size=4K-8K filters output to those between 4K and 8K.

liulk

On Thu, Jan 17, 2013 at 6:51 AM, Pádraig Brady <p...@draigbrady.com> wrote:
> On 01/17/2013 07:19 AM, Bernhard Voelker wrote:
> > On 01/17/2013 02:46 AM, Pádraig Brady wrote:
> >> On 01/17/2013 01:23 AM, Bernhard Voelker wrote:
> >>> I was pretty sure that this slipped also from Padraig's list.
> >>
> >> Sorry for the delay in this.
> >>
> >> Note it's still on the list:
> >> http://www.pixelbeat.org/patches/coreutils/inbox_dec_2012.html
> >>
> >> You can browse older news and subscribe to new updates at:
> >> http://www.pixelbeat.org/patches/coreutils/
> >
> > Thanks for the links.
> >
> >>> Therefore, I took Jakob's patch and amended it with documentation
> >>> and a comprehensive test. ;-)
> >>
> >> Wow great work on the test.
> >
> > Well, that test just grew and grew. It's actually a result of
> > me not being 100% happy with the --size option as in some
> > situations it might confuse people more than it may help:
> >
> > E.g. users usually tend to "think in apparent sizes" for their
> > files instead of block sizes.
> >
> > Having a directory like this:
> >
> >   $ find tmp -exec ls -dog '{}' +
> >   drwxr-xr-x 5      4096 Jan 17 07:28 tmp
> >   drwxr-xr-x 2      4096 Jan 17 07:29 tmp/big_dir
> >   -rw-r--r-- 1 104857600 Jan 17 07:29 tmp/big_dir/big_file
> >   drwxr-xr-x 2      4096 Jan 17 07:25 tmp/empty_dir
> >   drwxr-xr-x 2      4096 Jan 17 07:28 tmp/small_dir
> >   -rw-r--r-- 1         6 Jan 17 07:26 tmp/small_dir/small_file
> >   -rw-r--r-- 1         0 Jan 17 07:22 tmp/x0
> >   -rw-r--r-- 1         1 Jan 17 07:22 tmp/x1
> >   -rw-r--r-- 1        10 Jan 17 07:22 tmp/x2
> >   -rw-r--r-- 1       100 Jan 17 07:22 tmp/x3
> >   -rw-r--r-- 1      1000 Jan 17 07:22 tmp/x4
> >   -rw-r--r-- 1     10000 Jan 17 07:22 tmp/x5
> >   -rw-r--r-- 1    100000 Jan 17 07:22 tmp/x6
> >   -rw-r--r-- 1   1000000 Jan 17 07:22 tmp/x7
> >
> > Then filter files and directories greater/equal 4000:
> >
> >   $ src/du -B1 -a --size=4000 tmp | sort -k2
> >   106012672  tmp
> >   104861696  tmp/big_dir
> >   104857600  tmp/big_dir/big_file
> >   4096       tmp/empty_dir
> >   8192       tmp/small_dir
> >   4096       tmp/small_dir/small_file
> >   4096       tmp/x1
> >   4096       tmp/x2
> >   4096       tmp/x3
> >   4096       tmp/x4
> >   12288      tmp/x5
> >   102400     tmp/x6
> >   1003520    tmp/x7
> >
> > This included also the small files tmp/x1 while it left out
> > the empty file tmp/x0 ... but yet included the empty directory
> > tmp/empty_dir. This feels somehow counter-intuitive.
> >
> > Now let's use the "apparent size":
> >
> >   $ src/du -B1 -a --size=4000 --app tmp | sort -k2
> >   105985101  tmp
> >   104861696  tmp/big_dir
> >   104857600  tmp/big_dir/big_file
> >   4096       tmp/empty_dir
> >   4102       tmp/small_dir
> >   10000      tmp/x5
> >   100000     tmp/x6
> >   1000000    tmp/x7
> >
> > This is much better. Well, the empty directory still shows up
> > here (which might be different on a different file system),
> > but at least the small files have gone.
> >
> > Thus said, it seems that automatically applying --apparent
> > when -a and --size is specified would give a more "natural"
> > result.
> >
> > In practice, the users will probably only search for huge files
> > and directories, i.e. much greater than the file system's
> > block size, but even then they'd be trapped by forgetting the
> > --app option when it comes to sparse files:
> >
> >   $ src/truncate --size=1T tmp/sparse-1T
> >
> >   $ src/du -h -a --size=100M tmp
> >   100M  tmp/big_dir/big_file
> >   101M  tmp/big_dir
> >   102M  tmp
> >
> >   $ src/du -h -a --size=100M --app tmp
> >   100M  tmp/big_dir/big_file
> >   101M  tmp/big_dir
> >   1.0T  tmp/sparse-1T
> >   1.1T  tmp
> >
> > The only way out of this - probably only my - confusion would
> > be to prevent the use of the -a and the --size option together.
> > But this would artificially restrict the user's flexibility.
> >
> > Does anyone else have such a feeling, too?
>
> I think it's fine to have --size filtering what du outputs.
> I.E. have it just honor -a. Your info on the subject is clear enough:
>
> > +Please note that the @option{--size} option can be combined with the above
> > +@option{--apparent-size} option, and in this case would elide entries based on
> > +its apparent size. This makes most sense for files, i.e. when the @option{-a}
> > +is specified, too.
>
> I'd remove the last sentence above actually.
> The user may want to operate on the cumulative apparent size for dirs.
>
>
> >> I wonder would it make sense to have consistent --size
> >> handling for du and truncate. I.E. have --size='<10M'
> >> specify the max size and --size='>10M' specify the min size?
> >
> > I personally do not like shell-special characters in optargs
> > too much, as many users will forget to put it into quotes;
> > --size=<10M may not be a great problem, but --size=>10M
> > may destroy data.
>
> Yes I agree. Maybe we should enforce the '+',
> but then again maybe not since it means '>' in `find`,
> rather than '>='. For comparison as it stands:
>
>   find -size +1233  ≍ du -B512 -a --size '1234'
>   find -size +1233c ≍ du -a --size '1234'
>
>
> > I was rather thinking that to make it more consistent with
> > "find tmp -size +10M", or even to teach find a new -csize
> > (cumulative size) option ... as finding big directories was
> > the original problem. On the other side, 'find' doesn't offer
> > the flexibility to filter based on the block size, i.e. it
> > would always include huge sparse files although these do
> > not fill up the file system.
> >
> > Maybe the current implementation is still the better way ...
>
> +1
>
> thanks,
> Pádraig.
>
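
Coming back to the --size=[minsize]-[maxsize] idea from the top of this message, here is a rough sketch of how such an argument could be parsed. It is written in Python for brevity and is not a patch against du. The function names `parse_size`, `parse_range` and `in_range` are hypothetical, only the K/M/G/T suffixes (as powers of 1024) are handled, and treating both bounds as inclusive is my assumption.

```python
import re

# Assumption: single-letter suffixes are powers of 1024.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(text):
    """Parse a size such as '4K' or '100' into a byte count."""
    m = re.fullmatch(r"(\d+)([KMGT]?)", text)
    if m is None:
        raise ValueError(f"invalid size: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

def parse_range(arg):
    """Parse '[min]-[max]' into (min, max); a missing side is None."""
    lo_text, sep, hi_text = arg.partition("-")
    if not sep or (not lo_text and not hi_text):
        raise ValueError(f"invalid range: {arg!r}")
    lo = parse_size(lo_text) if lo_text else None
    hi = parse_size(hi_text) if hi_text else None
    if lo is not None and hi is not None and lo > hi:
        raise ValueError(f"min exceeds max in range: {arg!r}")
    return lo, hi

def in_range(size, bounds):
    """True if size lies within bounds (inclusive on both ends)."""
    lo, hi = bounds
    return (lo is None or size >= lo) and (hi is None or size <= hi)
```

With this, `parse_range("4K-")` gives a lower bound only, `parse_range("-8K")` an upper bound only, and `parse_range("4K-8K")` both. No shell-special characters are involved, so nothing needs quoting.
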