Re: [dev] Re: coreutils / moreutils - DC a directory counter
On Mon, Jul 29, 2013 at 6:39 AM, Paul Hoffman nkui...@nkuitse.com wrote:
> I just read something about using LD_PRELOAD for this. Write a library that implements open(2), munging the file path and then calling the real open(2). Then you just set LD_PRELOAD in the environment of the scripts and Bob's your uncle.

Nice how I find this idea again here, after the BEAUTIFUL day when systemd munged my /tmp path, leaving me puzzled about what the hell was going on... rage.

Otoh, the munging could work, unless the hashed filesystem would overflow PATH_MAX. Measured against the amount of sanity in the system which the OP announced, this could be solved by injecting slashes into the filename.

cheers!
mar77i
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On Mon 29 Jul 2013 04:39, Paul Hoffman wrote:
> Their 100+ Perl and bash scripts are slow because they're opening files in a humongous directory. They can't subdivide the directory because they're afraid that they will break the scripts when modifying them.

I posted a comprehensive comment on the blog post that has yet to be approved by the censor.

In short, ext2/3 directories are linked lists. You can traverse said list in constant space and process each entry as you encounter it. O(n) time is unavoidable. Bash globs and ls listings are automatically sorted. Dash, ls -f and find . -type f do not sort. Switching to dash might result in all sorts of compatibility issues, but s/ls/ls -f/g is easy to test, and just might work. And then s/ \* / \$(ls -f) /g (assuming old regex, \n in $IFS).

Dividing the directory into folders requires structural changes and merely contains the scalability issue, whereas simply not sorting avoids it. Sorting not only takes up to O(n^2) time, but also requires searching the linked list for every single entry. That's equal to traversing half the list n times, instead of all of the list just once.

http://ext2.sourceforge.net/2005-ols/paper-html/node3.html

P.S. This just might be my favorite regex: s/ \* / \$\(ls -f\) /g.
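The constant-space traversal described above can be sketched in C with opendir(3) and readdir(3). This is only a minimal illustration, not the dc utility discussed in this thread; the function name count_entries is made up for the example, and it counts every entry except "." and "..", whatever its type.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: count the entries of dir without sorting or
 * storing them. Returns the count, or -1 if the directory cannot be
 * opened or read. */
long count_entries(const char *dir)
{
    DIR *d;
    struct dirent *e;
    long n = 0;

    assert(dir != NULL);

    d = opendir(dir);
    if (d == NULL)
        return -1;

    for (;;) {
        errno = 0;              /* lets us tell end-of-directory from an error */
        e = readdir(d);
        if (e == NULL)
            break;
        /* Skip the two entries every directory has. */
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        n++;                    /* each entry is handled as it is encountered */
    }

    if (errno != 0) {           /* readdir failed part-way through */
        closedir(d);
        return -1;
    }
    closedir(d);
    return n;
}
```

Calling count_entries() from a small main() and printing the result should give roughly what ls -f | wc -l reports, minus the two dot entries. Nothing is sorted or kept around, so memory use does not grow with the size of the directory.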
Re: [dev] Re: coreutils / moreutils - DC a directory counter
Bjartur Thorlacius dixit:
> by the censor. In short, ext2/3 directories are linked lists. You can traverse

Are they, still? I thought they had the equivalent of UFS_DIRHASH nowadays…

bye,
//mirabilos
--
[...] if maybe ext3fs wasn't a better pick, or jfs, or maybe reiserfs, oh but what about xfs, and if only i had waited until reiser4 was ready... in the beginning, there was ffs, and in the middle, there was ffs, and at the end, there was still ffs, and the sys admins knew it was good. :)
	-- Ted Unangst on *fs
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On Mon 29 Jul 2013 11:38, Thorsten Glaser wrote:
> Are they, still? I thought they had the equivalent of UFS_DIRHASH nowadays…

Ext4 does, optionally.
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On Fri, Jul 26, 2013 at 01:17:54AM +, Thorsten Glaser wrote:
> Calvin Morrison dixit:
> > I was sick of ls | wc -l being so damned slow on large directories, so
>
> What, besides the printing and sorting, is the slow part anyway? Is it the VFS API or just the filesystem code? In the latter case… could workarounds exist?
>
> Someone asked this… http://fenski.pl/2013/07/looking-for-a-specific-fuse-based-filesystem/ … on Planet Debian this night.

Summarized: Their 100+ Perl and bash scripts are slow because they're opening files in a humongous directory. They can't subdivide the directory because they're afraid that they will break the scripts when modifying them.

I just read something about using LD_PRELOAD for this. Write a library that implements open(2), munging the file path and then calling the real open(2). Then you just set LD_PRELOAD in the environment of the scripts and Bob's your uncle.

Don't shoot me, I have no idea whether that's a good idea or not!

Paul.
--
Paul Hoffman nkui...@nkuitse.com
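A minimal sketch of the LD_PRELOAD idea, assuming Linux with glibc. Everything specific here is hypothetical: the directory /data/big/, the two-character bucket scheme (a slash injected after the first two characters of the file name), and the helper name munge_path. Instead of looking up the next open with dlsym(RTLD_NEXT, "open"), which is the more common pattern, this version issues the openat system call directly so the shim needs no extra library. It only intercepts open(2); a real shim would presumably also have to cover open64, openat, stat and friends.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: the huge flat directory whose files were moved into buckets. */
#define BIG_DIR "/data/big/"

/* Rewrite BIG_DIR "name" to BIG_DIR "na/name". Returns 0 on success, -1 if
 * path is not a plain file name directly inside BIG_DIR or if the result
 * does not fit into out (the PATH_MAX concern). */
int munge_path(const char *path, char *out, size_t outsz)
{
    size_t plen = strlen(BIG_DIR);
    const char *name;
    int n;

    assert(path != NULL && out != NULL);

    if (strncmp(path, BIG_DIR, plen) != 0)
        return -1;
    name = path + plen;
    /* Leave dot entries, very short names and already-nested paths alone. */
    if (name[0] == '.' || strlen(name) < 2 || strchr(name, '/') != NULL)
        return -1;
    n = snprintf(out, outsz, "%s%.2s/%s", BIG_DIR, name, name);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}

/* Replacement for open(2): munge the path if it matches, then make the
 * real system call. */
int open(const char *path, int flags, ...)
{
    char buf[PATH_MAX];
    mode_t mode = 0;
    int needs_mode = (flags & O_CREAT) != 0;

#ifdef O_TMPFILE
    if ((flags & O_TMPFILE) == O_TMPFILE)
        needs_mode = 1;
#endif
    if (needs_mode) {           /* the third argument exists only in these cases */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    if (munge_path(path, buf, sizeof buf) == 0)
        path = buf;
    return (int)syscall(SYS_openat, AT_FDCWD, path, flags, mode);
}
```

Built as a shared object (for example cc -shared -fPIC -o munge.so munge.c, with both file names made up here) and named in LD_PRELOAD in the scripts' environment, this would redirect opens of /data/big/abcdef to /data/big/ab/abcdef without touching the scripts. Paths that would not fit into PATH_MAX are passed through unchanged rather than truncated.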
Re: [dev] Re: coreutils / moreutils - DC a directory counter
Calvin Morrison dixit:
> It's called unionfs if I recall

No. Go read it again.

> On Jul 25, 2013 9:28 PM, Thorsten Glaser t...@mirbsd.de wrote:

And stop top-posting and full-quoting. Read http://www.afaik.de/usenet/faq/zitieren/ (it’s in German, English and Dutch, so no excuses).

bye,
//mirabilos
--
find both emacs and vi nauseating (joe rules) and consider pine the only usable text-mode mail client (and I have tried them all). ;)
Hello, I'm Holger (Hello Holger!), and I am likewise ... a pine user, and a habitual one at that (Oooohhh). [from dasr]
Re: [dev] Re: coreutils / moreutils - DC a directory counter
Yes master

On Jul 26, 2013 3:40 AM, Thorsten Glaser t...@mirbsd.de wrote:
> Calvin Morrison dixit:
> > It's called unionfs if I recall
>
> No. Go read it again.
>
> > On Jul 25, 2013 9:28 PM, Thorsten Glaser t...@mirbsd.de wrote:
>
> And stop top-posting and full-quoting. Read http://www.afaik.de/usenet/faq/zitieren/ (it’s in German, English and Dutch, so no excuses).
>
> bye,
> //mirabilos
> --
> find both emacs and vi nauseating (joe rules) and consider pine the only usable text-mode mail client (and I have tried them all). ;)
> Hello, I'm Holger (Hello Holger!), and I am likewise ... a pine user, and a habitual one at that (Oooohhh). [from dasr]
[dev] Re: coreutils / moreutils - DC a directory counter
Calvin Morrison dixit:
> I was sick of ls | wc -l being so damned slow on large directories, so

What, besides the printing and sorting, is the slow part anyway? Is it the VFS API or just the filesystem code? In the latter case… could workarounds exist?

Someone asked this… http://fenski.pl/2013/07/looking-for-a-specific-fuse-based-filesystem/ … on Planet Debian this night. Something to think about.

(No further input from me, besides mumbling that I had a vague idea of a similar concept and wouldn’t be surprised if something like that already existed, and probably only in the Plan 9 world…)

bye,
//mirabilos
--
(gnutls can also be used, but if you are compiling lynx for your own use, there is no reason to consider using that package)
	-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL
Re: [dev] Re: coreutils / moreutils - DC a directory counter
It's called unionfs if I recall

On Jul 25, 2013 9:28 PM, Thorsten Glaser t...@mirbsd.de wrote:
> Calvin Morrison dixit:
> > I was sick of ls | wc -l being so damned slow on large directories, so
>
> What, besides the printing and sorting, is the slow part anyway? Is it the VFS API or just the filesystem code? In the latter case… could workarounds exist?
>
> Someone asked this… http://fenski.pl/2013/07/looking-for-a-specific-fuse-based-filesystem/ … on Planet Debian this night. Something to think about.
>
> (No further input from me, besides mumbling that I had a vague idea of a similar concept and wouldn’t be surprised if something like that already existed, and probably only in the Plan 9 world…)
>
> bye,
> //mirabilos
> --
> (gnutls can also be used, but if you are compiling lynx for your own use, there is no reason to consider using that package)
> 	-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On 7/17/13, Calvin Morrison mutanttur...@gmail.com wrote:
> On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:
> > What's the bottle neck here?
>
> Looking up the filenames and reading them, printing them to standard out and then wc parsing for all the \n characters.

Most ls implementations also sort the list of filenames by default. Is that the bottleneck?

> > (Or is your dc only faster because the directory index is in cache now...)
>
> No that's not why:
>
> calvin@ecoli:~/big_folder ls 2v1 | wc -l
> 687560
> real 0m7.678s
> user 0m7.313s
> sys  0m0.579s
>
> calvin@ecoli:~/big_folder time dc 2v1
> 687560
> real 0m0.138s
> user 0m0.055s
> sys  0m0.082s
>
> calvin@ecoli:~/big_folder time ls 2v1 | wc -l
> 687560
> real 0m7.672s
> user 0m7.310s
> sys  0m0.580s

Um. How often are you going to use this new C program which saves only 7.5 seconds?

Robert Ransom
Re: [dev] Re: coreutils / moreutils - DC a directory counter
* Calvin Morrison mutanttur...@gmail.com [2013-07-17 16:43:00 -0400]:
> On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:
> > > calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
> > > 687560
> > > real 0m7.798s
> > > user 0m7.317s
> > > sys  0m0.700s
> > >
> > > calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
> > > 687560
> > > real 0m0.138s
> > > user 0m0.057s
> > > sys  0m0.081s
> > >
> > > What do you think?
> > > Calvin
> >
> > What's the bottle neck here?
>
> Looking up the filenames and reading them, printing them to standard out and then wc parsing for all the \n characters.

if it's coreutils ls|wc then most of the time is locale specific code (strcoll and encoding related), try

export LC_ALL=C
ls -f |wc -l
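To illustrate the kind of cost being pointed at here, a sketch of sorting the same list of names with strcmp(3) versus strcoll(3). The helper name sort_names is made up for the example, and this is not a claim about how coreutils ls is implemented internally. In the C locale both comparisons order by byte value; after setlocale(LC_COLLATE, "") under, say, a UTF-8 locale, strcoll has to apply the locale's collation rules on every comparison.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Plain byte-wise comparison, what LC_ALL=C effectively gives you. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Locale-aware comparison; its cost depends on the current LC_COLLATE. */
static int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(char *const *)a, *(char *const *)b);
}

/* Hypothetical helper: sort n names in place, with or without collation. */
void sort_names(char **names, size_t n, int use_locale)
{
    assert(names != NULL || n == 0);
    if (n == 0)
        return;
    qsort(names, n, sizeof *names, use_locale ? cmp_locale : cmp_bytes);
}
```

Timing both variants on a few hundred thousand names under a non-C locale would be one way to check whether collation accounts for most of the 7 seconds reported in the thread.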
[dev] Re: coreutils / moreutils - DC a directory counter
Calvin Morrison mutanttur...@gmail.com writes:
> Hi guys,
>
> I came up with a utility[0] that i think could be useful, and I sent it to the moreutils page, but maybe it might fit better here. All it does is give a count of files in a directory.
>
> I was sick of ls | wc -l being so damned slow on large directories, so I thought a more direct solution would be better.
>
> calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
> 687560
> real 0m7.798s
> user 0m7.317s
> sys  0m0.700s
>
> calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
> 687560
> real 0m0.138s
> user 0m0.057s
> sys  0m0.081s
>
> What do you think?
> Calvin

What's the bottle neck here?

(Or is your dc only faster because the directory index is in cache now...)

--
Christian Neukirchen chneukirc...@gmail.com http://chneukirchen.org
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:
> Calvin Morrison mutanttur...@gmail.com writes:
> > Hi guys,
> >
> > I came up with a utility[0] that i think could be useful, and I sent it to the moreutils page, but maybe it might fit better here. All it does is give a count of files in a directory.
> >
> > I was sick of ls | wc -l being so damned slow on large directories, so I thought a more direct solution would be better.
> >
> > calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
> > 687560
> > real 0m7.798s
> > user 0m7.317s
> > sys  0m0.700s
> >
> > calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
> > 687560
> > real 0m0.138s
> > user 0m0.057s
> > sys  0m0.081s
> >
> > What do you think?
> > Calvin
>
> What's the bottle neck here?

Looking up the filenames and reading them, printing them to standard out and then wc parsing for all the \n characters.

> (Or is your dc only faster because the directory index is in cache now...)

No that's not why:

calvin@ecoli:~/big_folder ls 2v1 | wc -l
687560
real 0m7.678s
user 0m7.313s
sys  0m0.579s

calvin@ecoli:~/big_folder time dc 2v1
687560
real 0m0.138s
user 0m0.055s
sys  0m0.082s

calvin@ecoli:~/big_folder time ls 2v1 | wc -l
687560
real 0m7.672s
user 0m7.310s
sys  0m0.580s

> --
> Christian Neukirchen chneukirc...@gmail.com http://chneukirchen.org
[dev] Re: coreutils / moreutils - DC a directory counter
Calvin Morrison mutanttur...@gmail.com writes:
> On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:
> > Calvin Morrison mutanttur...@gmail.com writes:
> > > Hi guys,
> > >
> > > I came up with a utility[0] that i think could be useful, and I sent it to the moreutils page, but maybe it might fit better here. All it does is give a count of files in a directory.
> > >
> > > I was sick of ls | wc -l being so damned slow on large directories, so I thought a more direct solution would be better.
> > >
> > > calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
> > > 687560
> > > real 0m7.798s
> > > user 0m7.317s
> > > sys  0m0.700s
> > >
> > > calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
> > > 687560
> > > real 0m0.138s
> > > user 0m0.057s
> > > sys  0m0.081s
> > >
> > > What do you think?
> > > Calvin
> >
> > What's the bottle neck here?
>
> Looking up the filenames and reading them, printing them to standard out and then wc parsing for all the \n characters.
>
> > (Or is your dc only faster because the directory index is in cache now...)
>
> No that's not why:
>
> calvin@ecoli:~/big_folder ls 2v1 | wc -l
> 687560
> real 0m7.678s
> user 0m7.313s
> sys  0m0.579s
>
> calvin@ecoli:~/big_folder time dc 2v1
> 687560
> real 0m0.138s
> user 0m0.055s
> sys  0m0.082s
>
> calvin@ecoli:~/big_folder time ls 2v1 | wc -l
> 687560
> real 0m7.672s
> user 0m7.310s
> sys  0m0.580s

How fast is find 2v1 -printf x | wc -c ?

--
Christian Neukirchen chneukirc...@gmail.com http://chneukirchen.org
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On 17 July 2013 16:58, Christian Neukirchen chneukirc...@gmail.com wrote:
> Calvin Morrison mutanttur...@gmail.com writes:
> > On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:
> > > Calvin Morrison mutanttur...@gmail.com writes:
> > > > Hi guys,
> > > >
> > > > I came up with a utility[0] that i think could be useful, and I sent it to the moreutils page, but maybe it might fit better here. All it does is give a count of files in a directory.
> > > >
> > > > I was sick of ls | wc -l being so damned slow on large directories, so I thought a more direct solution would be better.
> > > >
> > > > calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
> > > > 687560
> > > > real 0m7.798s
> > > > user 0m7.317s
> > > > sys  0m0.700s
> > > >
> > > > calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
> > > > 687560
> > > > real 0m0.138s
> > > > user 0m0.057s
> > > > sys  0m0.081s
> > > >
> > > > What do you think?
> > > > Calvin
> > >
> > > What's the bottle neck here?
> >
> > Looking up the filenames and reading them, printing them to standard out and then wc parsing for all the \n characters.
> >
> > > (Or is your dc only faster because the directory index is in cache now...)
> >
> > No that's not why:
> >
> > calvin@ecoli:~/big_folder ls 2v1 | wc -l
> > 687560
> > real 0m7.678s
> > user 0m7.313s
> > sys  0m0.579s
> >
> > calvin@ecoli:~/big_folder time dc 2v1
> > 687560
> > real 0m0.138s
> > user 0m0.055s
> > sys  0m0.082s
> >
> > calvin@ecoli:~/big_folder time ls 2v1 | wc -l
> > 687560
> > real 0m7.672s
> > user 0m7.310s
> > sys  0m0.580s
>
> How fast is find 2v1 -printf x | wc -c ?
>
> --
> Christian Neukirchen chneukirc...@gmail.com http://chneukirchen.org

time find 2v1 -printf x | wc -c
687561
real 0m0.531s
user 0m0.264s
sys  0m0.271s

time ls 2v1 > /dev/null
real 0m7.642s
user 0m7.265s
sys  0m0.375s

So it seems a good deal of that time is ls
Re: [dev] Re: coreutils / moreutils - DC a directory counter
On 07/17/2013 09:02 PM, Calvin Morrison wrote:
> So it seems a good deal of that time is ls

Wait, sbase ls doesn't seem to implement -f. Are you sorting the directory entries?