Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-08-06 Thread Martti Kühne
On Mon, Jul 29, 2013 at 6:39 AM, Paul Hoffman nkui...@nkuitse.com wrote:

 I just read something about using LD_PRELOAD for this.  Write a library
 that implements open(2), munging the file path and then calling the
 real open(2).  Then you just set LD_PRELOAD in the environment of the
 scripts and Bob's your uncle.



Nice how I find this idea again here, after the BEAUTIFUL day when
systemd munged my /tmp path, leaving me puzzled about what the hell
was going on... rage.
OTOH, the munging could work, unless the hashed path would
overflow PATH_MAX. Judging by the amount of sanity in the system
the OP described, this could be solved by injecting slashes into the
filename.

cheers!
mar77i



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-29 Thread Bjartur Thorlacius

On Mon 29 Jul 2013 04:39, Paul Hoffman wrote:

 Their 100+ Perl and bash scripts are slow because they're opening files
 in a humongous directory.  They can't subdivide the directory because
 they're afraid that they will break the scripts when modifying them.

I posted a comprehensive comment on the blog post that has yet to be
approved by the censor. In short, ext2/3 directories are linked lists.
You can traverse said list in constant space and process each entry when
you encounter it. O(n) time is unavoidable. Bash globs and ls listings
are automatically sorted. Dash and ls -f or find . -type f don't sort.
Switching to dash might result in all sorts of compatibility issues, but
s/ls/ls -f/g is easy to test, and just might work. And then
s/ \* / \$(ls -f) /g (assuming old regex, \n in $IFS).


Dividing the directory into subdirectories requires structural changes and
only contains the scalability issue, whereas simply not sorting avoids it.
Sorting not only takes up to O(n^2) time, but also requires searching for
every single entry in the linked list. That's equal to traversing half the
list n times, instead of all of the list just once.


http://ext2.sourceforge.net/2005-ols/paper-html/node3.html

P.S. This just might be my favorite regex: s/ \* / \$\(ls -f\) /g.
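
A minimal sketch of the "process each entry when you encounter it" idea,
assuming a POSIX shell and file names without newlines. The helper name
for_each_unsorted is hypothetical and not from the thread. Note that
ls -f also lists dot files, which a plain * glob would skip.

```shell
# for_each_unsorted DIR CMD [ARGS...]
# Run CMD once per entry of DIR, in whatever order the directory yields them.
# Hypothetical helper; assumes file names contain no newline characters.
for_each_unsorted() {
    dir=$1
    shift
    # ls -f does not sort and implies -a, so "." and ".." show up as well.
    ls -f -- "$dir" | while IFS= read -r name; do
        case $name in
            .|..) continue ;;   # skip the directory itself and its parent
        esac
        "$@" "$dir/$name"
    done
}
```

For example, for_each_unsorted big_folder wc -c would run wc -c on each
entry without sorting the listing first.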



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-29 Thread Thorsten Glaser
Bjartur Thorlacius dixit:

 by the censor. In short, ext2/3 directories are linked lists. You can traverse

Are they, still? I thought they had the equivalent of UFS_DIRHASH
nowadays…

bye,
//mirabilos
-- 
[...] if maybe ext3fs wasn't a better pick, or jfs, or maybe reiserfs, oh but
what about xfs, and if only i had waited until reiser4 was ready... in the be-
ginning, there was ffs, and in the middle, there was ffs, and at the end, there
was still ffs, and the sys admins knew it was good. :)  -- Ted Unangst on *fs



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-29 Thread Bjartur Thorlacius

On Mon 29 Jul 2013 11:38, Thorsten Glaser wrote:

Are they, still? I thought they had the equivalent of UFS_DIRHASH
nowadays…

Ext4 does, optionally.



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-28 Thread Paul Hoffman
On Fri, Jul 26, 2013 at 01:17:54AM +, Thorsten Glaser wrote:
 Calvin Morrison dixit:
 
 I was sick of ls | wc -l being so damned slow on large directories, so
 
 What, besides the printing and sorting, is the slow part anyway?
 Is it the VFS API or just the filesystem code?
 
 In the latter case… could workarounds exist? Someone asked this…
 http://fenski.pl/2013/07/looking-for-a-specific-fuse-based-filesystem/
 … on Planet Debian this night.

Summarized:

Their 100+ Perl and bash scripts are slow because they're opening files 
in a humongous directory.  They can't subdivide the directory because 
they're afraid that they will break the scripts when modifying them.

I just read something about using LD_PRELOAD for this.  Write a library 
that implements open(2), munging the file path and then calling the 
real open(2).  Then you just set LD_PRELOAD in the environment of the 
scripts and Bob's your uncle.

Don't shoot me, I have no idea whether that's a good idea or not!

Paul.

-- 
Paul Hoffman nkui...@nkuitse.com
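
A rough sketch of the path munging such a shim might do, written in shell
only to illustrate the mapping. A real LD_PRELOAD library would be C that
wraps open(2) and looks up the real function with dlsym(RTLD_NEXT, "open").
The sharding scheme (one subdirectory named after the first two characters
of the file name) and the name munge_path are assumptions, not something
proposed in the thread.

```shell
# munge_path PATH
# Map "dir/name" to "dir/XX/name", where XX is the first two characters of
# name. Hypothetical scheme, for illustration only.
munge_path() {
    dir=$(dirname "$1")
    name=$(basename "$1")
    # First two characters of the file name pick the subdirectory.
    prefix=$(printf '%s' "$name" | cut -c1-2)
    printf '%s/%s/%s\n' "$dir" "$prefix" "$name"
}
```

With this mapping, munge_path /data/big/report.txt prints
/data/big/re/report.txt, so the scripts could keep using the flat path
while the files live in much smaller directories.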



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-26 Thread Thorsten Glaser
Calvin Morrison dixit:

It's called unionfs if I recall

No. Go read it again.

On Jul 25, 2013 9:28 PM, Thorsten Glaser t...@mirbsd.de wrote:

And stop top-posting and full-quoting.

Read http://www.afaik.de/usenet/faq/zitieren/ (it’s in
German, English and Dutch, so no excuses).

bye,
//mirabilos
-- 
 find both emacs and vi revolting (joe rules) and consider pine the only
 usable text-mode mail client (and I have tried them all). ;)
Hello, I'm Holger (Hello Holger!), and I am likewise
... a pine user, and a habitual one at that (Oooohhh).  [from dasr]



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-26 Thread Calvin Morrison
Yes master




[dev] Re: coreutils / moreutils - DC a directory counter

2013-07-25 Thread Thorsten Glaser
Calvin Morrison dixit:

I was sick of ls | wc -l being so damned slow on large directories, so

What, besides the printing and sorting, is the slow part anyway?
Is it the VFS API or just the filesystem code?

In the latter case… could workarounds exist? Someone asked this…
http://fenski.pl/2013/07/looking-for-a-specific-fuse-based-filesystem/
… on Planet Debian this night.

Something to think about. (No further input from me, besides
mumbling that I had a vague idea of a similar concept and wouldn’t
be surprised if something like that already existed, and probably
only in the Plan 9 world…)

bye,
//mirabilos
-- 
(gnutls can also be used, but if you are compiling lynx for your own use,
there is no reason to consider using that package)
-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-25 Thread Calvin Morrison
It's called unionfs if I recall




Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-18 Thread Robert Ransom
On 7/17/13, Calvin Morrison mutanttur...@gmail.com wrote:
 On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:

 What's the bottleneck here?

 Looking up the filenames and reading them, printing them to standard
 out and then wc parsing for all the \n characters.

Most ls implementations also sort the list of filenames by default.
Is that the bottleneck?

 (Or is your dc only faster because the directory index is in cache
 now...)

 No that's not why:

 calvin@ecoli:~/big_folder ls 2v1 | wc -l
 687560

 real 0m7.678s
 user 0m7.313s
 sys 0m0.579s

 calvin@ecoli:~/big_folder time dc 2v1
 687560

 real 0m0.138s
 user 0m0.055s
 sys 0m0.082s

 calvin@ecoli:~/big_folder time ls 2v1 | wc -l
 687560

 real 0m7.672s
 user 0m7.310s
 sys 0m0.580s

Um.  How often are you going to use this new C program which saves
only 7.5 seconds?


Robert Ransom



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-18 Thread Szabolcs Nagy
* Calvin Morrison mutanttur...@gmail.com [2013-07-17 16:43:00 -0400]:
 On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:
  calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
  687560
 
  real 0m7.798s
  user 0m7.317s
  sys 0m0.700s
 
  calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
  687560
 
  real 0m0.138s
  user 0m0.057s
  sys 0m0.081s
 
  What do you think?
  Calvin
 
  What's the bottleneck here?
 
 Looking up the filenames and reading them, printing them to standard
 out and then wc parsing for all the \n characters.
 

if it's coreutils ls|wc then most of the time is
locale specific code (strcoll and encoding related),
try

export LC_ALL=C
ls -f |wc -l
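
The suggestion above could be wrapped up roughly like this. This is only a
sketch: count_entries is a made-up name, and it assumes that ls -f lists
"." and ".." (as -f implies -a) and that no file name contains a newline.
Unlike plain ls | wc -l, it also counts dot files.

```shell
# count_entries DIR
# Count the entries of DIR without sorting and without locale-aware collation.
# Hypothetical helper; assumes no file name contains a newline.
count_entries() {
    # LC_ALL=C sidesteps strcoll and encoding handling; -f disables sorting
    # and implies -a, so "." and ".." are part of the listing.
    total=$(LC_ALL=C ls -f -- "$1" | wc -l)
    # Subtract the "." and ".." entries.
    echo $((total - 2))
}
```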





[dev] Re: coreutils / moreutils - DC a directory counter

2013-07-17 Thread Christian Neukirchen
Calvin Morrison mutanttur...@gmail.com writes:

 Hi guys,

 I came up with a utility[0] that i think could be useful, and I sent
 it to the moreutils page, but maybe it might fit better here. All it
 does is give a count of files in a directory.

 I was sick of ls | wc -l being so damned slow on large directories, so
 I thought a more direct solution would be better.

 calvin@ecoli:~/big_folder time ls file2v1dir/ | wc -l
 687560

 real 0m7.798s
 user 0m7.317s
 sys 0m0.700s

 calvin@ecoli:~/big_folder time ~/bin/dc file2v1dir/
 687560

 real 0m0.138s
 user 0m0.057s
 sys 0m0.081s

 What do you think?
 Calvin

What's the bottleneck here?

(Or is your dc only faster because the directory index is in cache now...)

-- 
Christian Neukirchen  chneukirc...@gmail.com  http://chneukirchen.org




Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-17 Thread Calvin Morrison
On 17 July 2013 16:32, Christian Neukirchen chneukirc...@gmail.com wrote:

 What's the bottleneck here?

Looking up the filenames and reading them, printing them to standard
out and then wc parsing for all the \n characters.

 (Or is your dc only faster because the directory index is in cache now...)

No that's not why:

calvin@ecoli:~/big_folder ls 2v1 | wc -l
687560

real 0m7.678s
user 0m7.313s
sys 0m0.579s

calvin@ecoli:~/big_folder time dc 2v1
687560

real 0m0.138s
user 0m0.055s
sys 0m0.082s

calvin@ecoli:~/big_folder time ls 2v1 | wc -l
687560

real 0m7.672s
user 0m7.310s
sys 0m0.580s






[dev] Re: coreutils / moreutils - DC a directory counter

2013-07-17 Thread Christian Neukirchen
Calvin Morrison mutanttur...@gmail.com writes:

 No that's not why:

 calvin@ecoli:~/big_folder ls 2v1 | wc -l
 687560

 real 0m7.678s
 user 0m7.313s
 sys 0m0.579s

 calvin@ecoli:~/big_folder time dc 2v1
 687560

 real 0m0.138s
 user 0m0.055s
 sys 0m0.082s

 calvin@ecoli:~/big_folder time ls 2v1 | wc -l
 687560

 real 0m7.672s
 user 0m7.310s
 sys 0m0.580s

How fast is  find 2v1 -printf x | wc -c  ?

-- 
Christian Neukirchen  chneukirc...@gmail.com  http://chneukirchen.org




Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-17 Thread Calvin Morrison
On 17 July 2013 16:58, Christian Neukirchen chneukirc...@gmail.com wrote:
 How fast is  find 2v1 -printf x | wc -c  ?




time find 2v1 -printf x | wc -c
687561

real 0m0.531s
user 0m0.264s
sys 0m0.271s


time ls 2v1 > /dev/null

real 0m7.642s
user 0m7.265s
sys 0m0.375s

So it seems a good deal of that time is ls
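
The find count above is one higher than the ls and dc counts, most likely
because find also prints the starting directory itself. A sketch that leaves
it out, assuming GNU find (-printf is a GNU extension) and using the made-up
name count_find:

```shell
# count_find DIR
# Count the direct entries of DIR with find, printing one "x" per entry.
# Hypothetical helper; needs a find that supports -printf (e.g. GNU find).
count_find() {
    # -mindepth 1 skips DIR itself; -maxdepth 1 keeps find from descending
    # into subdirectories, so only direct entries are counted.
    find "$1" -mindepth 1 -maxdepth 1 -printf x | wc -c
}
```

Because it counts characters rather than lines, this is not thrown off by
newlines in file names. It counts dot files too.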



Re: [dev] Re: coreutils / moreutils - DC a directory counter

2013-07-17 Thread Bjartur Thorlacius

On 07/17/2013 09:02 PM, Calvin Morrison wrote:

So it seems a good deal of that time is ls

Wait, sbase ls doesn't seem to implement -f. Are you sorting the 
directory entries?