On Thu, Mar 29, 2012 at 7:00 PM, Mark Hills <[email protected]> wrote:
> Hi James, thanks for your suggestions. I've addressed the points
> inline, below.
>
> Perhaps I wasn't clear, this was an example case and I was hoping for
> a general solution in the find command itself (and surprised to not
> see one)

find can't protect you against the ARG_MAX limit, since this limit
applies to the shell's execution of find. If "find ./*/xx/*/yy"
exceeds ARG_MAX, then "/bin/echo ./*/xx/*/yy" does too. The limit
isn't specific to find.
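
To see where the limit bites, here is a quick sketch, assuming Linux,
bash and GNU coreutils, using the glob from your example:

    # The kernel's limit on the combined size of argv + environment
    # accepted by execve(2):
    getconf ARG_MAX

    # /bin/echo is an external command, so an oversized expansion
    # fails with "Argument list too long" before echo even runs:
    /bin/echo ./*/xx/*/yy

    # The shell builtin echo involves no execve(2) and so is not
    # subject to ARG_MAX:
    echo ./*/xx/*/yy
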
> On Thu, 29 Mar 2012, James Youngman wrote:
>
>> On Tue, Mar 27, 2012 at 3:24 PM, Mark Hills <[email protected]>
>> wrote:
>> > We traverse portions of our filesystem and apply a find action to
>> > them; currently by allowing the shell to expand the glob; eg.
>> >
>> > find ./*/xx/*/yy
>> >
>> > But the expansion can be large and problematic before being passed
>> > to find.
>>
>> I'm not sure what you mean by "problematic" here. It's possible I
>> suppose that the shell runs out of RAM in which to expand the glob,
>> or the results exceed ARG_MAX. I'll assume it's the latter for the
>> purpose of this reply; please correct me if this is the wrong
>> interpretation of what you meant.
>
> Yes, that's correct.
>
>> > To do the equivalent in find itself is slow.
>>
>> How slow? How much slower?
>>
>> > The whole hierarchy is traversed (which is slow), and only matching
>> > results displayed:
>> >
>> > find . -path './*/xx/*/yy'
>>
>> You don't state what the structure of your filesystem hierarchy is,
>> so it is hard to give entirely reliable advice here. I'm going to
>> guess a bit about things like the depth of the tree (which I'm going
>> to guess is large), the total number of files below "." (also large)
>> and the cardinalities of the expansions of "*" in the glob above
>> (also large).
>
> Yes, that's correct.
>
> Also, there are several directories alongside the 'xx' and 'yy'
> directories, themselves also large. It would be wasteful to traverse
> these; they cannot and do not match.
>
>> If that is your whole command line you are certainly using find in
>> an inefficient way. It's hard to say for sure since you don't state
>> what fraction of the whole filesystem hierarchy you need to visit,
>> or what the actions are. However, the predicates -mindepth,
>> -maxdepth, -prune and -quit can be used to limit or terminate the
>> filesystem search.
>
> The command is typically followed by checks against the (non-path)
> attributes; eg. -mtime etc. and then -print or -exec.
>
>> > Is there a way to have find itself only visit the relevant
>> > portions of the filesystem?
>>
>> Certainly. If I knew quite what you meant by "relevant" I could
>> provide a more useful response. Instead I will provide some
>> examples.
>
> To clarify, by "relevant" I meant those which match or could
> potentially match the pattern. At the moment it scans the whole
> hierarchy.
>
>> We start with your original command, which you state as problematic:
>>
>> $ find ./*/xx/*/yy
>>
>> I'm going to assume you really meant you use
>>
>> $ find ./*/xx/*/yy -actions
>>
>> where -actions is some non-empty mixture of find predicates and
>> actions. If -actions already includes -mindepth, -maxdepth, -prune
>> or (most awkwardly) -quit, some of the examples below are going to
>> need adjustment.
>>
>> The simplest rearrangement is
>>
>> $ for start in ./*/xx; do
>>       find "${start}"/*/yy -actions
>>   done
>>
>> This will dramatically cut down the number of arguments passed to
>> each invocation of find, and so may be enough by itself to form a
>> satisfactory solution to your problem.
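
(An aside for concreteness: with the kind of tests you mention above,
such as -mtime followed by -print, that first rearrangement might look
like the sketch below, where -mtime -7 is only an illustrative
placeholder:

    for start in ./*/xx; do
        find "${start}"/*/yy -mtime -7 -print
    done

Each find invocation now sees only the arguments from one
"${start}"/*/yy expansion, rather than the whole ./*/xx/*/yy expansion
at once.)
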
>> If the argument count is still too large you could also try:
>>
>> $ for start in ./*/xx; do
>>       for sub in "${start}"/*/yy; do
>>           find "${sub}" -actions
>>       done
>>   done
>>
>> If you still have a problem with this second option, it's likely
>> that one of the "*"s expands to a sufficiently large list that
>> ARG_MAX is still exceeded. You can overcome this by transforming
>> the loop into find predicates. I'll do this with only the inner
>> loop for simplicity:
>>
>> for sub in "${start}"/*/yy; do
>>     find "${sub}" -actions
>> done
>>
>> becomes
>>
>> find "${start}" -mindepth 2 \( -depth 2 \! -name yy -prune , -true \)
>> -actions
>>
>> If -actions contains tests like -depth, options like -mindepth or
>> -maxdepth, then some adjustment will be needed there.
>
> Thanks for the examples -- you have understood my explanation
> correctly.
>
> I assume that the examples confirm that this kind of selective
> traversal cannot be done in find itself?
>
> My similar solution (which automatically interprets the wildcard)
> was to wrap the glob(3) function in a command which outputs to
> stdout, and use that; eg.
>
> glob './*/xx/*/yy' | xargs -n 100 -I'!!' -- find '!!' -actions
>
> But this, like the examples, is rather unwieldy!
>
>> > The manual [1] seems to suggest using locate and xargs. Keeping an
>> > index is not practical for us,
>>
>> I assume because either the tree changes frequently and multiple
>> independent locate indexes would be no help (since all parts of the
>> tree change frequently).
>
> The tree changes frequently, and is too large (millions of files) to
> feasibly index within usable time.
>
>> > so I wrote a simple command around the glob(3) function to do the
>> > traversal and print to stdout. Am I missing some well established
>> > method here?
>>
>> It's difficult to give a definitive answer here since you don't
>> state what you're actually trying to achieve. I hope the above was
>> useful anyway.
>
> Useful suggestion, thank you.
>
> I'm trying to achieve the functionality of glob match within the find
> command.

It already has this; you used it. The search can be limited to just a
part of the tree using the techniques quoted above.
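
Incidentally, if the main motivation for the glob wrapper you describe
is to keep the expansion out of an execve(2) call, a shell whose
printf is a builtin (bash, for example) can do the same job without a
custom binary. A sketch, assuming GNU xargs; -actions again stands for
your real predicates:

    # printf is a builtin, so the expanded glob never crosses the
    # execve(2) boundary and ARG_MAX does not apply to it.  The NUL
    # delimiters keep unusual filenames safe.
    printf '%s\0' ./*/xx/*/yy |
        xargs -0 sh -c 'find "$@" -actions' sh

xargs splits the starting points into as many find invocations as
needed; the trailing "sh" becomes $0 of the inline script, and "$@"
carries one batch of paths per invocation.
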
> Thanks
>
> --
> Mark