On Thu, Mar 29, 2012 at 7:00 PM, Mark Hills <[email protected]> wrote:
> Hi James, thanks for your suggestions. I've addressed the points
> inline, below.
>
> Perhaps I wasn't clear, this was an example case and I was hoping for
> a general solution in the find command itself (and surprised to not
> see one)

find can't protect you against the ARG_MAX limit, since this limit
applies to the shell's execution of find. If "find ./*/xx/*/yy"
exceeds ARG_MAX, then "/bin/echo ./*/xx/*/yy" does too. The limit
isn't specific to find.
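
To see where the limit bites, here is a quick sketch, assuming Linux,
bash and GNU coreutils, using the glob from your example:

    # The kernel's limit on the combined size of argv + environment
    # accepted by execve(2):
    getconf ARG_MAX

    # /bin/echo is an external command, so an oversized expansion
    # fails with "Argument list too long" before echo even runs:
    /bin/echo ./*/xx/*/yy

    # The shell builtin echo involves no execve(2) and so is not
    # subject to ARG_MAX:
    echo ./*/xx/*/yy
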
> On Thu, 29 Mar 2012, James Youngman wrote:
>
>> On Tue, Mar 27, 2012 at 3:24 PM, Mark Hills <[email protected]>
>> wrote:
>> > We traverse portions of our filesystem and apply a find action to
>> > them; currently by allowing the shell to expand the glob; eg.
>> >
>> > find ./*/xx/*/yy
>> >
>> > But the expansion can be large and problematic before being passed
>> > to find.
>>
>> I'm not sure what you mean by "problematic" here. It's possible I
>> suppose that the shell runs out of RAM in which to expand the glob,
>> or the results exceed ARG_MAX. I'll assume it's the latter for the
>> purpose of this reply; please correct me if this is the wrong
>> interpretation of what you meant.
>
> Yes, that's correct.
>
>> > To do the equivalent in find itself is slow.
>>
>> How slow? How much slower?
>>
>> > The whole hierarchy is traversed (which is slow), and only matching
>> > results displayed:
>> >
>> > find . -path './*/xx/*/yy'
>>
>> You don't state what the structure of your filesystem hierarchy is,
>> so it is hard to give entirely reliable advice here. I'm going to
>> guess a bit about things like the depth of the tree (which I'm going
>> to guess is large), the total number of files below "." (also large)
>> and the cardinalities of the expansions of "*" in the glob above
>> (also large).
>
> Yes, that's correct.
>
> Also, there are several directories alongside the 'xx' and 'yy'
> directories, themselves also large. It would be wasteful to traverse
> these; they cannot and do not match.
>
>> If that is your whole command line you are certainly using find in
>> an inefficient way. It's hard to say for sure since you don't state
>> what fraction of the whole filesystem hierarchy you need to visit,
>> or what the actions are. However, the predicates -mindepth,
>> -maxdepth, -prune and -quit can be used to limit or terminate the
>> filesystem search.
>
> The command is typically followed by checks against the (non-path)
> attributes; eg. -mtime etc. and then -print or -exec.
>
>> > Is there a way to have find itself only visit the relevant
>> > portions of the filesystem?
>>
>> Certainly. If I knew quite what you meant by "relevant" I could
>> provide a more useful response. Instead I will provide some
>> examples.
>
> To clarify, by "relevant" I meant those which match or could
> potentially match the pattern. At the moment it scans the whole
> hierarchy.
>
>> We start with your original command, which you state as problematic:
>>
>> $ find ./*/xx/*/yy
>>
>> I'm going to assume you really meant you use
>>
>> $ find ./*/xx/*/yy -actions
>>
>> where -actions is some non-empty mixture of find predicates and
>> actions. If -actions already includes -mindepth, -maxdepth, -prune
>> or (most awkwardly) -quit, some of the examples below are going to
>> need adjustment.
>>
>> The simplest rearrangement is
>>
>> $ for start in ./*/xx; do
>>       find "${start}"/*/yy -actions
>>   done
>>
>> This will dramatically cut down the number of arguments passed to
>> each invocation of find, and so may be enough by itself to form a
>> satisfactory solution to your problem.
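
(An aside for concreteness: with the kind of tests you mention above,
such as -mtime followed by -print, that first rearrangement might look
like the sketch below, where -mtime -7 is only an illustrative
placeholder:

    for start in ./*/xx; do
        find "${start}"/*/yy -mtime -7 -print
    done

Each find invocation now sees only the arguments from one
"${start}"/*/yy expansion, rather than the whole ./*/xx/*/yy expansion
at once.)
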
>> If the argument count is still too large you could also try:
>>
>> $ for start in ./*/xx; do
>>       for sub in "${start}"/*/yy; do
>>           find "${sub}" -actions
>>       done
>>   done
>>
>> If you still have a problem with this second option, it's likely
>> that one of the "*"s expands to a sufficiently large list that
>> ARG_MAX is still exceeded. You can overcome this by transforming
>> the loop into find predicates. I'll do this with only the inner
>> loop for simplicity:
>>
>> for sub in "${start}"/*/yy; do
>>     find "${sub}" -actions
>> done
>>
>> becomes
>>
>> find "${start}" -mindepth 2 \( -depth 2 \! -name yy -prune , -true \)
>> -actions
>>
>> If -actions contains tests like -depth, options like -mindepth or
>> -maxdepth, then some adjustment will be needed there.
>
> Thanks for the examples -- you have understood my explanation
> correctly.
>
> I assume that the examples confirm that this kind of selective
> traversal cannot be done in find itself?
>
> My similar solution (which automatically interprets the wildcard)
> was to wrap the glob(3) function in a command which outputs to
> stdout, and use that; eg.
>
> glob './*/xx/*/yy' | xargs -n 100 -I'!!' -- find '!!' -actions
>
> But this, like the examples, is rather unwieldy!
>
>> > The manual [1] seems to suggest using locate and xargs. Keeping an
>> > index is not practical for us,
>>
>> I assume because either the tree changes frequently and multiple
>> independent locate indexes would be no help (since all parts of the
>> tree change frequently).
>
> The tree changes frequently, and is too large (millions of files) to
> feasibly index within usable time.
>
>> > so I wrote a simple command around the glob(3) function to do the
>> > traversal and print to stdout. Am I missing some well established
>> > method here?
>>
>> It's difficult to give a definitive answer here since you don't
>> state what you're actually trying to achieve. I hope the above was
>> useful anyway.
>
> Useful suggestion, thank you.
>
> I'm trying to achieve the functionality of glob match within the find
> command.

It already has this; you used it. The search can be limited to just a
part of the tree using the techniques quoted above.
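
Incidentally, if the main motivation for the glob wrapper you describe
is to keep the expansion out of an execve(2) call, a shell whose
printf is a builtin (bash, for example) can do the same job without a
custom binary. A sketch, assuming GNU xargs; -actions again stands for
your real predicates:

    # printf is a builtin, so the expanded glob never crosses the
    # execve(2) boundary and ARG_MAX does not apply to it.  The NUL
    # delimiters keep unusual filenames safe.
    printf '%s\0' ./*/xx/*/yy |
        xargs -0 sh -c 'find "$@" -actions' sh

xargs splits the starting points into as many find invocations as
needed; the trailing "sh" becomes $0 of the inline script, and "$@"
carries one batch of paths per invocation.
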
> Thanks
>
> --
> Mark