Re: fstrcmp
On Tue, 2009-06-02 at 10:28 -0400, Michael Poole wrote: > I have some relevant experience that makes me skeptical about needing > sophisticated structures to achieve acceptable performance Some concrete numbers: - naive fstrcmp comparisons (using edit distance, lifted from GNU Gettext) - list of names obtained from apt-cache dumpavail - 2.1GHz x86 - looking for "dnsutl" 0.08008 = 0.215 sec / 26866 pair that's 8.0us per fstrcmp call - looking for "dns-server-utilities" 0.15816 = 0.425 sec / 26866 pair that's 15us per fstrcmp call For me, adding ~0.3 second to the error case in order to provide a far better error message would appear to be worth it, especially as this is less than the existing 0.532s reported by "time sudo apt-get install dnsutl" (not including I/O time) on the same system. I/O time when none of the package data is in cache is > 9 seconds, confirming what Martin has been saying... and ~30 times longer than the fstrcmp overhead. I will have version 0.1 ready shortly. Regards Peter Miller /\/\*http://miller.emu.id.au/pmiller/ PGP public key ID: 1024D/D0EDB64D fingerprint = AD0A C5DF C426 4F03 5D53 2BDB 18D8 A4E2 D0ED B64D See http://www.keyserver.net or any PGP keyserver for public key. "A computer is like air conditioning: it becomes useless when you open windows." -- Linus Torvalds -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
Jérôme Pouiller writes: > In another thread, Adeodato Simó wrote: >> I can't see how it'd work here, at least without the help of some >> on-disk structure, since we're talking about a space of 25,000 >> packages. > > Naive search of matching string under a set of 25,000 strings is > something like 2000 times slower (maybe far more) than a correct > algorithm. Adeodato thinks it is not usable. I said we shouldn't reject > idea because of performances and I suggested a better algorithm should > be used. If we are pulling numbers out of the air, reading data from disk is something like 2000 times slower (maybe far more) than doing string comparisons. I/O may dominate the running time even for a naive comparison. I have some relevant experience that makes me skeptical about needing sophisticated structures to achieve acceptable performance: Ten years ago, I wrote a simple spell checker as part of a job interview. It was able to suggest (ranked) close matches for misspelled words from a ~100,000-word dictionary within a few milliseconds. Given that I had less than an hour's notice about what I would have to write, and only about 4 hours to write and test the program, I would be very surprised if performance for any reasonable algorithm on more modern machines will be bad enough to make it unusable. It is not as if the application code should require significant rewriting based on whether it performs an element-wise comparison or whether it passes some function an array of strings against which to match. The library code is what would need much deeper changes. So why not see how well the existing library works before demanding that it be radically rewritten? If you want to implement a superior algorithm, no one will stop you. If the fuzzy match only runs when no exact match is found, it suffers no slowdown on success and may be fast enough when there is incorrect input. If there are people who think it is unacceptably slow, let them come forward with actual numbers in hand. Insisting on much more development work because of an unsubstantiated concern seems like a poor trade-off. Michael -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
On Tue, Jun 02, 2009 at 02:32:06PM +0200, Jérôme Pouiller was heard to say: > In another thread, Adeodato Simó wrote: > > I can't see how it'd work here, at least without the help of some > > on-disk structure, since we're talking about a space of 25,000 > > packages. > > Naive search of matching string under a set of 25,000 strings is > something like 2000 times slower (maybe far more) than a correct > algorithm. Adeodato thinks it is not usable. I said we shouldn't reject > idea because of performances and I suggested a better algorithm should > be used. aptitude already scans every package regularly. For instance, when you ask it to install something that doesn't exist, it checks every package name using a naive substring algorithm. According to "time", this takes 1.5 to 1.7 seconds on my computer, most of which appears to be spent initializing the program. Fuzzy matches are more expensive, of course. It might even be interesting to just do an edit distance computation up to k=1 and see how that performed, though; I bet it would be reasonable with some mild optimization. Daniel -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
On Tuesday 02 June 2009 12:32:47 Michael Poole wrote: > Jérôme Pouiller writes: [...] > > It is naive to think matching algorithm iterates on all items until > > it find the correct one. At least, algorithm use a sorted index > > with a dichotomy search. > > > > Nevertheless, your idea is interesting. But you should implement a > > function to match the nearest string in a set of strings. Take a > > look in spell checking libraries to have an idea how to implement > > it. > > Isn't this optimization premature? I would say: Package the library, > implement the fuzzy matching, and if it is too slow for people to > like the case where they misspell a package name, *then* optimize for > run-time. I would rather have the fuzzy matching sooner than have it > shave a few milliseconds off the display time for a correction. In another thread, Adeodato Simó wrote: > I can't see how it'd work here, at least without the help of some > on-disk structure, since we're talking about a space of 25,000 > packages. Naive search of matching string under a set of 25,000 strings is something like 2000 times slower (maybe far more) than a correct algorithm. Adeodato thinks it is not usable. I said we shouldn't reject idea because of performances and I suggested a better algorithm should be used. -- Jérôme Pouiller (jezz AT sysmic DOT org) -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
Jérôme Pouiller writes: > On Sunday 31 May 2009 03:49:37 Peter Miller wrote: > [...] >>This goes for packages as well. Wouldn't it be great if >> >>apt-get install dns-utils >> >>instead of saying >> >>E: Couldn't find package dns-utils >> >>it said something more useful, like >> >>E: Couldn't find package dns-utils, did you mean dnsutils instead? > > It is naive to think matching algorithm iterates on all items until it > find the correct one. At least, algorithm use a sorted index with a > dichotomy search. > > Nevertheless, your idea is interesting. But you should implement a > function to match the nearest string in a set of strings. Take a look in > spell checking libraries to have an idea how to implement it. Isn't this optimization premature? I would say: Package the library, implement the fuzzy matching, and if it is too slow for people to like the case where they misspell a package name, *then* optimize for run-time. I would rather have the fuzzy matching sooner than have it shave a few milliseconds off the display time for a correction. Michael Poole -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
On Sunday 31 May 2009 03:49:37 Peter Miller wrote: [...] >This goes for packages as well. Wouldn't it be great if > >apt-get install dns-utils > >instead of saying > >E: Couldn't find package dns-utils > >it said something more useful, like > >E: Couldn't find package dns-utils, did you mean dnsutils instead? It is naive to think matching algorithm iterates on all items until it find the correct one. At least, algorithm use a sorted index with a dichotomy search. Nevertheless, your idea is interesting. But you should implement a function to match the nearest string in a set of strings. Take a look in spell checking libraries to have an idea how to implement it. -- Jérôme Pouiller (jezz AT sysmic DOT org) -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
On Sun, 2009-05-31 at 11:04 +0200, Florian Weimer wrote: > * Peter Miller: > > > I've been considering turning my fuzzy string compare function into a > > library. > > I would certainly welcome that. > > Would you be willing to relicense it under a more permissive license, > so that we don't have to worry about OpenSSL license compatibility > etc.? > > > /** > > * the fstrcmp function compare two strings, to determine how > > * similar two strings appear. > > * > > * @param s1 > > * The first of the strings to compare. > > * @param s2 > > * The second of the strings to compare. > > * @returns > > * a number between 0.0 and 1.0; 0.0 means the strings are > > * nothing alike, 1.0 means the two strings are identical. > > */ > > double fstrcmp(const char *s1, const char *s2); > > It could be helpful if it didn't use floating point because we support > some systems where floating point is software-emulated. > > I don't think we've got a library of C goodies. libbsd is something > in that direction, but if your function isn't in the BSDs, it probably > doesn't fit there, either. There is libmowgli (don't ask why it's named this, we just picked something random), but it requires all core modules to be licensed under ISC. There is some Windows portability components that are LGPL. Unfortunately, we're renovating the atheme.org website right now, so instructions to submit additional components are not yet available. William signature.asc Description: This is a digitally signed message part
Re: fstrcmp
+ Peter Miller (Sun, 31 May 2009 11:49:37 +1000): > Wouldn't it be great if when you typed > apt-get build-deps gcc > instead of saying > E: Invalid operation build-deps > it said something more useful, like > E: Invalid operation build-deps, did you mean build-dep instead? I guess that's would be doable, since it'd involve calculating distances to all registered commands, which are only a handful. However: > This goes for packages as well. Wouldn't it be great if > apt-get install dns-utils > instead of saying > E: Couldn't find package dns-utils > it said something more useful, like > E: Couldn't find package dns-utils, did you mean dnsutils instead? I can't see how it'd work here, at least without the help of some on-disk structure, since we're talking about a space of 25,000 packages. -- - Are you sure we're good? - Always. -- Rory and Lorelai -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: fstrcmp
* Peter Miller: > I've been considering turning my fuzzy string compare function into a > library. I would certainly welcome that. Would you be willing to relicense it under a more permissive license, so that we don't have to worry about OpenSSL license compatibility etc.? > /** > * the fstrcmp function compare two strings, to determine how > * similar two strings appear. > * > * @param s1 > * The first of the strings to compare. > * @param s2 > * The second of the strings to compare. > * @returns > * a number between 0.0 and 1.0; 0.0 means the strings are > * nothing alike, 1.0 means the two strings are identical. > */ > double fstrcmp(const char *s1, const char *s2); It could be helpful if it didn't use floating point because we support some systems where floating point is software-emulated. I don't think we've got a library of C goodies. libbsd is something in that direction, but if your function isn't in the BSDs, it probably doesn't fit there, either. -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org