subject:"\"Re\\\: fstrcmp\""

Re: fstrcmp

2009-06-07 Thread Peter Miller

On Tue, 2009-06-02 at 10:28 -0400, Michael Poole wrote:
> I have some relevant experience that makes me skeptical about needing
> sophisticated structures to achieve acceptable performance

Some concrete numbers:

- naive fstrcmp comparisons (using edit distance, lifted from GNU
Gettext)
- list of names obtained from apt-cache dumpavail
- 2.1GHz x86

- looking for "dnsutl"

0.08008 = 0.215 sec / 26866 pair
that's 8.0us per fstrcmp call

- looking for "dns-server-utilities"

0.15816 = 0.425 sec / 26866 pair
that's 15us per fstrcmp call

For me, adding ~0.3 second to the error case in order to provide a far
better error message would appear to be worth it, especially as this is
less than the existing 0.532s reported by "time sudo apt-get install
dnsutl" (not including I/O time) on the same system.

I/O time when none of the package data is in cache is > 9 seconds,
confirming what Martin has been saying... and ~30 times longer than the
fstrcmp overhead.

I will have version 0.1 ready shortly.

Regards
Peter Miller 
/\/\*http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"A computer is like air conditioning: it becomes useless when you open
windows." -- Linus Torvalds

-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-06-02 Thread Michael Poole

Jérôme Pouiller writes:

> In another thread, Adeodato Simó wrote:
>> I can't see how it'd work here, at least without the help of some
>> on-disk structure, since we're talking about a space of 25,000
>> packages.
>
> Naive search of matching string under a set of 25,000 strings is 
> something like 2000 times slower (maybe far more) than a correct 
> algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
> idea because of performances and I suggested a better algorithm should 
> be used.

If we are pulling numbers out of the air, reading data from disk is
something like 2000 times slower (maybe far more) than doing string
comparisons.  I/O may dominate the running time even for a naive
comparison.

I have some relevant experience that makes me skeptical about needing
sophisticated structures to achieve acceptable performance: Ten years
ago, I wrote a simple spell checker as part of a job interview.  It
was able to suggest (ranked) close matches for misspelled words from a
~100,000-word dictionary within a few milliseconds.  Given that I had
less than an hour's notice about what I would have to write, and only
about 4 hours to write and test the program, I would be very surprised
if performance for any reasonable algorithm on more modern machines
will be bad enough to make it unusable.

It is not as if the application code should require significant
rewriting based on whether it performs an element-wise comparison or
whether it passes some function an array of strings against which to
match.  The library code is what would need much deeper changes.  So
why not see how well the existing library works before demanding that
it be radically rewritten?

If you want to implement a superior algorithm, no one will stop you.
If the fuzzy match only runs when no exact match is found, it suffers
no slowdown on success and may be fast enough when there is incorrect
input.  If there are people who think it is unacceptably slow, let
them come forward with actual numbers in hand.  Insisting on much more
development work because of an unsubstantiated concern seems like a
poor trade-off.

Michael

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-06-02 Thread Daniel Burrows

On Tue, Jun 02, 2009 at 02:32:06PM +0200, Jérôme Pouiller  was 
heard to say:
> In another thread, Adeodato Simó wrote:
> > I can't see how it'd work here, at least without the help of some
> > on-disk structure, since we're talking about a space of 25,000
> > packages.
> 
> Naive search of matching string under a set of 25,000 strings is 
> something like 2000 times slower (maybe far more) than a correct 
> algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
> idea because of performances and I suggested a better algorithm should 
> be used.

  aptitude already scans every package regularly.  For instance, when
you ask it to install something that doesn't exist, it checks every
package name using a naive substring algorithm.  According to "time",
this takes 1.5 to 1.7 seconds on my computer, most of which appears
to be spent initializing the program.

  Fuzzy matches are more expensive, of course.  It might even be
interesting to just do an edit distance computation up to k=1 and see
how that performed, though; I bet it would be reasonable with some mild
optimization.

  Daniel

-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-06-02 Thread Jérôme Pouiller

On Tuesday 02 June 2009 12:32:47 Michael Poole wrote:
> Jérôme Pouiller writes:
[...]
> > It is naive to think matching algorithm iterates on all items until
> > it find the correct one. At least, algorithm use a sorted index
> > with a dichotomy search.
> >
> > Nevertheless, your idea is interesting. But you should implement a
> > function to match the nearest string in a set of strings. Take a
> > look in spell checking libraries to have an idea how to implement
> > it.
>
> Isn't this optimization premature?  I would say: Package the library,
> implement the fuzzy matching, and if it is too slow for people to
> like the case where they misspell a package name, *then* optimize for
> run-time.  I would rather have the fuzzy matching sooner than have it
> shave a few milliseconds off the display time for a correction.

In another thread, Adeodato Simó wrote:
> I can't see how it'd work here, at least without the help of some
> on-disk structure, since we're talking about a space of 25,000
> packages.

Naive search of matching string under a set of 25,000 strings is 
something like 2000 times slower (maybe far more) than a correct 
algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
idea because of performances and I suggested a better algorithm should 
be used.


-- 
Jérôme Pouiller (jezz AT sysmic DOT org)


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-06-02 Thread Michael Poole

Jérôme Pouiller writes:

> On Sunday 31 May 2009 03:49:37 Peter Miller wrote:
> [...]
>>This goes for packages as well.  Wouldn't it be great if
>>
>>apt-get install dns-utils
>>
>>instead of saying
>>
>>E: Couldn't find package dns-utils
>>
>>it said something more useful, like
>>
>>E: Couldn't find package dns-utils, did you mean dnsutils instead?
>
> It is naive to think matching algorithm iterates on all items until it 
> find the correct one. At least, algorithm use a sorted index with a 
> dichotomy search. 
>
> Nevertheless, your idea is interesting. But you should implement a 
> function to match the nearest string in a set of strings. Take a look in 
> spell checking libraries to have an idea how to implement it.

Isn't this optimization premature?  I would say: Package the library,
implement the fuzzy matching, and if it is too slow for people to like
the case where they misspell a package name, *then* optimize for
run-time.  I would rather have the fuzzy matching sooner than have it
shave a few milliseconds off the display time for a correction.

Michael Poole


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-06-02 Thread Jérôme Pouiller

On Sunday 31 May 2009 03:49:37 Peter Miller wrote:
[...]
>This goes for packages as well.  Wouldn't it be great if
>
>apt-get install dns-utils
>
>instead of saying
>
>E: Couldn't find package dns-utils
>
>it said something more useful, like
>
>E: Couldn't find package dns-utils, did you mean dnsutils instead?

It is naive to think matching algorithm iterates on all items until it 
find the correct one. At least, algorithm use a sorted index with a 
dichotomy search. 

Nevertheless, your idea is interesting. But you should implement a 
function to match the nearest string in a set of strings. Take a look in 
spell checking libraries to have an idea how to implement it.

-- 
Jérôme Pouiller (jezz AT sysmic DOT org)

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-05-31 Thread William Pitcock

On Sun, 2009-05-31 at 11:04 +0200, Florian Weimer wrote:
> * Peter Miller:
> 
> > I've been considering turning my fuzzy string compare function into a
> > library.
> 
> I would certainly welcome that.
> 
> Would you be willing to relicense it under a more permissive license,
> so that we don't have to worry about OpenSSL license compatibility
> etc.?
> 
> > /**
> >  * the fstrcmp function compare two strings, to determine how
> >  * similar two strings appear.
> >  *
> >  * @param s1
> >  * The first of the strings to compare.
> >  * @param s2
> >  * The second of the strings to compare.
> >  * @returns
> >  * a number between 0.0 and 1.0; 0.0 means the strings are
> >  * nothing alike, 1.0 means the two strings are identical.
> >  */
> > double fstrcmp(const char *s1, const char *s2);
> 
> It could be helpful if it didn't use floating point because we support
> some systems where floating point is software-emulated.
> 
> I don't think we've got a library of C goodies.  libbsd is something
> in that direction, but if your function isn't in the BSDs, it probably
> doesn't fit there, either.

There is libmowgli (don't ask why it's named this, we just picked
something random), but it requires all core modules to be licensed under
ISC.  There is some Windows portability components that are LGPL.

Unfortunately, we're renovating the atheme.org website right now, so
instructions to submit additional components are not yet available.

William


signature.asc
Description: This is a digitally signed message part

Re: fstrcmp

2009-05-31 Thread Adeodato Simó

+ Peter Miller (Sun, 31 May 2009 11:49:37 +1000):

> Wouldn't it be great if when you typed
> apt-get build-deps gcc

> instead of saying
> E: Invalid operation build-deps

> it said something more useful, like
> E: Invalid operation build-deps, did you mean build-dep instead?

I guess that's would be doable, since it'd involve calculating distances
to all registered commands, which are only a handful. However:

> This goes for packages as well.  Wouldn't it be great if
> apt-get install dns-utils

> instead of saying
> E: Couldn't find package dns-utils

> it said something more useful, like
> E: Couldn't find package dns-utils, did you mean dnsutils instead?

I can't see how it'd work here, at least without the help of some
on-disk structure, since we're talking about a space of 25,000 packages.

-- 
- Are you sure we're good?
- Always.
-- Rory and Lorelai


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

2009-05-31 Thread Florian Weimer

* Peter Miller:

> I've been considering turning my fuzzy string compare function into a
> library.

I would certainly welcome that.

Would you be willing to relicense it under a more permissive license,
so that we don't have to worry about OpenSSL license compatibility
etc.?

> /**
>  * the fstrcmp function compare two strings, to determine how
>  * similar two strings appear.
>  *
>  * @param s1
>  * The first of the strings to compare.
>  * @param s2
>  * The second of the strings to compare.
>  * @returns
>  * a number between 0.0 and 1.0; 0.0 means the strings are
>  * nothing alike, 1.0 means the two strings are identical.
>  */
> double fstrcmp(const char *s1, const char *s2);

It could be helpful if it didn't use floating point because we support
some systems where floating point is software-emulated.

I don't think we've got a library of C goodies.  libbsd is something
in that direction, but if your function isn't in the BSDs, it probably
doesn't fit there, either.


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

Re: fstrcmp

9 matches

Site Navigation

Mail list logo

Footer information