Re: Bug#881692: command-not-found: I re-wrote command-not-found
On Nov 14, 2017 8:15 AM, "Julian Andres Klode"wrote: On Tue, Nov 14, 2017 at 03:35:02PM +, John Lenton wrote: > On 14 November 2017 at 12:34, Julian Andres Klode wrote: > > On Tue, Nov 14, 2017 at 01:00:54PM +0100, Zygmunt Krynicki wrote: > >> I would love if we have a compact representation of mapping from name > >> to list of bits of information where each bit can be a small structure > >> with some data. Apart from components for ubuntu archive it could be > >> used to store facts about snap packages, flatpaks, etc. I would try to > >> avoid a simplistic command -> package mapping as that will force us to > >> encode things into strings in an ad-hoc way. > > > > That makes sense to me. But then we're back on a db, I guess. I sort > > like this minimal approach. > > I was thinking in the other direction, was going to see how it behaved > with sqlite as the store. Would that be objectionable? Using a relational database for a simple key -> structure mapping seems overkill and a mismatch for the problem, and the SQL does not make it more readable. I'd play with lmdb and kyotocabinet, these are two high-performance key-value file databases and then encode a structure as mentioned before. I had some kyotocabinet code, (i maintain that package, which btw is in mentors) but this way is at least half the size. (Kyotocabinet is 1mb and it almost doubles the size of the db, even using lower overhead b-tree back end. These entries are just very small. For the text file approach, we can even go human, readable, like git: git just encodes a number in a fixed-length decimal number, we can do the same, and then just encode (length, key), (length, data) pairs after each other (or as mentioned, just use the "index" as the field id, and store field ids in the progrma). Uses a bit more space, but encodes everything in a format you could read with a text editor, and should not be terribly less efficient. The thing is: This needs to be as efficient as possible: it should be below 100ms (or better 50ms), regardless of whether caches are dropped or not. Python code | Shawn's code SSD, cache 50ms5ms SSD, " dropped 256ms 15ms HDD, cache 50ms5ms HDD, " dropped 530ms 15ms I guess Shawn's code could even be improved in performance by avoiding the subprocess execution, avoiding various ld cache lookups and library loads. I am going to have to bring it in process to add the spell check code. That said, space requirements might matter too. -- Debian Developer - deb.li/jak | jak-linux.org - free software dev Ubuntu Core Developer de, en speaker -- Ubuntu-devel-discuss mailing list Ubuntu-devel-discuss@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss
Re: Bug#881692: command-not-found: I re-wrote command-not-found
On Tue, Nov 14, 2017 at 03:35:02PM +, John Lenton wrote: > On 14 November 2017 at 12:34, Julian Andres Klodewrote: > > On Tue, Nov 14, 2017 at 01:00:54PM +0100, Zygmunt Krynicki wrote: > >> I would love if we have a compact representation of mapping from name > >> to list of bits of information where each bit can be a small structure > >> with some data. Apart from components for ubuntu archive it could be > >> used to store facts about snap packages, flatpaks, etc. I would try to > >> avoid a simplistic command -> package mapping as that will force us to > >> encode things into strings in an ad-hoc way. > > > > That makes sense to me. But then we're back on a db, I guess. I sort > > like this minimal approach. > > I was thinking in the other direction, was going to see how it behaved > with sqlite as the store. Would that be objectionable? Using a relational database for a simple key -> structure mapping seems overkill and a mismatch for the problem, and the SQL does not make it more readable. I'd play with lmdb and kyotocabinet, these are two high-performance key-value file databases and then encode a structure as mentioned before. For the text file approach, we can even go human, readable, like git: git just encodes a number in a fixed-length decimal number, we can do the same, and then just encode (length, key), (length, data) pairs after each other (or as mentioned, just use the "index" as the field id, and store field ids in the progrma). Uses a bit more space, but encodes everything in a format you could read with a text editor, and should not be terribly less efficient. The thing is: This needs to be as efficient as possible: it should be below 100ms (or better 50ms), regardless of whether caches are dropped or not. Python code | Shawn's code SSD, cache 50ms5ms SSD, " dropped 256ms 15ms HDD, cache 50ms5ms HDD, " dropped 530ms 15ms I guess Shawn's code could even be improved in performance by avoiding the subprocess execution, avoiding various ld cache lookups and library loads. That said, space requirements might matter too. -- Debian Developer - deb.li/jak | jak-linux.org - free software dev Ubuntu Core Developer de, en speaker -- Ubuntu-devel-discuss mailing list Ubuntu-devel-discuss@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss
Re: Bug#881692: command-not-found: I re-wrote command-not-found
On 14 November 2017 at 12:34, Julian Andres Klodewrote: > On Tue, Nov 14, 2017 at 01:00:54PM +0100, Zygmunt Krynicki wrote: >> I would love if we have a compact representation of mapping from name >> to list of bits of information where each bit can be a small structure >> with some data. Apart from components for ubuntu archive it could be >> used to store facts about snap packages, flatpaks, etc. I would try to >> avoid a simplistic command -> package mapping as that will force us to >> encode things into strings in an ad-hoc way. > > That makes sense to me. But then we're back on a db, I guess. I sort > like this minimal approach. I was thinking in the other direction, was going to see how it behaved with sqlite as the store. Would that be objectionable? -- Ubuntu-devel-discuss mailing list Ubuntu-devel-discuss@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss
Re: Bug#881692: command-not-found: I re-wrote command-not-found
On Tue, Nov 14, 2017 at 01:00:54PM +0100, Zygmunt Krynicki wrote: > Hey everyone. > > Thank you for your interest in command-not-found. > > On Tue, Nov 14, 2017 at 8:50 AM, Julian Andres Klodewrote: > > (forwarding this to ubuntu-devel-discuss and Zygmunt) > > > > On Mon, Nov 13, 2017 at 10:33:39PM -0800, Shawn Landden wrote: > >> Package: command-not-found > >> Severity: wishlist > >> > >> I re-wrote command-not-found to get rid of the python dependancy, and > >> to reduce the database size, as to reduce memory usage. > >> > >> https://github.com/shawnl/command-not-found > >> > >> I was preparing to upload it to mentors as command-not-found-ng > > > > I also rewrote it years ago, but using the same database format, > > just in C. It was a lot faster. I don't understand the memory usage > > bit - it should not matter how large the database is, it's memory > > mapped, and not read into memory, as such memory usage should be > > roughly constant. > > > > Questions/Comments for your approach: > > > > * Did you test your format on a slow HDD with caches dropped? It > > must not be slower than the Python one (that one is way too slow > > already) - I did, it seems to be faster (0.4 vs 0.68 seconds) > > - I believe the database-based C rewrite was even much faster, > > though. > > > * update-command-not-found should use apt-get indextargets > > > * You don't store components, hence you cannot tell people to enable > > component. That's a very important use case for Ubuntu, where > > not all components are enabled by default, but the database is > > shipped in the package. > > > > You could just append / to each package name I think, > > and strip it away when displaying. > > I would love if we have a compact representation of mapping from name > to list of bits of information where each bit can be a small structure > with some data. Apart from components for ubuntu archive it could be > used to store facts about snap packages, flatpaks, etc. I would try to > avoid a simplistic command -> package mapping as that will force us to > encode things into strings in an ad-hoc way. That makes sense to me. But then we're back on a db, I guess. I sort like this minimal approach. An approach of course would be to store a key/value map after the package, something like: filepackage name followed by multiple: key value where lengths are 32-bit (16 bit?) integers. Should not be too hard. Alternatively, this also works filepackagename for each field value for each field and then you can index stuff offset(attr_i) = offset(attr_i) + attrs[i] Lots of options to extend. > > > * You should use getopt_long() to parse command-line options, and > > support -h, --help :) > > > * pts_lbsearch belongs into usr/lib/..., not usr/share/... > > > > * You don't implement a closest matches function: > > > > $ command-not-found thunderbrd > > No command 'thunderbrd' found, did you mean: > > Command 'thunderbird' from package 'thunderbird' (main) > > thunderbrd: command not found > > $ ./command-not-found thunderbrd > > thunderbrd: command not found > > > >This one is really important. People do make typos or misremember > >command names, so the tool needs to be able to deal with that > > +1 on this, the function should be not too hard to implement in C. > > >Should be easy to implement though, although you might hav > > > * You need to Conflict with command-not-found and not Break AFAIUI > > Ideally, to ease the transition, you should do something about the > python APIs. If yo can keep them (either as pure-python bindings or > just as a compatible implementation) that would be a plus. If you want > to drop them then please announce that and see if anything rdepends on > it. Oh, hmm. > > > * You do have to Depend on apt-file, as that configures apt to download > > the Contents files > > I didn't look at the details but I (hope) this is a build dependency > and this will be processed somewhere on the archive side. That's a Debian-only dependency forced upon us by ftpmaster, on Ubuntu we can ship the data in the package (or preferably a separate command-not-found-data source package). > > > I think it's a worthwhile approach, and I can see it replacing > > command-not-found if those tiny issues have been fixed. Then you > > could also avoid the -ng moniker, and just take over the main > > package (if Zygmunt does not mind), which also avoids a month > > long NEW process :) > > Yes, though I'd like to participate as we're working on > command-not-found improvements in snapd and would like to have > something that fits Debian, Ubuntu as well as (eventually but not > conflicting at least) Fedora and openSUSE (at least the snapd part). I'd like that. -- Debian Developer -
Re: Bug#881692: command-not-found: I re-wrote command-not-found
Hey everyone. Thank you for your interest in command-not-found. On Tue, Nov 14, 2017 at 8:50 AM, Julian Andres Klodewrote: > (forwarding this to ubuntu-devel-discuss and Zygmunt) > > On Mon, Nov 13, 2017 at 10:33:39PM -0800, Shawn Landden wrote: >> Package: command-not-found >> Severity: wishlist >> >> I re-wrote command-not-found to get rid of the python dependancy, and >> to reduce the database size, as to reduce memory usage. >> >> https://github.com/shawnl/command-not-found >> >> I was preparing to upload it to mentors as command-not-found-ng > > I also rewrote it years ago, but using the same database format, > just in C. It was a lot faster. I don't understand the memory usage > bit - it should not matter how large the database is, it's memory > mapped, and not read into memory, as such memory usage should be > roughly constant. > > Questions/Comments for your approach: > > * Did you test your format on a slow HDD with caches dropped? It > must not be slower than the Python one (that one is way too slow > already) - I did, it seems to be faster (0.4 vs 0.68 seconds) > - I believe the database-based C rewrite was even much faster, > though. > * update-command-not-found should use apt-get indextargets > * You don't store components, hence you cannot tell people to enable > component. That's a very important use case for Ubuntu, where > not all components are enabled by default, but the database is > shipped in the package. > > You could just append / to each package name I think, > and strip it away when displaying. I would love if we have a compact representation of mapping from name to list of bits of information where each bit can be a small structure with some data. Apart from components for ubuntu archive it could be used to store facts about snap packages, flatpaks, etc. I would try to avoid a simplistic command -> package mapping as that will force us to encode things into strings in an ad-hoc way. > * You should use getopt_long() to parse command-line options, and > support -h, --help :) > * pts_lbsearch belongs into usr/lib/..., not usr/share/... > > * You don't implement a closest matches function: > > $ command-not-found thunderbrd > No command 'thunderbrd' found, did you mean: > Command 'thunderbird' from package 'thunderbird' (main) > thunderbrd: command not found > $ ./command-not-found thunderbrd > thunderbrd: command not found > >This one is really important. People do make typos or misremember >command names, so the tool needs to be able to deal with that +1 on this, the function should be not too hard to implement in C. >Should be easy to implement though, although you might hav > * You need to Conflict with command-not-found and not Break AFAIUI Ideally, to ease the transition, you should do something about the python APIs. If yo can keep them (either as pure-python bindings or just as a compatible implementation) that would be a plus. If you want to drop them then please announce that and see if anything rdepends on it. > * You do have to Depend on apt-file, as that configures apt to download > the Contents files I didn't look at the details but I (hope) this is a build dependency and this will be processed somewhere on the archive side. > I think it's a worthwhile approach, and I can see it replacing > command-not-found if those tiny issues have been fixed. Then you > could also avoid the -ng moniker, and just take over the main > package (if Zygmunt does not mind), which also avoids a month > long NEW process :) Yes, though I'd like to participate as we're working on command-not-found improvements in snapd and would like to have something that fits Debian, Ubuntu as well as (eventually but not conflicting at least) Fedora and openSUSE (at least the snapd part). Best regards ZK -- Ubuntu-devel-discuss mailing list Ubuntu-devel-discuss@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss
Re: Bug#881692: command-not-found: I re-wrote command-not-found
(forwarding this to ubuntu-devel-discuss and Zygmunt) On Mon, Nov 13, 2017 at 10:33:39PM -0800, Shawn Landden wrote: > Package: command-not-found > Severity: wishlist > > I re-wrote command-not-found to get rid of the python dependancy, and > to reduce the database size, as to reduce memory usage. > > https://github.com/shawnl/command-not-found > > I was preparing to upload it to mentors as command-not-found-ng I also rewrote it years ago, but using the same database format, just in C. It was a lot faster. I don't understand the memory usage bit - it should not matter how large the database is, it's memory mapped, and not read into memory, as such memory usage should be roughly constant. Questions/Comments for your approach: * Did you test your format on a slow HDD with caches dropped? It must not be slower than the Python one (that one is way too slow already) - I did, it seems to be faster (0.4 vs 0.68 seconds) - I believe the database-based C rewrite was even much faster, though. * update-command-not-found should use apt-get indextargets * You don't store components, hence you cannot tell people to enable component. That's a very important use case for Ubuntu, where not all components are enabled by default, but the database is shipped in the package. You could just append / to each package name I think, and strip it away when displaying. * You should use getopt_long() to parse command-line options, and support -h, --help :) * pts_lbsearch belongs into usr/lib/..., not usr/share/... * You don't implement a closest matches function: $ command-not-found thunderbrd No command 'thunderbrd' found, did you mean: Command 'thunderbird' from package 'thunderbird' (main) thunderbrd: command not found $ ./command-not-found thunderbrd thunderbrd: command not found This one is really important. People do make typos or misremember command names, so the tool needs to be able to deal with that Should be easy to implement though, although you might have to search multiple times - once for each alternative. All you need is def similar_words(word): """ return a set with spelling1 distance alternative spellings based on http://norvig.com/spell-correct.html""; alphabet = 'abcdefghijklmnopqrstuvwxyz-_' s = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes= [a + b[1:] for a, b in s if b] transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1] replaces = [a + c + b[1:] for a, b in s for c in alphabet if b] inserts= [a + c + b for a, b in s for c in alphabet] return set(deletes + transposes + replaces + inserts) And search for what that returns. And you don't need to search for those at all if you have a direct match. * It needs to be translated - also very important. * You need to Conflict with command-not-found and not Break AFAIUI * You should not depend on grep, sed, coreutils, they are Essential. * You do have to Depend on apt-file, as that configures apt to download the Contents files * You should not have identifiers starting with _ in the program, these are reserved for the C implementation (like _cleanup_free_). Yes, and these are basically the same reasons my C prototype is not in the archive. Also, I did not put a lot of work into it, as I was waiting for PackageKit to take that over, but that was not done yet. I think it's a worthwhile approach, and I can see it replacing command-not-found if those tiny issues have been fixed. Then you could also avoid the -ng moniker, and just take over the main package (if Zygmunt does not mind), which also avoids a month long NEW process :) -- Debian Developer - deb.li/jak | jak-linux.org - free software dev Ubuntu Core Developer de, en speaker -- Ubuntu-devel-discuss mailing list Ubuntu-devel-discuss@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss