[gentoo-portage-dev] search functionality in emerge
Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!! Anyway, for my final project in my Data Structures Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although, lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, it seems that when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependecies - whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!) So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing. However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!! Thanks a lot, and keep on making my Debian-using professor jealous :] Emma
Re: [gentoo-portage-dev] search functionality in emerge
El dom, 23-11-2008 a las 16:01 +0200, tvali escribió: Try esearch. emerge esearch esearch ... 2008/11/23 Emma Strubell [EMAIL PROTECTED] Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!! Anyway, for my final project in my Data Structures Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although, lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, it seems that when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependecies - whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!) So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing. However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!! Thanks a lot, and keep on making my Debian-using professor jealous :] Emma -- tvali Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad. Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi täica pea nagu prügikast... I use eix: emerge eix ;-)
Re: [gentoo-portage-dev] search functionality in emerge
Thanks for the replies! I know there are a couple programs out there that basically already do what I'm looking to do... Unfortunately I wasn't aware of these pre-existing utilities until after I submitted my project proposal to my professor. So, I'm looking to implement a better search myself. Preferably by editing the existing portage code, not writing a separate program. So if anyone can offer any help regarding the actual implementation of search in portage, I would greatly appreciate it! Or, if anyone has an idea for a more productive/useful project I could work on relating to portage (about the same difficulty, preferably at least a little bit data structure related), please, let me know! Thanks again guys, Emma On Sun, Nov 23, 2008 at 9:33 AM, Pacho Ramos [EMAIL PROTECTED] wrote: El dom, 23-11-2008 a las 16:01 +0200, tvali escribió: Try esearch. emerge esearch esearch ... 2008/11/23 Emma Strubell [EMAIL PROTECTED] Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!! Anyway, for my final project in my Data Structures Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although, lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, it seems that when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependecies - whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!) So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing. However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!! Thanks a lot, and keep on making my Debian-using professor jealous :] Emma -- tvali Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad. Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi täica pea nagu prügikast... I use eix: emerge eix ;-)
Re: [gentoo-portage-dev] search functionality in emerge
Emma, It would be great it you could speed search up a bit! As these other guys have pointed out, we do have some indexing tools in Gentoo already. Most users don't understand why that kind of functionality isn't built directly into Portage, but IIRC it has something to do with the fact that these fast search indexes aren't able to be written to by more than one process at the same time, so for example if you had two emerges finishing at the same time, Portage's current flat hash file can handle that, but the faster db-based indexes can't. Anyways, that's the way I, as a curious user, understood the problem. You might be interested in reading this, very old forum thread about a previous attempt: http://forums.gentoo.org/viewtopic-t-261580-postdays-0-postorder-asc-start-0.html On Sun, Nov 23, 2008 at 11:33 PM, Pacho Ramos [EMAIL PROTECTED] wrote: El dom, 23-11-2008 a las 16:01 +0200, tvali escribió: Try esearch. emerge esearch esearch ... 2008/11/23 Emma Strubell [EMAIL PROTECTED] Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!! Anyway, for my final project in my Data Structures Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although, lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, it seems that when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependecies - whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!) So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing. However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!! Thanks a lot, and keep on making my Debian-using professor jealous :] Emma -- tvali Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad. Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi täica pea nagu prügikast... I use eix: emerge eix ;-)
Re: [gentoo-portage-dev] search functionality in emerge
Wow, that's extremely helpful!! I happen to particularly enjoy tries, so the suffix trie sounds like a great idea. The trie class example is really helpful too, because this will be my first time programming in Python, and it's a bit easier to figure out what's going on syntax-wise in that simple trie class than in the middle of the portage source code! Seriously, thanks again :] On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston [EMAIL PROTECTED]wrote: Thanks for the replies! I know there are a couple programs out there that basically already do what I'm looking to do... Unfortunately I wasn't aware of these pre-existing utilities until after I submitted my project proposal to my professor. So, I'm looking to implement a better search myself. Preferably by editing the existing portage code, not writing a separate program. So if anyone can offer any help regarding the actual implementation of search in portage, I would greatly appreciate it! Most of the search implementation is in /usr/lib/portage/pym/_emerge/__init__.py in class search. The class's execute() method simply iterates over all packages (and descriptions and package sets) and matches against the searchkey. You might need to look into pym/portage/dbapi/porttree.py for portdbapi as well. If you intend to index and support fast regex lookup, then you need to do some fancy indexing, which I'm not terribly familiar with. You could follow in the steps of eix[1] or other indexed search utilities and design some sort of index layout, which is easier than the following thought. You might consider implementing a suffix trie or similar that has sublinear regexp lookup and marshalling the structure for the index. I couldn't find a python implementation for something like this, but here is a general trie class[2] that you might start with if you go that route. There is a perl module[3], Tie::Hash::Regex, that does that, but properly implementing that in python would be a chore. :) That project sounds interesting and fun. Good luck! Lucian Poston [1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout [2] http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx [3] http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pmhttp://search.cpan.org/%7Edavecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm
Re: [gentoo-portage-dev] search functionality in emerge
Yes if it would be a low-level implementation to portage, speeding up it's native code for searching and using indexes, then it would make everything faster, including emerge (because emerge does search first for package relations). I have actually wanted to do it myself several years ago, so reacting here to have my ideas discussed, too. Douglas Anderson 16:46 reply is about locks and I think that it would need to rethink portages locking methods - what, when and why it locks. This is probably quite hard task by itself. Anyway, as portage actually lets user make two emerges at the same time, locks might be OK, too. I think that the best thing would be bottom-up refactoring - first to make a list of lowest-level functions, which have to do with reading data from portage tree or writing into it; then making indexer class, which will be used by all of those low-level functions. To have it OOP, it should be implemented in such way: - Low-level portage tree handler does everything with portage tree, no function in portage uses it directly. - Tree handler has several needed and several optional methods - so that implementing new handler would be easy, but creating things like native regex search would be possible. - One could implement a new tree handler with SQLite or other interface instead of filesystem and do other tricks through this interface; for example, boost it. So, nice way to go would be: 1. Implementing portage tree handler and it's proxy, which uses current portage tree in non-indexed way and simply gives methods for the same kind of access, as currently implemented one. 2. Refactoring portage to rely only on portage tree handler and use direct portage tree nowhere. To test if it is so, portage tree should be moved to another directory, about which only this handler knows, and check if portage works well. Indexing all places, where portage uses it's tree handler (by std. comment, for example) and making clear, which methods would contain all boostable code of it. 3. Implementing those methods in proxy, which could simulate fast regex search and other stuff using simplest possible interface of portage tree handler (smth. like four methods add, remove, get, list). Proxy should be able to use handler's methods if they are implemented. 4. Refactoring portage to use advanced methods in proxy. 5. Now, having taken all code together into one place and looking this nice and readable code, real optimizations could be discussed here, for example. Ideally, i think, portage would have such tree handlers: - Filesystem handler - fast searches over current portage tree structure. - SQL handler - rewrite of tree functions into SQL queries. - Network-based handler - maybe sometimes it would nice to have portage tree only in one machine of cluster, for example if I want to have 100 really small computers with fast connection with mother-computer and portage tree is too big to be wholly copied to all of them. - Memory-buffered handler with daemon, which is actually proxy to some other handler - daemon, which reads whole tree (from filesystem or SQL) into memory on boot or first use, creates really fast index (because now it does matter to have better indexing) and optionally deletes some [less needed] parts of it's index from memory when it's becoming full and behaves as really simple proxy if it stays full. This should be implemented after critical parts of filesystem or SQL handler. 2008/11/23 Emma Strubell [EMAIL PROTECTED] Wow, that's extremely helpful!! I happen to particularly enjoy tries, so the suffix trie sounds like a great idea. The trie class example is really helpful too, because this will be my first time programming in Python, and it's a bit easier to figure out what's going on syntax-wise in that simple trie class than in the middle of the portage source code! Seriously, thanks again :] On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston [EMAIL PROTECTED]wrote: Thanks for the replies! I know there are a couple programs out there that basically already do what I'm looking to do... Unfortunately I wasn't aware of these pre-existing utilities until after I submitted my project proposal to my professor. So, I'm looking to implement a better search myself. Preferably by editing the existing portage code, not writing a separate program. So if anyone can offer any help regarding the actual implementation of search in portage, I would greatly appreciate it! Most of the search implementation is in /usr/lib/portage/pym/_emerge/__init__.py in class search. The class's execute() method simply iterates over all packages (and descriptions and package sets) and matches against the searchkey. You might need to look into pym/portage/dbapi/porttree.py for portdbapi as well. If you intend to index and support fast regex
Re: [gentoo-portage-dev] search functionality in emerge
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Mike Auty schrieb: Finally there are overlays, and since these can change outside of an emerge --sync (as indeed can the main tree), you'll have to reindex these before each search request, or give the user stale data until they manually reindex. Determining whether there has been a change to the ebuild system is a major point in the whole thing. What does a great index serves you, if it does not notice the changes the user made in his own local overlay? :) Manually re-indexing is not a good choice I think... If somebody comes up here with a good (and fast) solution, this would be a nice thing ;) (need it myself). Regards, René -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkkp0kAACgkQ4UOg/zhYFuAhTACfYDxNeQQG6dysgU5TrNEZGOiH 3CoAn2wV6g8/8uj+T99cxJGdQBxTtZjI =2I2j -END PGP SIGNATURE-
Re: [gentoo-portage-dev] search functionality in emerge
On Sun, 23 Nov 2008 07:17:40 -0500 Emma Strubell [EMAIL PROTECTED] wrote: However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!! Well, it depends how much effort you want to put into this. The current interface doesn't actually provide a search interface, but merely functions to 1) list all package names - dbapi.cp_all() 2) list all package names and versions - dbapi.cpv_all() 3) list all versions for a given package name - dbapi.cp_list() 4) read metadata (like DESCRIPTION) for a given package name and version - dbapi.aux_get() One of the main performance problems of --search is that there is no persistent cache for functions 1, 2 and 3, so if you're just interested in performance aspects you might want to look into that. The issue with implementing a persistent cache is that you have to consider both cold and hot filesystem cache cases: Loading an index file with package names and versions might improve the cold-cache case, but slow things down when the filesystem cache is populated. As has been mentioned, keeping the index updated is the other major issue, especially as it has to be portable and should require little or no configuration/setup for the user (so no extra daemons or special filesystems running permanently in the background). The obvious solution would be to generate the cache after `emerge --sync` (and other sync implementations) and hope that people don't modify their tree and search for the changes in between (that's what all the external tools do). I don't know if there is actually a way to do online updates while still improving performance and not relying on custom system daemons running in the background. As for --searchdesc, one problem is that dbapi.aux_get() can only operate on a single package-version on each call (though it can read multiple metadata variables). So for description searches the control flow is like this (obviously simplified): result = [] # iterate over all packages for package in dbapi.cp_all(): # determine the current version of each package, this is # another performance issue. version = get_current_version(package) # read package description from metadata cache description = dbapi.aux_get(version, [DESCRIPTION])[0] # check if the description matches if matches(description, searchkey): result.append(package) There you see the three bottlenecks: the lack of a pregenerated package list, the version lookup for *each* package and the actual metadata read. I've already talked about the first, so lets look at the other two. The core problem there is that DESCRIPTION (like all standard metadata variables) is version specific, so to access it you need to determine a version to use, even though in almost all cases the description is the same (or very similar) for all versions. So the proper solution would be to make the description a property of the package name instead of the package version, but that's a _huge_ task you're probably not interested in. What _might_ work here is to add support for an optional package-name-description cache that can be generated offline and includes those packages where all versions have the same description, and fall back to the current method if the package is not included in the cache. (Don't think about caching the version lookup, that's system dependent and therefore not suitable for caching). Hope it has become clear that while the actual search algorithm might be simple and not very efficient, the real problem lies in getting the data to operate on. That and the somewhat limited dbapi interface. Disclaimer: The stuff below involves extending and redesigning some core portage APIs. This isn't something you can do on a weekend, only work on this if you want to commit yourself to portage development for a long time. The functions listed above are the bare minimum to perform queries on the package repositories, but they're very low-level. That means that whenever you want to select packages by name, description, license, dependencies or other variables you need quite a bit of custom code, more if you want to combine multiple searches, and much more if you want to do it efficient and flexible. See http://dev.gentoo.org/~genone/scripts/metalib.py and http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible, but very inefficient search tool (might not work anymore due to old age). Ideally repository searches could be done without writing any application code using some kind of query language, similar to how SQL works for generic database searches (obviously not that complex). But
Re: [gentoo-portage-dev] search functionality in emerge
On Sun, 23 Nov 2008 21:01:40 -0800 (PST) devsk [EMAIL PROTECTED] wrote: not relying on custom system daemonsrunning in the background. Why is a portage daemon such a bad thing? Or hard to do? I would very much like a daemon running on my system which I can configure to sync the portage tree once a week (or month if I am lazy), give me a summary of hot fixes, security fixes in a nice email, push important announcements and of course, sync caches on detecting changes (which should be trivial with notify daemons all over the place) etc. Why is it such a bad thing? Well, as an opt-in solution it might work (though most of what you described is IMO just stuff for cron, no need to reinvent the wheel). What I was saying is that _relying_ on custom system daemons/filesystems for a _core subsystem_ of portage is the wrong way, simply because it adds a substantial amount of complexity to the whole package management architecture. It's one more thing that can (and will) break, one more layer to take into account for any design decisions, one more component that has to be secured, one more obstacle to overcome when you want to analyze/debug things. And special care must be taken if it requires special kernel support and/or external packages. Do you want to make inotify support mandatory to use portage efficiently? (btw, looks like inotify doesn't really work with NFS mounts, which would already make such a daemon completely useless for people using a NFS-shared repository) And finally, if you look at the use cases, a daemon is simply overkill for most cases, as the vast majority of people only use emerge --sync (or wrappers) and maybe layman to change the tree, usually once per day or less often. Do you really want to push another system daemon on users that isn't of use to them? Its crazy to think that security updates need to be pulled in Linux. That's IMO better be handled via an applet (bug #190397 has some code), or just check for updates after a sync (as syncing is the only way for updates to become available at this time). Maybe a message could be added after sync if there are pending GLSAs, now that the glsa support code is in portage. Marius
[gentoo-portage-dev] unsubscribe
AVVISO DI RISERVATEZZA Informazioni riservate possono essere contenute nel messaggio o nei suoi allegati. Se non siete i destinatari indicati nel messaggio, o responsabili per la sua consegna alla persona, o se avete ricevuto il messaggio per errore, siete pregati di non trascriverlo, copiarlo o inviarlo a nessuno. In tal caso vi invitiamo a cancellare il messaggio ed i suoi allegati. Grazie. CONFIDENTIALITY NOTICE Confidential information may be contained in this message or in its attachments. If you are not the addressee indicated in this message, or responsible for message delivering to that person, or if you have received this message in error, you may not transcribe, copy or deliver this message to anyone. In that case, you should delete this message and its attachments. Thank you.