Re: Inverted index to accelerate guix package search
zimoun writes: > However, the core issue about "search" is not 'find quickly an item' > but 'find the relevant item'. > For example, Bloom filter [1] or B-tree could be interesting data > structure if we are talking about fast retrieval. Note that this part > should even be delegated to SQLite, IMHO. > > [1] https://en.wikipedia.org/wiki/Bloom_filter > [2] https://en.wikipedia.org/wiki/B-tree I just read about Bloom filters. They are fascinating indeed! > Well, the issue is 'scoring' a query. I think the issue of whether to use an inverted index is orthogonal to the quest to improve the relevance of the search results. Implementing tf-idf like you suggested could greatly benefit from having fast searches. > Last, I have not read the 2 links you pointed but I think you have a > bug in your code. The retrieval "game" using the index returns the > package "r-mgcv" which does not contain the term "game" in its > description. Well, the query "game" using the brute force does not > return this very package. Indeed, I found the bug and fixed it. Please find attached the updated code. Also, the inverted index hash table stores package objects instead of package names and all comparisons are now done with eq?. This gets us an even higher speedup of around 300x. Built index in 8.556596040725708 seconds Brute force search took 0.4107058048248291 seconds Inverted index search took 0.0012979507446289062 seconds Bengt Richter writes: > If you want to instrument particular calls, as you know, you can probably > get sub-microsecond resolution (try the following snippet on your platform > to see how long it takes to get a gettimeofday value -- even on this > Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10 > calls per microsecond). gettimeofday is indeed a good idea. I have used it in the latest version of my proof of concept script. Like you said, I also noticed some peculiar outliers which probably got averaged out when I was running the search 1000 times. For the purpose of the discussion above, I have only reported the most common time. Pierre Neidhardt writes: > By the way, what about using Xapian in Guix? > > https://en.wikipedia.org/wiki/Xapian > > If it's relevant, maybe we can follow up with a discussion in a new > thread. I feel xapian is too much work (considering that we don't yet have guile bindings) compared to our own simple implementation of an inverted index. But, of course, I am biased since I wrote the inverted index code! :-) But, on a more serious note, if we move to xapian, we will not be able to support regular expression based search queries that we support today. On the other hand, I can extend the inverted index implementation to support regular expression searches. Personally, I don't use regular expression based search queries, and don't think they are very useful especially if we make use of xapian's stemming. What do people think? On the question of whether xapian is too heavy, I think we should make it an optional dependency of Guix so that it remains possible to build and use a more minimalistic Guix for those interested in such things. (use-modules (gnu packages) (guix packages) (ice-9 match) (ice-9 time) (srfi srfi-1) (srfi srfi-26)) ;;; Implementation of sets using hash tables (define make-set make-hash-table) (define set-element-of? hashq-ref) (define set-add! (cut hashq-set! <> <> #t)) (define (set-fold proc init set) (hash-fold (lambda (element _ result) (proc element result)) init set)) (define (set-intersection set1 set2) "Return the intersection of SET1 with SET2." (let ((result (make-set))) (hash-for-each (lambda (element _) (when (set-element-of? set2 element) (set-add! result element))) set1) result)) (define (trigrams str) "Return all trigrams in STR." (if (< (string-length str) 3) '() (map (lambda (start) (substring str start (+ start 3))) (iota (- (string-length str) 2) ;; Inverted index (define index (make-hash-table)) (define (build-index!) "Return an inverted index of all trigrams in the descriptions of all packages." (fold-packages (lambda (package _) (for-each (lambda (trigram) (let ((set (hash-ref index trigram (make-set (set-add! set package) (hash-set! index trigram set))) (trigrams (package-description package #f) index) (define (match-package-fold-proc query package result) (if (string-contains (package-description package) query) (cons package result) result)) (define (brute-force-retrieve query) "Return names of all packages whose descriptions contain the string QUERY. Search brute force by folding over all packag
Re: Documenting Yubikey setup
Hi Martin, Martin Becze writes: > For a user to access a Yubikey an udev rule needs to be added. This can > be done by using the udev rules found in libu2f-host package. To use the > rules the following should be added to your config.scm > > (use-modules (gnu packages security-token)) > > ... > > (define %modified-desktop-services > (modify-services %minimal-desktop-services > (udev-service-type > config => > (udev-configuration (inherit config) > (rules (cons libu2f-host > (udev-configuration-rules > config))) > > > > Reference > https://guix.gnu.org/manual/en/html_node/Base-Services.html > > --- > > Of course there is more to say about yubikey but this is all I needed > for now. Did you write something for the cookbook? I've also set up a YubiKey using Guix - I actually packaged some of the things like libu2f-host - and would be happy to review / add to whatever you've written. -- Chris signature.asc Description: PGP signature
Re: Inverted index to accelerate guix package search
On Tue, 14 Jan 2020 at 22:41, Pierre Neidhardt wrote: > >> > By the way, what about using Xapian in Guix? > > > > Xapian could be really cool! > > However, the size of this dependency should be evaluated; then > > optional feature or not. > > --8<---cut here---start->8--- > $ guix size guix > ... > total: 390.5 MiB > > $ guix size guix xapian > ... > total: 414.0 MiB > --8<---cut here---end--->8--- > > Some 24 MiB. Actually Xapien comes with about 6 MiB of doc, so we could > move that out to a separate output. > > Conclusion: Xapian is very light. > The only input that Xapian drags that's not in Guix is util-linux. Cool! Already nice! This size is the runtime size, right? What about the "build time" size? Other said, all one needs to build to have xapian, i.e., the size of the full bag: starting from the bootstrap seed and having Guix+Xapian (vs. Guix alone). Cheers, simon
Re: Inverted index to accelerate guix package search
Hi, On Tue, 14 Jan 2020 at 16:27, Giovanni Biscuolo wrote: > Pierre Neidhardt writes: > > By the way, what about using Xapian in Guix? Xapian could be really cool! However, the size of this dependency should be evaluated; then optional feature or not. > Actually I would love to search (tag?!?) for packages/services/machines > (profiles?) the same way I query my email database with notmuch: it > would give me a fourth dimension in "DevOps" :-) Currently, you can output all the packages with 'guix search ""' and index them using the Python API. It is not so nice but it should give an idea about what to expose first to Guile. > I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some > Emacs Lisp code to interact with Xapian There is some work. IMHO. If Guix goes to Xapian, all the Git history of all the packages should be indexed. Today, it is really hard to find the right commit corresponding to the right version -- the only solution I have is "git log"+grep; not really friendly. Some discussion spoke about using an external index, fetched for example on the Guix Data Service. But I personally do not like to rely on external service accessible through the network. However, note that index all the packages of all the Guix history using guile-git+fold-packages is not straightforward: resource consuming and piece of well-thought data structures. Well, now "guix time-machine" is here, Guix is lacking tools to browse the history of all the packages. Cheers, simon
Re: Proposal for a blog contribution on reproducible computations
Konrad Hinsen skribis: >> Perfect, thanks! I had to move files where Haunt expects them, and to >> make the table pure ASCII because guile-commonmark doesn’t support HTML >> nor tables, but here we are: >> >> https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/ >> https://hpc.guix.info/blog/2020/01/reproducible-computations-with-guix/ > > Looks good, thanks! > > One problem though: the link to the script is broken: > > > https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/show-dependencies.scm > > yield 404. Oops, sorry about that! Should be fixed now. > I had put it into the same directory as the Markdown file and the > images, apparently Haunt didn't like that either. Yeah, it has to be in website/static. Thanks, Ludo’.
Re: Proposal for a blog contribution on reproducible computations
Hi Ludo, > Perfect, thanks! I had to move files where Haunt expects them, and to > make the table pure ASCII because guile-commonmark doesn’t support HTML > nor tables, but here we are: > > https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/ > https://hpc.guix.info/blog/2020/01/reproducible-computations-with-guix/ Looks good, thanks! One problem though: the link to the script is broken: https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/show-dependencies.scm yield 404. I had put it into the same directory as the Markdown file and the images, apparently Haunt didn't like that either. Cheers, Konrad.
Re: Proposal for a blog contribution on reproducible computations
Hello! Konrad Hinsen skribis: >> You can post a patch against the guix-artwork.git repo here when you’re >> ready. > > Here it comes ! Perfect, thanks! I had to move files where Haunt expects them, and to make the table pure ASCII because guile-commonmark doesn’t support HTML nor tables, but here we are: https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/ https://hpc.guix.info/blog/2020/01/reproducible-computations-with-guix/ Spread the word! Thank you, Konrad. Ludo’.
Re: Inverted index to accelerate guix package search
Pierre Neidhardt writes: > By the way, what about using Xapian in Guix? > > https://en.wikipedia.org/wiki/Xapian > > If it's relevant, maybe we can follow up with a discussion in a new > thread. IMHO it's relevant (you decide if it's worth a new thread) Actually I would love to search (tag?!?) for packages/services/machines (profiles?) the same way I query my email database with notmuch: it would give me a fourth dimension in "DevOps" :-) IMHO this is not a priority, but probably should go in the roadmap as a wishlist I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some Emacs Lisp code to interact with Xapian WDYT? Thanks, Gio' [1] https://xapian.org/docs/bindings/ [2] https://tero.hasu.is/notdeft/ -- Giovanni Biscuolo Xelera IT Infrastructures signature.asc Description: PGP signature
Re: Parameterized packages
Hi Pierre, I do not not if it related. Currently, the option "--with-input" does not re-write the implicit inputs. Probably for good reasons. :-) For example, say I would like to have the version 3.4.3 of R (2017 old). This version is really old and had disappeared. Other said, it is older than the introduction of "inferior"s and so "guix time-machine" does not help (but that another story). At this point, it is not matter the GCC version that it is used to compile this R version. Well, it is not too complicated: git show cbe1314a7e:gnu/packages/statistics.scm > /tmp/stuff/old-stats.scm # edit old-stats.scm to tweak the modules guix build -L /tmp r-minimal@3.4.3 Now, I would like to build the package r-rcpp at the version 1.0.3 which is introduced recently using the old R version 3.4.3. Well, the naive idea is: guix build -L /tmp --with-inputs=r-minimal=r-minimal@3.4.3 r-rcpp but because r-minimal is an implicit inputs, it is not re-written. This behaviour is expected because the bag is really huge. To be convinced, try: "guix graph -t bag r-rcpp | dot -Tps > r-rcpp-bag.ps"... and wait... :-) >From a practical point of view, re-write any implicit inputs is touchy. And I do not know if the "parametrized package" is related or not at all -- I do not see real difference between re-compiling with this flags and re-compiling with this version -- but be able to recompile a matrix of combinaison would be really cool! All the best, simon
Re: Inverted index to accelerate guix package search
By the way, what about using Xapian in Guix? https://en.wikipedia.org/wiki/Xapian If it's relevant, maybe we can follow up with a discussion in a new thread. -- Pierre Neidhardt https://ambrevar.xyz/ signature.asc Description: PGP signature
Guix at eRum 2020 conference on May 27-30 in Milan?
This year the european R users meeting (eRum https://erum.io/) will be held on 27-30 May in Milan (UNI Bicocca and Politecnico) Call for speakers: - Closes at 11:59 PM 29 Jan 2020 - Notification of acceptance: February 26, 2020 Travel grants application period is expired, sorry! (Application period: December 9, 2019 – January 1, 2020) This will be the 3rd edition and I really hope someone from the Guix/GWL community can come here to talk about our views on reproducible research, since they *seems* to always use Docker to "solve" reproducibility issues :-S Looking at past editions talks [1]: 1. https://github.com/eRum2016/Presentations-participants https://github.com/eRum2016/Book-of-abstracts/raw/master/output/boa.pdf 2. http://2018.erum.io/#schedule I suspect there are too much misunderstandings about how to solve reproducibility issues in the R community [2]: please prove me wrong! :-) I really would like that "reproducibility" will not become the next buzzword used just for "marketing" purposes, at least non in the research community... at least non in R! Last but not least: registration fee http://2018.erum.io/#registration of the 2018 edition for people like me (Industry) was about 275 EUR... :-S Thanks! Gio' [1] I've made a quick search of "reproducible", "Guix" and "Docker" [2] e.g. this is a correct analisys of the problem giving the wrong solution https://www.youtube.com/watch?v=jNtoRLnIr8E&t=0s&list=PLUBl0DoLa5SAo_XRnkQA5GtEORg9K7kMh&index=61 P.S.: I'm definitely not qualified to apply :-) -- Giovanni Biscuolo Xelera IT Infrastructures signature.asc Description: PGP signature
Re: Package request: OpenRA
Thanks for sharing! Besides Nix seems to have a working package: https://github.com/NixOS/nixpkgs/ -- Pierre Neidhardt https://ambrevar.xyz/ signature.asc Description: PGP signature
Re: Package request: OpenRA
There aren't issuess with the DB (unless it changed), we discussed creating a package in context of tor some time in 2016 - 2018 I think.
Re: Testing the installer
Hello Ludo, That sounds like a nice idea! I guess we can discuss it more deeply during Guix Days, but here are a few thoughts: * Testing from a VM won't cover HW issues with undetected wifi networks and kmscon rendering issues, but could cover most partitioning issues. * This kind of client/server protocol would make even easier to switch to a gtk graphical installer. * Before the next release, it would be nice to fix the "installer always fail when restarted" issue than has been bitten lots of people. Thanks, Mathieu Ludovic Courtès writes: > Hello Guix! > > One of the lessons from the 1.0.0 screw-up was that we should test the > graphical installer itself: > > https://guix.gnu.org/blog/2019/gnu-guix-1.0.1-released/ > > I think we should try to do that before the next release; not doing it > means testing by hand, which also takes a lot of time. > > One idea that I had was that we could change the installer so that it > listens for connections on some pre-defined Unix-domain socket. When it > gets a connection, it would perform a dialog with its client: sent it a > summary of the current form, wait for its choice, and so on. The forms > would thus be unavailable from the keyboard: essentially ‘run-form’ > calls would be replaced by ‘draw-form’ calls in that mode. > > Our test infrastructure would thus connect (from a marionette) to the > installer and participate in that dialog. It could even take a QEMU > screenshot at each step. > > How does that sound? Am I overlooking things? > > I guess the difficulty will be to represent questions and answers. > > Ludo’.