Re: Inverted index to accelerate guix package search

2020-01-14 Thread Arun Isaac

zimoun  writes:

> However, the core issue about "search" is not 'find quickly an item'
> but 'find the relevant item'.
> For example, Bloom filter [1] or B-tree could be interesting data
> structure if we are talking about fast retrieval. Note that this part
> should even be delegated to SQLite, IMHO.
>
> [1] https://en.wikipedia.org/wiki/Bloom_filter
> [2] https://en.wikipedia.org/wiki/B-tree

I just read about Bloom filters. They are fascinating indeed!

> Well, the issue is 'scoring' a query.

I think the issue of whether to use an inverted index is orthogonal to
the quest to improve the relevance of the search results. Implementing
tf-idf like you suggested could greatly benefit from having fast
searches.

> Last, I have not read the 2 links you pointed but I think you have a
> bug in your code. The retrieval "game" using the index returns the
> package "r-mgcv" which does not contain the term "game" in its
> description. Well, the query "game" using the brute force does not
> return this very package.

Indeed, I found the bug and fixed it. Please find attached the updated
code. Also, the inverted index hash table stores package objects instead
of package names and all comparisons are now done with eq?. This gets us
an even higher speedup of around 300x.

Built index in 8.556596040725708 seconds
Brute force search took 0.4107058048248291 seconds
Inverted index search took 0.0012979507446289062 seconds

Bengt Richter  writes:

> If you want to instrument particular calls, as you know, you can probably
> get sub-microsecond resolution (try the following snippet on your platform
> to see how long it takes to get a gettimeofday value -- even on this
> Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10
> calls per microsecond).

gettimeofday is indeed a good idea. I have used it in the latest version
of my proof of concept script. Like you said, I also noticed some
peculiar outliers which probably got averaged out when I was running the
search 1000 times. For the purpose of the discussion above, I have only
reported the most common time.

Pierre Neidhardt  writes:

> By the way, what about using Xapian in Guix?
>
> https://en.wikipedia.org/wiki/Xapian
>
> If it's relevant, maybe we can follow up with a discussion in a new
> thread.

I feel xapian is too much work (considering that we don't yet have guile
bindings) compared to our own simple implementation of an inverted
index. But, of course, I am biased since I wrote the inverted index
code! :-)

But, on a more serious note, if we move to xapian, we will not be able
to support regular expression based search queries that we support
today. On the other hand, I can extend the inverted index implementation
to support regular expression searches. Personally, I don't use regular
expression based search queries, and don't think they are very useful
especially if we make use of xapian's stemming. What do people think?

On the question of whether xapian is too heavy, I think we should make
it an optional dependency of Guix so that it remains possible to build
and use a more minimalistic Guix for those interested in such things.

(use-modules (gnu packages)
 (guix packages)
 (ice-9 match)
 (ice-9 time)
 (srfi srfi-1)
 (srfi srfi-26))

;;; Implementation of sets using hash tables

(define make-set make-hash-table)
(define set-element-of? hashq-ref)
(define set-add! (cut hashq-set! <> <> #t))
(define (set-fold proc init set)
  (hash-fold (lambda (element _ result)
   (proc element result))
 init set))

(define (set-intersection set1 set2)
  "Return the intersection of SET1 with SET2."
  (let ((result (make-set)))
(hash-for-each (lambda (element _)
 (when (set-element-of? set2 element)
   (set-add! result element)))
   set1)
result))

(define (trigrams str)
  "Return all trigrams in STR."
  (if (< (string-length str) 3)
  '()
  (map (lambda (start)
 (substring str start (+ start 3)))
   (iota (- (string-length str) 2)

;; Inverted index
(define index (make-hash-table))

(define (build-index!)
  "Return an inverted index of all trigrams in the descriptions of all
packages."
  (fold-packages (lambda (package _)
   (for-each (lambda (trigram)
   (let ((set (hash-ref index trigram (make-set
 (set-add! set package)
 (hash-set! index trigram set)))
 (trigrams (package-description package
 #f)
  index)

(define (match-package-fold-proc query package result)
  (if (string-contains (package-description package) query)
  (cons package result)
  result))

(define (brute-force-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search brute force by folding over all packag

Re: Documenting Yubikey setup

2020-01-14 Thread Chris Marusich
Hi Martin,

Martin Becze  writes:

> For a user to access a Yubikey an udev rule needs to be added. This can
> be done by using the udev rules found in libu2f-host package. To use the
> rules the following should be added to your config.scm
>
> (use-modules (gnu packages security-token))
>
> ...
>
> (define %modified-desktop-services
>   (modify-services %minimal-desktop-services
> (udev-service-type
>  config =>
>  (udev-configuration (inherit config)
>  (rules (cons libu2f-host
>   (udev-configuration-rules
> config)))
>
>
>
> Reference 
> https://guix.gnu.org/manual/en/html_node/Base-Services.html
>
> ---
>
> Of course there is more to say about yubikey but this is all I needed
> for now. 

Did you write something for the cookbook?  I've also set up a YubiKey
using Guix - I actually packaged some of the things like libu2f-host -
and would be happy to review / add to whatever you've written.

-- 
Chris


signature.asc
Description: PGP signature


Re: Inverted index to accelerate guix package search

2020-01-14 Thread zimoun
On Tue, 14 Jan 2020 at 22:41, Pierre Neidhardt  wrote:

> >> > By the way, what about using Xapian in Guix?
> >
> > Xapian could be really cool!
> > However, the size of this dependency should be evaluated; then
> > optional feature or not.
>
> --8<---cut here---start->8---
> $ guix size guix
> ...
> total: 390.5 MiB
>
> $ guix size guix xapian
> ...
> total: 414.0 MiB
> --8<---cut here---end--->8---
>
> Some 24 MiB.  Actually Xapien comes with about 6 MiB of doc, so we could
> move that out to a separate output.
>
> Conclusion: Xapian is very light.
> The only input that Xapian drags that's not in Guix is util-linux.

Cool!
Already nice!

This size is the runtime size, right?
What about the "build time" size? Other said, all one needs to build
to have xapian, i.e., the size of the full bag: starting from the
bootstrap seed and having Guix+Xapian (vs. Guix alone).


Cheers,
simon



Re: Inverted index to accelerate guix package search

2020-01-14 Thread zimoun
Hi,


On Tue, 14 Jan 2020 at 16:27, Giovanni Biscuolo  wrote:

> Pierre Neidhardt  writes:

> > By the way, what about using Xapian in Guix?

Xapian could be really cool!
However, the size of this dependency should be evaluated; then
optional feature or not.


> Actually I would love to search (tag?!?) for packages/services/machines
> (profiles?) the same way I query my email database with notmuch: it
> would give me a fourth dimension in "DevOps" :-)

Currently, you can output all the packages with 'guix search ""' and
index them using the Python API. It is not so nice but it should give
an idea about what to expose first to Guile.


> I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some
> Emacs Lisp code to interact with Xapian

There is some work. IMHO.

If Guix goes to Xapian, all the Git history of all the packages should
be indexed. Today, it is really hard to find the right commit
corresponding to the right version -- the only solution I have is "git
log"+grep; not really friendly.

Some discussion spoke about using an external index, fetched for
example on the Guix Data Service. But I personally do not like to rely
on external service accessible through the network. However, note that
index all the packages of all the Guix history using
guile-git+fold-packages is not straightforward: resource consuming and
piece of well-thought data structures.

Well, now "guix time-machine" is here, Guix is lacking tools to browse
the history of all the packages.


Cheers,
simon



Re: Proposal for a blog contribution on reproducible computations

2020-01-14 Thread Ludovic Courtès
Konrad Hinsen  skribis:

>> Perfect, thanks!  I had to move files where Haunt expects them, and to
>> make the table pure ASCII because guile-commonmark doesn’t support HTML
>> nor tables, but here we are:
>>
>>   https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/
>>   https://hpc.guix.info/blog/2020/01/reproducible-computations-with-guix/
>
> Looks good, thanks!
>
> One problem though: the link to the script is broken:
>
>   
> https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/show-dependencies.scm
>
> yield 404.

Oops, sorry about that!  Should be fixed now.

> I had put it into the same directory as the Markdown file and the
> images, apparently Haunt didn't like that either.

Yeah, it has to be in website/static.

Thanks,
Ludo’.



Re: Proposal for a blog contribution on reproducible computations

2020-01-14 Thread Konrad Hinsen
Hi Ludo,

> Perfect, thanks!  I had to move files where Haunt expects them, and to
> make the table pure ASCII because guile-commonmark doesn’t support HTML
> nor tables, but here we are:
>
>   https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/
>   https://hpc.guix.info/blog/2020/01/reproducible-computations-with-guix/

Looks good, thanks!

One problem though: the link to the script is broken:

  
https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/show-dependencies.scm

yield 404.

I had put it into the same directory as the Markdown file and the
images, apparently Haunt didn't like that either.

Cheers,
  Konrad.





Re: Proposal for a blog contribution on reproducible computations

2020-01-14 Thread Ludovic Courtès
Hello!

Konrad Hinsen  skribis:

>> You can post a patch against the guix-artwork.git repo here when you’re
>> ready.
>
> Here it comes !

Perfect, thanks!  I had to move files where Haunt expects them, and to
make the table pure ASCII because guile-commonmark doesn’t support HTML
nor tables, but here we are:

  https://guix.gnu.org/blog/2020/reproducible-computations-with-guix/
  https://hpc.guix.info/blog/2020/01/reproducible-computations-with-guix/

Spread the word!

Thank you, Konrad.

Ludo’.



Re: Inverted index to accelerate guix package search

2020-01-14 Thread Giovanni Biscuolo
Pierre Neidhardt  writes:

> By the way, what about using Xapian in Guix?
>
> https://en.wikipedia.org/wiki/Xapian
>
> If it's relevant, maybe we can follow up with a discussion in a new
> thread.

IMHO it's relevant (you decide if it's worth a new thread)

Actually I would love to search (tag?!?) for packages/services/machines
(profiles?) the same way I query my email database with notmuch: it
would give me a fourth dimension in "DevOps" :-)

IMHO this is not a priority, but probably should go in the roadmap as a
wishlist

I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some
Emacs Lisp code to interact with Xapian

WDYT?

Thanks, Gio'


[1] https://xapian.org/docs/bindings/

[2] https://tero.hasu.is/notdeft/

-- 
Giovanni Biscuolo

Xelera IT Infrastructures


signature.asc
Description: PGP signature


Re: Parameterized packages

2020-01-14 Thread zimoun
Hi Pierre,

I do not not if it related.

Currently, the option "--with-input" does not re-write the implicit
inputs. Probably for good reasons. :-)

For example, say I would like to have the version 3.4.3 of R (2017
old). This version is really old and had disappeared. Other said, it
is older than the introduction of "inferior"s and so "guix
time-machine" does not help (but that another story).
At this point, it is not matter the GCC version that it is used to
compile this R version.

Well, it is not too complicated:

  git show cbe1314a7e:gnu/packages/statistics.scm > /tmp/stuff/old-stats.scm
  # edit old-stats.scm to tweak the modules
  guix build -L /tmp r-minimal@3.4.3

Now, I would like to build the package r-rcpp at the version 1.0.3
which is introduced recently using the old R version 3.4.3. Well, the
naive idea is:

   guix build -L /tmp --with-inputs=r-minimal=r-minimal@3.4.3 r-rcpp

but because r-minimal is an implicit inputs, it is not re-written.

This behaviour is expected because the bag is really huge. To be
convinced, try: "guix graph -t bag r-rcpp | dot -Tps >
r-rcpp-bag.ps"... and wait... :-)


>From a practical point of view, re-write any implicit inputs is
touchy.  And I do not know if the "parametrized package" is related or
not at all -- I do not see real difference between re-compiling with
this flags and re-compiling with this version -- but be able to
recompile a matrix of combinaison would be really cool!


All the best,
simon



Re: Inverted index to accelerate guix package search

2020-01-14 Thread Pierre Neidhardt
By the way, what about using Xapian in Guix?

https://en.wikipedia.org/wiki/Xapian

If it's relevant, maybe we can follow up with a discussion in a new thread.

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Guix at eRum 2020 conference on May 27-30 in Milan?

2020-01-14 Thread Giovanni Biscuolo
This year the european R users meeting (eRum https://erum.io/) will be
held on 27-30 May in Milan (UNI Bicocca and Politecnico)

Call for speakers:

- Closes at 11:59 PM 29 Jan 2020
- Notification of acceptance: February 26, 2020

Travel grants application period is expired, sorry! (Application period:
December 9, 2019 – January 1, 2020)

This will be the 3rd edition and I really hope someone from the Guix/GWL
community can come here to talk about our views on reproducible
research, since they *seems* to always use Docker to "solve"
reproducibility issues :-S

Looking at past editions talks [1]:

1. https://github.com/eRum2016/Presentations-participants
   https://github.com/eRum2016/Book-of-abstracts/raw/master/output/boa.pdf

2. http://2018.erum.io/#schedule

I suspect there are too much misunderstandings about how to solve
reproducibility issues in the R community [2]: please prove me wrong!
:-)

I really would like that "reproducibility" will not become the next
buzzword used just for "marketing" purposes, at least non in the
research community... at least non in R!

Last but not least: registration fee http://2018.erum.io/#registration
of the 2018 edition for people like me (Industry) was about 275
EUR... :-S

Thanks! Gio'

[1] I've made a quick search of "reproducible", "Guix" and "Docker"

[2] e.g. this is a correct analisys of the problem giving the wrong
solution 
https://www.youtube.com/watch?v=jNtoRLnIr8E&t=0s&list=PLUBl0DoLa5SAo_XRnkQA5GtEORg9K7kMh&index=61


P.S.: I'm definitely not qualified to apply :-)


-- 
Giovanni Biscuolo

Xelera IT Infrastructures


signature.asc
Description: PGP signature


Re: Package request: OpenRA

2020-01-14 Thread Pierre Neidhardt
Thanks for sharing!

Besides Nix seems to have a working package:

https://github.com/NixOS/nixpkgs/

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: Package request: OpenRA

2020-01-14 Thread ng0
There aren't issuess with the DB (unless it changed), we discussed creating
a package in context of tor some time in 2016 - 2018 I think.



Re: Testing the installer

2020-01-14 Thread Mathieu Othacehe


Hello Ludo,

That sounds like a nice idea! I guess we can discuss it more deeply
during Guix Days, but here are a few thoughts:

* Testing from a VM won't cover HW issues with undetected wifi networks
and kmscon rendering issues, but could cover most partitioning issues.

* This kind of client/server protocol would make even easier to
switch to a gtk graphical installer.

* Before the next release, it would be nice to fix the "installer always
fail when restarted" issue than has been bitten lots of people.

Thanks,

Mathieu

Ludovic Courtès writes:

> Hello Guix!
>
> One of the lessons from the 1.0.0 screw-up was that we should test the
> graphical installer itself:
>
>   https://guix.gnu.org/blog/2019/gnu-guix-1.0.1-released/
>
> I think we should try to do that before the next release; not doing it
> means testing by hand, which also takes a lot of time.
>
> One idea that I had was that we could change the installer so that it
> listens for connections on some pre-defined Unix-domain socket.  When it
> gets a connection, it would perform a dialog with its client: sent it a
> summary of the current form, wait for its choice, and so on.  The forms
> would thus be unavailable from the keyboard: essentially ‘run-form’
> calls would be replaced by ‘draw-form’ calls in that mode.
>
> Our test infrastructure would thus connect (from a marionette) to the
> installer and participate in that dialog.  It could even take a QEMU
> screenshot at each step.
>
> How does that sound?  Am I overlooking things?
>
> I guess the difficulty will be to represent questions and answers.
>
> Ludo’.