Re: Announcing githubsearch!

Michael McCandless Wed, 21 Feb 2024 04:08:52 -0800

On Tue, Feb 20, 2024 at 10:06 AM Stefan Vodita <stefan.vod...@gmail.com>
wrote:


> Thank you Mike, I really like all the facets!
>

Me too lol.  It was one of the big motivators for me to build this out.
GitHub's search didn't have all the facet drill-downs/up/sideways I
wanted.  Some of them are super useful like "which PRs have review
requested for me
<https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=status%3AOpen&dd=requested_reviewers%3Amikemccand>"
or "where am I mentioned
<https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=status%3AOpen&dd=mentioned_users%3Amikemccand>".
Also, GitHub's filter choices do not seem to be dynamically generated for
this query -- so you can pick a filter value and it brings you to 0 hits,
violating the "no dead end" promise of Lucene's facets.
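
To illustrate, the two drill-down links above differ only in their query
parameters. Here is a minimal sketch of building such a URL, assuming nothing
beyond what those links show (a `sort` parameter plus one repeated
`dd=field:value` parameter per facet filter). The helper name `drilldown_url`
is hypothetical and is not part of githubsearch:

```python
from urllib.parse import urlencode

# Base endpoint taken from the links above.
BASE = "https://githubsearch.mikemccandless.com/search.py"


def drilldown_url(filters, sort="recentlyUpdated"):
    """Build a search URL with one dd=field:value parameter per facet filter.

    `filters` is a list of (field, value) pairs; the parameter names are
    inferred from the example links, not from any documented API.
    """
    params = [("sort", sort)]
    params += [("dd", f"{field}:{value}") for field, value in filters]
    # urlencode percent-encodes the ":" as %3A, as in the links above.
    return f"{BASE}?{urlencode(params)}"


print(drilldown_url([("status", "Open"), ("requested_reviewers", "mikemccand")]))
```

Each extra drill-down is just one more `dd` pair, which is why combining
facets in the UI keeps working through a plain REST URL.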

I was also disappointed with GitHub search's lack of hit highlighting to
solve the "final inch" problem (show me specifically where, in this
massive massive list of comments on a PR/issue, my search terms appear).
It also does not show the individual comment or code review comment
(a PR can have many of those) where my search terms appear, nor link
directly to that comment.  Githubsearch uses Lucene's block joins to
achieve this.

GitHub's search doesn't offer a blended relevance+recency sort, which I
think makes a great default.  It looks like it does support phrase search
(with double-quotes), curious how that works with ngrams.

I do like that the text query language includes all of the sort/filter
criteria -- the "is:open" and "sort:comments-desc".  Githubsearch doesn't
support that through the text query language, just the facets UI / REST
query URL.

Anyway, I don't want to complain (too much) about GitHub's search efforts.
Search is clearly hard, and we all (Lucene experts) have a fairly
biased/opinionated take on it all, heh.  I've never met a search engine
that I'm fully happy with ;)

> One thing that bothered me about GitHub's own search was that it would
> return different results if I wasn't signed in. Maybe it does early
> stopping for non-authenticated users? In any case, this won't be a problem
> with githubsearch.
>

Oh, that is very interesting -- I didn't know that.

Wow, I just tested -- indeed, you cannot even search the source code (for
Lucene's repo anyways) if you are not signed in.  That's weird.

For issues/PRs searching, the three queries I tried seem to produce the
same results signed in or out.  But it is scary/dangerous if this can
differ!!


> Have you considered indexing the Lucene source code too?
>

Oh my, I have not (until now lol).  That's a great idea.  Source code
tokenization would be such a fun problem ... I wonder if GitHub
open-sources how they tokenize the many different languages' source code.
GitHub's code search is a custom search engine written in Rust (not using
Lucene nor Rucene) that they recently built and switched to, I think away
from Elasticsearch:
https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search
It looks like they use ngrams,
maybe instead of language-specific tokenization (?), to do the initial
matching/retrieval.  I would try normal lexical tokenization to see if
highlighting could work well.
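
As a rough illustration of the difference, here is a small sketch contrasting
character ngrams with plain lexical tokenization on one line of Java. It is
neither GitHub's implementation nor a Lucene analyzer. The function names and
the identifier regex are assumptions made for the example:

```python
import re


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lexical_tokens(text):
    """Return (token, start, end) for identifiers and numbers.

    The offsets are what a highlighter would need to mark the exact match.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z_]\w*|\d+", text)]


line = "int maxDoc = reader.maxDoc();"

# Trigrams match any substring, but carry no notion of token boundaries.
print(char_ngrams("maxDoc"))

# Lexical tokens keep whole identifiers plus offsets for highlighting.
print(lexical_tokens(line))
```

The ngram side matches any substring but does not know where an identifier
begins or ends, while the lexical side gives whole tokens with offsets, which
is the part that matters for highlighting.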

I opened this luceneserver/GitHubSearch issue
<https://github.com/mikemccand/luceneserver/issues/27> to think about this
... it'd sure be fun to build and use :)  Thank you for the suggestion
Stefan!

Mike McCandless

http://blog.mikemccandless.com
