On Tue, Feb 20, 2024 at 10:06 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:

> Thank you Mike, I really like all the facets!

Me too lol. It was one of the big motivators for me to build this out. GitHub's search didn't have all the facet drill-downs/up/sideways I wanted. Some of them are super useful like "which PRs have review requested for me <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=status%3AOpen&dd=requested_reviewers%3Amikemccand>" or "where am I mentioned <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=status%3AOpen&dd=mentioned_users%3Amikemccand>".

Also, GitHub's filter choices do not seem to be dynamically generated for the current query -- so you can pick a filter value and it brings you to 0 hits, violating the "no dead end" promise of Lucene's facets.

I was also disappointed with GitHub search's lack of hit highlighting, which solves the "final inch" problem (show me specifically where, in this massive massive list of comments on a PR/issue, my search terms appear). It also does not show me the individual comment or code review comment (there can be multiple of those on a PR) where my search terms appear, does not link directly to that comment, etc. Githubsearch uses Lucene's block joins to achieve this.

GitHub's search doesn't offer a blended relevance+recency sort, which I think makes a great default.

It looks like it does support phrase search (with double-quotes); I'm curious how that works with ngrams.

I do like that the text query language includes all of the sort/filter criteria -- the "is:open" and "sort:comments-desc". Githubsearch doesn't support that through the text query language, just the facets UI / REST query URL.

Anyway, I don't want to complain (too much) about GitHub's search efforts. Search is clearly hard, and we all (Lucene experts) have a fairly biased/opinionated take on it all, heh. I've never met a search engine that I'm fully happy with ;)

> One thing that bothered me about GitHub's own search was that it would
> return different results if I wasn't signed in. Maybe it does early
> stopping for non-authenticated users? In any case, this won't be a
> problem with githubsearch.

Oh, that is very interesting -- I didn't know that. Wow, I just tested -- indeed, you cannot even search the source code (for Lucene's repo anyways) if you are not signed in. That's weird. For issues/PRs searching, the three queries I tried seem to produce the same results signed in or out. But it is scary/dangerous if this can differ!!

> Have you considered indexing the Lucene source code too?

Oh my, I have not (until now lol). That's a great idea. Source code tokenization would be such a fun problem ... I wonder if GitHub open-sources how they tokenize the many different languages' source code.

GitHub's code search is written in Rust (not using Lucene nor Rucene), a custom search engine they recently built / switched to: https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search, away from Elasticsearch previously I think. It looks like they use ngrams, maybe instead of language-specific tokenization (?), to do the initial matching/retrieval. I would try normal lexical tokenization to see if highlighting could work well.

I opened this luceneserver/GitHubSearch issue <https://github.com/mikemccand/luceneserver/issues/27> to think about this ... it'd sure be fun to build and use :)

Thank you for the suggestion Stefan!

Mike McCandless

http://blog.mikemccandless.com
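
P.S. On the "how does phrase search work with ngrams" question above, here is a minimal Python sketch of one common approach: intersect trigram postings to get candidate documents, then verify each candidate with an exact substring check. This is only an illustration under that assumption; it is not GitHub's implementation and not what githubsearch does, and the function names (`trigrams`, `build_index`, `phrase_search`) are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of overlapping 3-character substrings, lowercased."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def build_index(docs):
    """Map each trigram to the set of doc ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def phrase_search(index, docs, phrase):
    """Retrieve candidates via trigram intersection, then verify the phrase."""
    grams = trigrams(phrase)
    if not grams:
        # Phrase shorter than 3 characters: no trigrams, so scan every doc.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    needle = phrase.lower()
    # The trigram match is only a filter; the substring check confirms the hit.
    return sorted(d for d in candidates if needle in docs[d].lower())


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Use block joins to highlight the matching comment",
    2: "Joins block the highlight of comments",
    3: "Facets avoid dead ends",
}
index = build_index(docs)
print(phrase_search(index, docs, "block joins"))
```

Because trigrams span the space between words ("k j" in "block joins"), word order is already partly encoded in the index, and the final substring check removes any remaining false positives.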