[sphinx-users] Re: Documentation about the built in search?

Charles Bouchard-Légaré Thu, 07 Apr 2022 21:29:44 -0700

I have also been looking into it lately, and the only thing I found was... 
reading the source code.


I was looking into improving the search widget, to get closer to what is 
available with MkDocs (modal results as you type, etc.; see screenshot 
below). 

*Here is what I found about Sphinx' search:*

   - The JavaScript code for doing the actual search against the index is 
   at sphinx/themes/basic/static/searchtools.js 
   <https://github.com/sphinx-doc/sphinx/blob/5.x/sphinx/themes/basic/static/searchtools.js>
   - The generation of the index is done at build time and the code is in 
   sphinx/search/__init__.py 
   <https://github.com/sphinx-doc/sphinx/blob/5.x/sphinx/search/__init__.py>
   - Sphinx does not only do «full text search»
      - First, it looks into «objects» (like Python functions and such)
         - The displayed result is built from the domain's display name and 
         the object type's localized name
         - No excerpts are provided here (which is kind of sad)
      - Then, it tries full text search
         - An excerpt built from text around the result is displayed
         - I am no expert in full-text search, but it looks both simple and 
         pretty standard, with more priority given to terms from titles.  
         There is a good stemmer for several languages.
   - Sphinx' search is clearly *not* an API: it is not very configurable and 
   does not seem to be part of a public API for users or extension 
   developers to build on.
   - When writing a new Domain, objects are provided by the get_objects 
   method 
   <https://www.sphinx-doc.org/en/master/extdev/domainapi.html?highlight=Domain#sphinx.domains.Domain.get_objects>
   (which must be provided by your implementation)
      - It returns an iterable of «objects», each a 6-tuple
      - The last item, priority, determines how important an object is 
      regarding search
      - The URL built for a search result depends on the first, second and 
      fifth items
         - fullname «Fully qualified name.»
         - dispname «Name to display when searching/linking.»
         - anchor «The anchor name for the object.»
      - In my custom Domain, search-generated URLs don't target the actual 
      documented object. I still need to investigate how the Directive 
      implementation, these three «object tuple» attributes and the search 
      work together. There still seem to be a few Python-specifics in there.
   - As part of WebSupport, Sphinx provides a few utilities to enable 
   server-side search 
   <https://www.sphinx-doc.org/en/master/usage/advanced/websupport/searchadapters.html>.
   Personally, this is not interesting to me at the moment.
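
To make the 6-tuple returned by get_objects more concrete, here is a minimal, Sphinx-free sketch. The field order and the priority values follow the documented Domain.get_objects contract; everything else (RecipeDomainSketch, ObjectEntry, the "recipe" object type, the anchor naming scheme) is hypothetical and only for illustration. In a real extension the class would subclass sphinx.domains.Domain, and the anchor must match the id the directive actually put on its target node.

```python
from typing import Iterator, NamedTuple


class ObjectEntry(NamedTuple):
    """Field order follows the 6-tuple documented for Domain.get_objects."""
    name: str       # fully qualified name
    dispname: str   # name to display when searching/linking
    type: str       # object type, a key of the domain's object_types
    docname: str    # document where the object is described
    anchor: str     # anchor name for the object
    priority: int   # 1 default, 0 important, 2 unimportant, -1 hidden


class RecipeDomainSketch:
    """Hypothetical stand-in for a sphinx.domains.Domain subclass."""

    def __init__(self) -> None:
        # (fullname, docname) pairs that a directive would normally record
        # while the documents are being read.
        self.recipes = [
            ("recipe.pancakes", "breakfast"),
            ("recipe.ramen", "dinner"),
        ]

    def get_objects(self) -> Iterator[ObjectEntry]:
        for fullname, docname in self.recipes:
            dispname = fullname.split(".", 1)[1]
            # Assumption: the directive registered its target under this id.
            anchor = f"recipe-{dispname}"
            yield ObjectEntry(fullname, dispname, "recipe", docname, anchor, 1)


objects = list(RecipeDomainSketch().get_objects())
```

If the anchor yielded here differs from the id on the rendered node, a search result link would land on the right page but not on the object itself, which may be one thing to check when search-generated URLs miss their target.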
   
*For comparison, here is what I found about MkDocs*

   - Unless specific plugins are used, only full text search is done.
   - It uses lunr.js <https://lunrjs.com/>
      - The documents MkDocs registers to lunr.js are, from my understanding:
         - All pages
         - All subsections, recursively
         - Which means some text is added multiple times.  I suspect 
         subsections are prioritized in the results
         - Each item provides two "fields": title and text, somewhat like 
         Sphinx.
      - By default, they used to use lunr.py 
      <https://github.com/yeraydiazdiaz/lunr.py> for pregenerating the 
      index.  This pregeneration is configurable.
         - This is deprecated now because lunr.py has binary transitive 
         dependencies for non-English languages, which makes MkDocs harder 
         to use for Alpine Docker image users.
         - They now offer to run lunr.js in a Node.js subprocess
         - The index can also be generated by Web Workers
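
The "all pages plus all subsections, recursively" registration described above can be sketched in plain Python. This is only an illustration of my understanding, not MkDocs' actual code: the Section class, the search_documents function and the "location" key are hypothetical names, and the real plugin works on rendered HTML rather than on a tree like this.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Section:
    """Hypothetical, simplified model of a page or (sub)section."""
    anchor: str  # empty string for the page itself
    title: str
    body: str
    children: List["Section"] = field(default_factory=list)

    def full_text(self) -> str:
        # A section's text includes the text of all its subsections.
        parts = [self.body] + [child.full_text() for child in self.children]
        return " ".join(parts)


def search_documents(page_url: str, node: Section) -> List[Dict[str, str]]:
    """Return one entry for the node, then one per nested subsection."""
    location = f"{page_url}#{node.anchor}" if node.anchor else page_url
    entries = [
        {"location": location, "title": node.title, "text": node.full_text()}
    ]
    for child in node.children:
        entries.extend(search_documents(page_url, child))
    return entries


page = Section("", "Guide", "Intro.", [
    Section("install", "Install", "Use pip.", [
        Section("extras", "Extras", "Optional parts."),
    ]),
])
docs = search_documents("guide/", page)
```

Here the sentence "Optional parts." ends up in three entries (the page, "Install" and "Extras"), which is the duplication mentioned above.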
         
*Other info I found:*

   - ReadTheDocs has quite interesting search features 
   <https://docs.readthedocs.io/en/stable/guides/advanced-search.html>
   - Someone made a lunr.js extension 
   <https://github.com/rmcgibbo/sphinxcontrib-lunrsearch> for Sphinx, but 
   it only indexes "objects", in a separate custom search widget. It is not 
   actively maintained.
   - I've looked into trying lunr in Sphinx for full-text search.  Building 
   an index would be quite simple with an EnvironmentCollector, but 
   leveraging incremental builds would not yield all the optimization one 
   could want because lunr dropped editable indices.  Here is a *not tested* 
   stub, which would still need to be integrated with Sphinx's APIs, to give 
   an idea:
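      - from docutils import nodes
        from lunr import get_default_builder  # lunr.py; import path assumed
        from sphinx.environment import BuildEnvironment


        class Search:
            def __init__(self, env: BuildEnvironment):
                self._env = env
                # Declare the reference key and the two searchable fields.
                self._builder = get_default_builder()
                self._builder.ref("id")
                self._builder.field("title")
                self._builder.field("text")

            def index_document(self, node: nodes.document):
                # Index the whole page first, then every (sub)section.
                self._builder.add(
                    self.extract_search_document(node, section=False)
                )
                # findall() needs a recent docutils; older ones use traverse().
                for element in node.findall(nodes.section):
                    # Pass the dict itself; wrapping it in {...} would try
                    # to build a set of dicts and raise a TypeError.
                    self._builder.add(self.extract_search_document(element))

            def extract_search_document(self, node: nodes.Node, section=True):
                # First title found under this node (raises StopIteration
                # if the node has no title at all).
                title_node = next(iter(node.findall(nodes.title)))
                uri = self._env.app.builder.get_target_uri(self._env.docname)
                if section:
                    # Section ids live on the section node, not on its title.
                    uri += "#" + node["ids"][0]

                return {
                    "id": uri,
                    "title": title_node.astext(),
                    "text": node.astext(),
                }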
   - All in all, I am not sure it is worth investing much in a 3rd-party 
   search engine such as lunr. I cannot yet prove that it would provide 
   much of an improvement over Sphinx' search.  Adding such dependencies to 
   Sphinx would probably not be acceptable anyway.  Even as a separate 
   extension, I don't clearly see an improvement here.
   - I see a lot of improvements that could be made by themes. I am not sure 
   whether Sphinx' client-side search JavaScript code could be used for 
   queries «as you type» efficiently, but having an overlay or modal result 
   display would be great in my opinion. Sadly, although my Python skills 
   are quite good and I can play with JS a bit, web development is something 
   I never invested time or focus in.  Working on this would thus require 
   tens of hours of unpleasantness, which is quite daunting, I must admit.

All in all, I would really like to help improve the search experience with 
Sphinx, especially on static websites outside of ReadTheDocs. I feel that 
the best early improvements have to be made in themes (improving the UI), 
and this is something I don't feel I can help much with.  I would gladly 
team up with anybody with web development skills to do something about it!

*MkDocs Search Screenshot*
[image: pydantic-search.png]
On Saturday, December 11, 2021 at 3:30:47 PM UTC-5 martin...@gmail.com 
wrote:

> Hello friends,
>
> is anybody aware of an article, blog post or other documentation that 
> helps understand how the built in search works?
>
> Regards, Martin
>
> -- 
>
> See our Sphinx made docs at https://docs.typo3.org/
>
>
