pg_trgm vs. Solr ngram

Chris Thu, 09 Feb 2023 18:21:00 -0800

Hello list

I'm pondering migrating an FTS application from Solr to Postgres, just because we use Postgres for everything else.

The application is basically fgrep with a web frontend. However, the indexed documents are very computer-network specific and contain a lot of hyphenated hostnames with dot-separated domains, as well as IPv4 and IPv6 addresses. In Solr I was using ngrams and customized the TokenizerFactories until more or less only whitespace was left as a separator, while [.:-_\d] remains part of the ngrams. This allows searching for ".12.255/32" or "xzy-eth5.example.org" without any false positives.
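
To make the idea concrete, here is a minimal Python sketch of that kind of tokenization. It is not the Solr configuration itself: the function names and the n-gram size of 3 are assumptions for illustration. Only whitespace splits tokens, so punctuation and digits stay inside the n-grams, and a final substring check weeds out candidates that merely share n-grams.

```python
def whitespace_ngrams(text, n=3):
    """Return the set of character n-grams of every whitespace-separated token.

    Punctuation and digits are kept, so ".12" or "y-e" are valid n-grams.
    Tokens shorter than n produce no n-grams in this sketch.
    """
    grams = set()
    for token in text.split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def may_contain(doc, query, n=3):
    """Candidate filter: every n-gram of the query must occur in the document."""
    return whitespace_ngrams(query, n) <= whitespace_ngrams(doc, n)


def search(docs, query, n=3):
    """Filter by n-grams first, then confirm with an exact substring check."""
    return [d for d in docs if may_contain(d, query, n) and query in d]
```

For example, `search(["route 10.0.12.255/32 via xzy-eth5.example.org"], ".12.255/32")` returns the document, while a query for "xzy-eth6.example.org" is already rejected by the n-gram filter.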

It looks like a straight conversion of this method is not possible, since the tokenization in pg_trgm is not configurable afaict. Is there some other good method to search for a random substring, including all the punctuation, using an index? Or a pg_trgm-style module that is more flexible, like the Solr/Lucene variant?
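
For comparison, a rough Python approximation of how pg_trgm extracts trigrams, going by its documentation: non-alphanumeric characters are ignored and act as word breaks, text is lowercased, and each word is padded with two leading spaces and one trailing space. The function name is hypothetical and the sketch is ASCII-only; the real module follows the database locale.

```python
import re


def pg_trgm_like(text):
    """Approximate pg_trgm trigram extraction (ASCII-only sketch)."""
    # Non-alphanumeric characters are dropped and act as word separators.
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = set()
    for word in words:
        # Two spaces are prefixed and one space is suffixed to each word.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams
```

Under this approximation ".12.255/32" yields the same trigrams as "12 255 32", so the dots and the slash cannot be matched through the index alone.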

Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun. Do I pretty much just need to change the emitted tokens, or will this lead to significant complications in the operators, indexes, etc.?


thanks for any hints & cheers
Christian
