Remove similar (geographic/named) results

Leonard Ehrenfried Fri, 29 Sep 2023 03:01:17 -0700

Hi *,

I'm indexing public transport stops in a Lucene 9.7.0 index in order to get a 
fuzzy stop search for start/end points for a trip planner.


The documents look like this:

[
  {
    "id": "MARTA:485",
    "name": "ASHBY STATION",
    "coordinate": {
      "lat": 33.756478,
      "lon": -84.41723
    }
  },
  {
    "id": "MARTA:486",
    "name": "ASHBY STATION",
    "coordinate": {
      "lat": 33.756477,
      "lon": -84.417328
    }
  },
  {
    "id": "MARTA:79496",
    "name": "ASHBY STATION - SOUTHBOUND",
    "coordinate": {
      "lat": 33.756281,
      "lon": -84.417724
    }
  },
  {
    "id": "MARTA:79028",
    "name": "ASHBY STATION - NORTHBOUND",
    "coordinate": {
      "lat": 33.756066,
      "lon": -84.417371
    }
  }
]

When I execute a term query for "ashby", all of the above results are returned.
I would like to ask for advice on how to de-duplicate the results. At the very
least, the first two results, which have identical names and are very close to
each other geographically, should be aggregated. Ideally there would also be a
fuzzy combination of the third and fourth results based on name similarity and
geographic closeness, but that is a secondary concern.
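
To make the first requirement concrete, here is a minimal sketch of the behaviour I'm after, written as a plain post-processing step over the hits rather than as a Lucene collector. Everything in it is hypothetical and not part of my current code: the Stop class, the dedupe helper and the 50 m threshold are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class StopDedup {
    // Hypothetical holder for one search hit; this is not a Lucene type.
    static final class Stop {
        final String id;
        final String name;
        final double lat;
        final double lon;

        Stop(String id, String name, double lat, double lon) {
            this.id = id;
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Great-circle distance in metres, using the haversine formula.
    static double distanceMeters(Stop a, Stop b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
    }

    // Walks the hits in ranked order and keeps a hit unless an already kept
    // hit has the same name (ignoring case) and lies within maxMeters of it.
    static List<Stop> dedupe(List<Stop> hits, double maxMeters) {
        List<Stop> kept = new ArrayList<>();
        for (Stop hit : hits) {
            boolean duplicate = false;
            for (Stop k : kept) {
                if (k.name.equalsIgnoreCase(hit.name)
                        && distanceMeters(k, hit) <= maxMeters) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

With the four documents above and an assumed threshold of 50 m, this would drop MARTA:486 (same name as MARTA:485 and only a few metres away) and keep the other three. It is quadratic in the number of hits and does not address the fuzzy NORTHBOUND/SOUTHBOUND case, so I'd prefer something that works inside the search itself.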

I've tried to read up on Collectors like DiversifiedTopDocsCollector and 
Aggregation but I'm having a bit of a hard time figuring out what is the best 
approach and how this slots into my current search code.

Can anyone give advice?

Many thanks.
--
  Leonard Ehrenfried
  m...@leonard.io - https://leonard.io
