Hi *, I'm indexing public transport stops in a Lucene 9.7.0 index in order to get a fuzzy stop search for start/end points for a trip planner.
The documents look like this:

  [
    {
      "id": "MARTA:485",
      "name": "ASHBY STATION",
      "coordinate": { "lat": 33.756478, "lon": -84.41723 }
    },
    {
      "id": "MARTA:486",
      "name": "ASHBY STATION",
      "coordinate": { "lat": 33.756477, "lon": -84.417328 }
    },
    {
      "id": "MARTA:79496",
      "name": "ASHBY STATION - SOUTHBOUND",
      "coordinate": { "lat": 33.756281, "lon": -84.417724 }
    },
    {
      "id": "MARTA:79028",
      "name": "ASHBY STATION - NORTHBOUND",
      "coordinate": { "lat": 33.756066, "lon": -84.417371 }
    }
  ]

When I execute a term query for "ashby", all of the above documents are returned.

I would like to ask for advice on how to de-duplicate the results. At the very least, the first two results, which have identical names and are very close to each other geographically, should be aggregated. Ideally there would also be a fuzzy combination of the third and fourth results based on similarity and geographic closeness, but that is a secondary concern.

I've tried to read up on Collectors like DiversifiedTopDocsCollector and Aggregation, but I'm having a hard time figuring out which approach is best and how it would slot into my current search code.

Can anyone give advice?

Many thanks.

-- 
Leonard Ehrenfried
m...@leonard.io - https://leonard.io