Hi *,
I'm indexing public transport stops in a Lucene 9.7.0 index in order to get a
fuzzy stop search for start/end points for a trip planner.
The documents look like this:
[
  {
    "id": "MARTA:485",
    "name": "ASHBY STATION",
    "coordinate": {
      "lat": 33.756478,
      "lon": -84.41723
    }
  },
  {
    "id": "MARTA:486",
    "name": "ASHBY STATION",
    "coordinate": {
      "lat": 33.756477,
      "lon": -84.417328
    }
  },
  {
    "id": "MARTA:79496",
    "name": "ASHBY STATION - SOUTHBOUND",
    "coordinate": {
      "lat": 33.756281,
      "lon": -84.417724
    }
  },
  {
    "id": "MARTA:79028",
    "name": "ASHBY STATION - NORTHBOUND",
    "coordinate": {
      "lat": 33.756066,
      "lon": -84.417371
    }
  }
]
When I execute a term query for "ashby" all of the above results are returned.
I would like to ask for advice on how to de-duplicate the results - at the very
least the first two results with identical names, which are very close to each
other geographically, should be aggregated. Ideally there would also be a fuzzy
combination of the third and fourth result based on similarity and geographic
closeness, but that is a secondary concern.
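To make the intended grouping concrete, here is a minimal sketch of the
behaviour I'm after, written as plain Java post-processing over the hits rather
than with any Lucene API. The class StopDedup, the Stop holder, the group()
helper and the 50 m threshold are all made up for illustration; they are not
part of my actual search code, and this only covers the identical-name case,
not the fuzzy one.

```java
import java.util.ArrayList;
import java.util.List;

public class StopDedup {

    /** Hypothetical stand-in for a search hit; fields mirror the JSON above. */
    static final class Stop {
        final String id;
        final String name;
        final double lat;
        final double lon;

        Stop(String id, String name, double lat, double lon) {
            this.id = id;
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in metres, using the haversine formula. */
    static double distanceMeters(Stop a, Stop b) {
        double earthRadius = 6_371_000.0; // mean Earth radius in metres
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * earthRadius * Math.asin(Math.sqrt(h));
    }

    /**
     * Greedy grouping: a hit joins the first group whose first member has the
     * same name (ignoring case) and lies within maxMeters; otherwise it starts
     * a new group. Hits are assumed to arrive in ranked order.
     */
    static List<List<Stop>> group(List<Stop> hits, double maxMeters) {
        List<List<Stop>> groups = new ArrayList<>();
        for (Stop hit : hits) {
            List<Stop> target = null;
            for (List<Stop> candidate : groups) {
                Stop representative = candidate.get(0);
                if (representative.name.equalsIgnoreCase(hit.name)
                        && distanceMeters(representative, hit) <= maxMeters) {
                    target = candidate;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                groups.add(target);
            }
            target.add(hit);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Stop> hits = List.of(
                new Stop("MARTA:485", "ASHBY STATION", 33.756478, -84.41723),
                new Stop("MARTA:486", "ASHBY STATION", 33.756477, -84.417328),
                new Stop("MARTA:79496", "ASHBY STATION - SOUTHBOUND", 33.756281, -84.417724),
                new Stop("MARTA:79028", "ASHBY STATION - NORTHBOUND", 33.756066, -84.417371));
        // Print one line per group: representative name and member count.
        for (List<Stop> g : group(hits, 50.0)) {
            System.out.println(g.get(0).name + " x" + g.size());
        }
    }
}
```

With the four documents above, the two plain "ASHBY STATION" entries end up in
one group and the northbound/southbound entries stay separate, which is the
minimum I'd like to achieve.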
I've tried to read up on collectors like DiversifiedTopDocsCollector and on
aggregation, but I'm having a hard time figuring out what the best approach is
and how it would slot into my current search code.
Can anyone give advice?
Many thanks.
--
Leonard Ehrenfried
[email protected] - https://leonard.io