sbp opened a new issue #17:
URL: https://github.com/apache/incubator-ponymail-foal/issues/17
In addition to the list discovery issue identified in PR #16, there is a second problem: the query code later uses `size=0` to prevent `sum_other_doc_count` from being greater than zero, and the Elasticsearch documentation strongly recommends against this:

> It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don’t use this on high-cardinality fields as this will kill both your CPU since terms need to be return sorted, and your network.

This means that the query will likely be very expensive on databases containing hundreds of thousands of messages, and `background.py` runs it once every couple of minutes or so. Yet `size=0` is necessary in order to enumerate all mailing lists accurately, because any limit lower than the number of lists leaves some of them out of the result.

The underlying issue here is that Elasticsearch is not designed for accurate queries of this nature over extremely large datasets. It may therefore be necessary to add an extra index for mailing lists, which would be updated whenever `archiver.py` receives another message.
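
To make the trade-off concrete, here is a minimal sketch of the two approaches as plain request bodies, with no Elasticsearch client involved. The field name `list_raw`, the function names, and the shape of the per-list document are assumptions made for illustration; they are not taken from the Pony Mail Foal code.

```python
from typing import Any, Dict, Tuple

# Hypothetical field name holding the list ID of each message (an assumption).
LIST_FIELD = "list_raw"


def expensive_list_aggregation() -> Dict[str, Any]:
    """Search body for the approach described in this issue:
    a terms aggregation with no limit on the number of buckets."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "aggs": {
            "lists": {
                # size=0 here means "return every term", which is the
                # setting the quoted documentation warns against.
                "terms": {"field": LIST_FIELD, "size": 0}
            }
        },
    }


def aggregation_is_complete(agg_result: Dict[str, Any]) -> bool:
    """True if no documents fell outside the returned buckets."""
    return agg_result.get("sum_other_doc_count", 0) == 0


def list_registry_document(list_id: str, epoch: int) -> Tuple[str, Dict[str, Any]]:
    """Document the archiver could write to a separate, small index each
    time a message arrives. Using the list ID as the document ID makes
    repeated writes overwrite one entry per list instead of adding more."""
    return list_id, {"list": list_id, "last_seen": epoch}
```

With one document per list, enumerating all lists becomes an ordinary search over a small index rather than an unbounded aggregation over every message. Whether the extra write per message is acceptable in `archiver.py` would still need to be measured.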
