sbp opened a new issue #17:
URL: https://github.com/apache/incubator-ponymail-foal/issues/17


   Besides the list discovery issue identified in PR #16, there is a second 
problem: the query code later uses `size=0` to keep `sum_other_doc_count` at 
zero, which the Elasticsearch documentation strongly recommends against:
   
   > It is possible to not limit the number of terms that are returned by 
setting `size` to `0`. Don’t use this on high-cardinality fields as this will 
kill both your CPU since terms need to be return sorted, and your network.
   
   This means that the query will likely be very expensive on databases 
containing hundreds of thousands of messages, and `background.py` runs it 
once every couple of minutes or so. However, `size=0` is necessary in order 
to enumerate all mailing lists accurately, since any finite limit can omit 
lists from the result.
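
   For illustration, here is a minimal sketch of the kind of terms aggregation 
being discussed. It is not the actual query code: the field name `list_raw`, 
the aggregation name `lists`, and both helper functions are assumptions made 
for this example. It only builds the request body and inspects a response 
dictionary, so it runs without an Elasticsearch server.

```python
def build_list_aggregation(field="list_raw", size=0):
    """Build a request body that buckets messages by mailing list.

    `field` and the aggregation name "lists" are hypothetical. A `size`
    of 0 asks for every term, as described in the quoted documentation.
    """
    return {"aggs": {"lists": {"terms": {"field": field, "size": size}}}}


def enumeration_is_complete(response, agg_name="lists"):
    """Return True when no documents fell outside the returned buckets."""
    return response["aggregations"][agg_name]["sum_other_doc_count"] == 0


# Made-up response shaped like a truncated terms aggregation result.
truncated_response = {
    "aggregations": {
        "lists": {
            "sum_other_doc_count": 1234,
            "buckets": [{"key": "<dev.example.org>", "doc_count": 5000}],
        }
    }
}
```

   A non-zero `sum_other_doc_count`, as in the made-up response above, means 
some messages belong to lists that were not returned, which is why a limited 
`size` cannot be trusted for enumeration.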
   
   The underlying issue here is that Elasticsearch is not designed for accurate 
queries of this nature over extremely large datasets. It may therefore be 
necessary to add an extra index for mailing lists, which would be updated 
whenever `archiver.py` receives another message.
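
   A rough sketch of that idea follows, under stated assumptions: the class 
`ListIndex` and its methods are hypothetical, and an in-memory counter stands 
in for what would be a separate Elasticsearch index. The point is only that 
the list of mailing lists is maintained at write time, so reading it never 
requires aggregating over every message.

```python
from collections import Counter


class ListIndex:
    """Hypothetical stand-in for a dedicated index of mailing lists."""

    def __init__(self):
        # One entry per mailing list, holding its message count.
        self._counts = Counter()

    def record_message(self, list_id):
        """Called once per archived message (assumed hook in the archiver)."""
        self._counts[list_id] += 1

    def all_lists(self):
        """Enumerate every known list without scanning the messages."""
        return sorted(self._counts)

    def message_count(self, list_id):
        """Return how many messages were recorded for one list."""
        return self._counts[list_id]


index = ListIndex()
for list_id in ["<dev.example.org>", "<users.example.org>", "<dev.example.org>"]:
    index.record_message(list_id)
```

   With this layout the cost of enumeration depends on the number of lists 
rather than the number of messages, at the price of one extra write per 
archived message.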
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

