On Sun, 14 Nov 2021 at 16:22, Daniel Gruno <[email protected]> wrote: > > On 14/11/2021 12.19, sebb wrote: > > -1 > > > > I think this commit should be reverted. > > > > I see no purpose for the body_short header. > > It is absolutely needed. It is what makes it possible to search large > lists with very large emails and not have everything crash around you.
I see now. I was confused by the use of doc['body'] in messages.query and overlooked the fact that it does not download the body attribute. However, I'm not sure that BODY_MAXLEN serves any purpose. body_short could include the '...' marker. > When you only need the first bit of the text, there is no need to grab > several megabytes of email body, especially not if you have thousands of > emails you need to parse. In a production environment I have access to, > this caused a significant speedup and went from grabbing 2GB of data per > search to only 50MB - search time and overall backend response time was > an order of magnitude faster after this had been implemented. It is a > very vital component of a responsive service. > > The only other way would be to use scripting inside the ES query, which > is both dangerous and requires additional setup. > > As for the length, 200 is the standard used throughout the interface. > Yes, you could change it, and yes you might need to reindex if so, but > who is realistically going to change it? > > > > > It wastes space in the database. > > It increases the data sent from database to server code. > > > > What if you want to change the length? > > Are you going to update the entire database each time the length is changed? > > > > Seems to me the sensible way to do this is in the status.lua plugin. > > This can then pick up a config item for the length. > > > > Sebb > > > > On Sun, 17 Oct 2021 at 21:32, sebb <[email protected]> wrote: > >> > >> On Sun, 17 Oct 2021 at 16:43, <[email protected]> wrote: > >>> > >>> This is an automated email from the ASF dual-hosted git repository. > >>> > >>> humbedooh pushed a commit to branch master > >>> in repository > >>> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git > >>> > >>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1 > >>> Author: Daniel Gruno <[email protected]> > >>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200 > >>> > >>> Add a separate header for short bodies for stats.py > >>> --- > >>> tools/archiver.py | 5 ++++- > >>> tools/mappings.yaml | 2 ++ > >>> 2 files changed, 6 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/tools/archiver.py b/tools/archiver.py > >>> index 1783d0e..dc12e39 100755 > >>> --- a/tools/archiver.py > >>> +++ b/tools/archiver.py > >>> @@ -584,6 +584,8 @@ class Archiver(object): # N.B. Also used by > >>> import-mbox.py > >>> ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest() > >>> > >>> notes.append(["ARCHIVE: Email archived as %s at %u" % > >>> (document_id, time.time())]) > >>> + body_unflowed = body.unflow() if body else "" > >>> + body_shortened = body_unflowed[:210] # 210 so that we can > >>> tell if > 200. > >>> > >> > >> -1 > >> > >> What's so special about 200 and 210? > >> > >> These numbers should be constants (with suitable docn) or possibly > >> configuration items. > >> > >> The only bare numbers I would expect to see in code are 0 and 1 (or -1). > >> > >>> output_json = { > >>> "from_raw": msg_metadata["from"], > >>> @@ -603,7 +605,8 @@ class Archiver(object): # N.B. Also used by > >>> import-mbox.py > >>> "private": private, > >>> "references": msg_metadata["references"], > >>> "in-reply-to": irt, > >>> - "body": body.unflow() if body else "", > >>> + "body": body_unflowed, > >>> + "body_short": body_shortened, > >>> "html_source_only": body and body.html_as_source or > >>> False, > >>> "attachments": attachments, > >>> "forum": (lid or "").strip("<>").replace(".", "@", 1), > >>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml > >>> index 4bb4978..6ad72d3 100644 > >>> --- a/tools/mappings.yaml > >>> +++ b/tools/mappings.yaml > >>> @@ -55,6 +55,8 @@ mbox: > >>> type: long > >>> body: > >>> type: text > >>> + body_short: > >>> + type: text > >>> cc: > >>> type: text > >>> date: >
