Re: [incubator-ponymail-foal] 01/02: Add a separate header for short bodies for stats.py

sebb Sun, 14 Nov 2021 09:32:41 -0800

On Sun, 14 Nov 2021 at 16:22, Daniel Gruno <[email protected]> wrote:
>
> On 14/11/2021 12.19, sebb wrote:
> > -1
> >
> > I think this commit should be reverted.
> >
> > I see no purpose for the body_short header.
>
> It is absolutely needed. It is what makes it possible to search large
> lists with very large emails and not have everything crash around you.


I see now.

I was confused by the use of doc['body'] in messages.query and
overlooked the fact that it does not download the body attribute.
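
For illustration, a minimal sketch of how a search can avoid downloading
the body attribute, assuming the query uses Elasticsearch source filtering.
The helper name and the field list are hypothetical and are not taken from
the actual messages.query code:

```python
def build_search_body(query_string, fields=("subject", "from", "body_short")):
    """Build a search request body that only returns the listed fields.

    Hypothetical helper: the field list is an assumption for illustration.
    Listing fields under "_source" tells Elasticsearch to return only those
    fields, so the full "body" is never sent back to the server code.
    """
    return {
        "query": {"simple_query_string": {"query": query_string}},
        "_source": list(fields),
    }


search_body = build_search_body("stats")
```

The full body can still be searched by the query; it is only omitted from
the returned documents.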

However, I'm not sure that BODY_MAXLEN serves any purpose.

body_short could include the '...' marker.
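
A rough sketch of that idea, assuming the marker is appended at archive
time so that no length check is needed later. The function name is
hypothetical, and the value 200 is the display length mentioned in the
quoted reply below:

```python
BODY_MAXLEN = 200  # assumed display length, per the quoted reply
ELLIPSIS = "..."   # marker appended only when the body was cut


def shorten_body(body, maxlen=BODY_MAXLEN, marker=ELLIPSIS):
    """Return body unchanged if it fits, else the first maxlen chars plus marker."""
    if len(body) <= maxlen:
        return body
    return body[:maxlen] + marker
```

With this approach the stored value already says whether it was truncated,
at the cost of needing a reindex if the length or the marker changes.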

> When you only need the first bit of the text, there is no need to grab
> several megabytes of email body, especially not if you have thousands of
> emails you need to parse. In a production environment I have access to,
> this caused a significant speedup and went from grabbing 2GB of data per
> search to only 50MB - search time and overall backend response time were
> an order of magnitude faster after this had been implemented. It is a
> vital component of a responsive service.
>
> The only other way would be to use scripting inside the ES query, which
> is both dangerous and requires additional setup.
>
> As for the length, 200 is the standard used throughout the interface.
> Yes, you could change it, and yes you might need to reindex if so, but
> who is realistically going to change it?
>

> >
> > It wastes space in the database.
> > It increases the data sent from database to server code.
> >
> > What if you want to change the length?
> > Are you going to update the entire database each time the length is changed?
> >
> > Seems to me the sensible way to do this is in the status.lua plugin.
> > This can then pick up a config item for the length.
> >
> > Sebb
> >
> > On Sun, 17 Oct 2021 at 21:32, sebb <[email protected]> wrote:
> >>
> >> On Sun, 17 Oct 2021 at 16:43, <[email protected]> wrote:
> >>>
> >>> This is an automated email from the ASF dual-hosted git repository.
> >>>
> >>> humbedooh pushed a commit to branch master
> >>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>>
> >>> commit 2dff9351d119ddee5c5e0171991c54b1911f05b1
> >>> Author: Daniel Gruno <[email protected]>
> >>> AuthorDate: Sun Oct 17 17:41:51 2021 +0200
> >>>
> >>>      Add a separate header for short bodies for stats.py
> >>> ---
> >>>   tools/archiver.py   | 5 ++++-
> >>>   tools/mappings.yaml | 2 ++
> >>>   2 files changed, 6 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/tools/archiver.py b/tools/archiver.py
> >>> index 1783d0e..dc12e39 100755
> >>> --- a/tools/archiver.py
> >>> +++ b/tools/archiver.py
> >>> @@ -584,6 +584,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
> >>>               ghash = hashlib.md5(mailaddr.encode("utf-8")).hexdigest()
> >>>
> >>>               notes.append(["ARCHIVE: Email archived as %s at %u" % (document_id, time.time())])
> >>> +            body_unflowed = body.unflow() if body else ""
> >>> +            body_shortened = body_unflowed[:210]  # 210 so that we can tell if > 200.
> >>>
> >>
> >> -1
> >>
> >> What's so special about 200 and 210?
> >>
> >> These numbers should be constants (with suitable documentation) or possibly
> >> configuration items.
> >>
> >> The only bare numbers I would expect to see in code are 0 and 1 (or -1).
> >>
> >>>               output_json = {
> >>>                   "from_raw": msg_metadata["from"],
> >>> @@ -603,7 +605,8 @@ class Archiver(object):  # N.B. Also used by import-mbox.py
> >>>                   "private": private,
> >>>                   "references": msg_metadata["references"],
> >>>                   "in-reply-to": irt,
> >>> -                "body": body.unflow() if body else "",
> >>> +                "body": body_unflowed,
> >>> +                "body_short": body_shortened,
> >>>                   "html_source_only": body and body.html_as_source or False,
> >>>                   "attachments": attachments,
> >>>                   "forum": (lid or "").strip("<>").replace(".", "@", 1),
> >>> diff --git a/tools/mappings.yaml b/tools/mappings.yaml
> >>> index 4bb4978..6ad72d3 100644
> >>> --- a/tools/mappings.yaml
> >>> +++ b/tools/mappings.yaml
> >>> @@ -55,6 +55,8 @@ mbox:
> >>>             type: long
> >>>       body:
> >>>         type: text
> >>> +    body_short:
> >>> +      type: text
> >>>       cc:
> >>>         type: text
> >>>       date:
>
