-------- Original Message --------
From: Mark Sapiro [mailto:m...@msapiro.net]
Sent: Friday, April 9, 2021, 19:07 UTC


> On 4/9/21 5:55 AM, Mark Dale via Mailman-Users wrote:
>>
>> In the archive's downloaded .txt (and also .gz) file, the non-ascii 
>> characters are missing and displayed as "?".
> ...
>> Any advice on getting the non-ascii characters written into the archive .txt 
>> file would be gratefully received.
> 
> 
> The message is prepared for the .txt file by the Article.as_text()
> method in HyperArch.py
> <https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/Mailman/Archiver/HyperArch.py#L563>.
> In order to do the email address obfuscation in the message body,
> whether or not ARCHIVER_OBSCURES_EMAILADDRS is True, the method first
> converts the body to unicode using the charset of the list's language
> and then after possible obfuscation, converts it back, again using the
> charset of the list's language. Both these conversions use
> `errors=replace` which replaces any characters not in the charset with,
> in the case of ascii, `?`.
> 
> One way to avoid this replacement would be to change the charset for
> English from ascii to utf-8. See <https://wiki.list.org/x/15958250>.
> 
> This isn't a complete solution in the case where the non-ascii
> characters are encoded other than `utf-8`, e.g., `iso-8859-1`, in the
> original message, but will probably handle most cases.
> 
> 

Hi Mark,

Thank you for the comprehensive explanation of the process.

I haven't made any headway with the suggested solution of modifying the 
mm_cfg.py file. 

The author says: "The one known downside of doing this is that Python's email 
library which is used by Mailman will base64 encode charset=utf-8 message 
bodies which makes the raw message body impossible to read by eye or search 
with simple tools like grep." -- which, on reading, had me thinking I would be 
jumping from the frying pan into the fire.
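
To see what that downside looks like in practice, here is a minimal sketch 
using only Python's standard email package. It is written for Python 3, 
whereas Mailman 2.1 runs on Python 2, so treat it as an illustration of the 
behaviour the wiki page describes rather than as Mailman's own code. The 
sample sentence is made up.

```python
from email.charset import BASE64, Charset
from email.mime.text import MIMEText

# The email package registers utf-8 with base64 as its body encoding.
utf8_uses_base64 = Charset("utf-8").body_encoding == BASE64

# Build a text/plain part with a utf-8 charset (sample text is invented).
msg = MIMEText("Fran\u00e7ois posted this.\n", "plain", "utf-8")

cte = msg["Content-Transfer-Encoding"]   # transfer encoding chosen by the library
raw_payload = msg.get_payload()          # what is stored: base64 text
decoded = msg.get_payload(decode=True).decode("utf-8")  # the readable body

print(cte)
print(raw_payload)
print(decoded)
```

The stored payload is base64 text, which is why grep on the raw message would 
not find the name, although decoding the payload gives the original text back.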

However, in the spirit of things, I made the addition to the mm_cfg.py and ...

As an example, take a subscriber's name that appears in the archive.

François -- as seen in the mbox and Pipermail web archive: the cedilla is 
displayed correctly.

Fran?ois -- as seen in the normal downloaded txt: the cedilla is replaced by a 
question mark (as expected).

François -- as seen in the mm_cfg modified download txt: the cedilla is 
replaced by odd characters.
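
For what it's worth, the three renderings can be reproduced with plain Python 
string methods. This is only a sketch (Python 3, no Mailman code involved), 
and the third case rests on an assumption I have not verified: that the 
modified .txt file actually contains correct UTF-8 bytes and that whatever I 
view it with is reading them as ISO-8859-1.

```python
# Sketch only: plain str/bytes methods, not Mailman's archiver code.
name = "Fran\u00e7ois"  # the subscriber's name, with c-cedilla

# Charset ascii with errors="replace": the cedilla cannot be represented,
# so it is written out as "?".
ascii_bytes = name.encode("ascii", "replace")

# Charset utf-8: the cedilla is written as two bytes, 0xC3 0xA7.
utf8_bytes = name.encode("utf-8", "replace")

# Assumption: a viewer that treats those UTF-8 bytes as ISO-8859-1 shows
# each of the two bytes as its own character.
misread = utf8_bytes.decode("iso-8859-1")

print(ascii_bytes)
print(utf8_bytes)
print(misread)
```

If that assumption holds, the odd characters would be a display issue on the 
reading side rather than damage in the file itself, and opening the file as 
UTF-8 should show the name correctly.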


In short, no joy.

So I'm thinking that if the part of HyperArch.py that does the email address 
obfuscation (and back again) is removed, would that be a step in the direction 
I want to go?

My Python-fu is way less than zero but I'm looking at lines 563 -- 600. Or is 
my thinking completely bonkers? 

Regards,
Mark




------------------------------------------------------
Mailman-Users mailing list -- mailman-users@python.org
To unsubscribe send an email to mailman-users-le...@python.org
https://mail.python.org/mailman3/lists/mailman-users.python.org/
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: https://www.mail-archive.com/mailman-users@python.org/
    https://mail.python.org/archives/list/mailman-users@python.org/
