[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-22 Thread Mark Dale via Mailman-Users



-------- Original Message --------
From: Stephen J. Turnbull [mailto:turnbull.stephen...@u.tsukuba.ac.jp]
Sent: Wednesday, April 21, 2021, 18:28 UTC

> In fact, most modern systems will negotiate compressed streams, so if
> you provide a .txt to your webserver, the client will tell the server
> "hey, I know how to gunzip", the server will automatically gzip, the
> client gunzip, and you know nothing about it except that you have text
> onscreen.
> 
> It's unclear what the system will do if offered a .txt.gz file.  If
> the server is smart, it might say
> 
> Content-Type: text/plain; name=whatever.txt  <-- note: no .gz
> Content-Encoding: gzip
> 
> and the end result is as above.  But it's not obviously a good idea
> for the server to second-guess the admin that way.
> 
> It's plausible that if the server just sends it as a binary, the
> client will say, "oh, they gzipped it on purpose, I should treat it as
> a file and save it", or it might say, "I know what a .txt is", and go
> ahead and transparently ungzip it.  Clients are reliably unreliable as
> a class -- some users want Do What I Mean, some want Do What I Say,
> and different clients will cater to different users.
> 
> Bottom line: if you're sure you want your .txt files treated as plain
> text and displayed as conveniently as possible, ungzip them!  Very
> likely you won't use any more bandwidth (and by the way, modern
> servers tend to cache that gzipped blob in case somebody asks for it
> again, so on-the-fly compression doesn't necessarily waste hours of
> CPU).
> 
> If for some reason you'd prefer that they be gzipped at both ends,
> that's probably more work to guarantee.


Thanks very much for this Steve. The learning from you guys never stops :-) 

And "unzip" was the pick of the day.

Best,
Mark
--
Mailman-Users mailing list -- mailman-users@python.org
To unsubscribe send an email to mailman-users-le...@python.org
https://mail.python.org/mailman3/lists/mailman-users.python.org/
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: https://www.mail-archive.com/mailman-users@python.org/
https://mail.python.org/archives/list/mailman-users@python.org/


[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-21 Thread Stephen J. Turnbull
Mark Sapiro writes:

 > In short, the file contains just what it should, but there is a
 > Content-Transfer-Encoding issue.

Technical niggle, probably not relevant to the issue itself:

The charset parameter is an attribute of Content-Type.
Content-Transfer-Encoding should be transparent to this problem.
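
To make the distinction concrete, here is a small sketch using Python's
standard email package (the header values are made up for illustration).
The charset is read from the Content-Type header, and changing
Content-Transfer-Encoding does not affect it:

```python
from email.message import Message

# A hypothetical message part; the header values are illustrative only.
msg = Message()
msg["Content-Type"] = 'text/plain; charset="utf-8"'
msg["Content-Transfer-Encoding"] = "8bit"

# The charset is a parameter of Content-Type ...
charset = msg.get_content_charset()

# ... while Content-Transfer-Encoding is a separate header that only
# describes how the bytes were wrapped for transport.
cte = msg["Content-Transfer-Encoding"]

print(charset)  # utf-8
print(cte)      # 8bit
```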

Steve


[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-21 Thread Stephen J. Turnbull
A bit OT, I'm glossing Mark Sapiro's explanation of compressed file
handling in Mailman archive downloads.

Mark Dale via Mailman-Users writes:

 > Thank you Mark, that information is appreciated and I've made the change.

I'm glad you find it useful.  Note that the story is a little more
subtle than Mark Sapiro makes it here:

 > > However, the point of this post is to point out that the .txt.gz files
 > > are an anachronism from the days when the bit of bandwidth saved by
 > > delivering a compressed version was important to more than a few ancient
 > > curmudgeons like me.
 > > 
 > > These days, the bandwidth savings is unimportant and is probably offset
 > > by the redundant storage and processing for the .txt.gz files.

In fact, most modern systems will negotiate compressed streams, so if
you provide a .txt to your webserver, the client will tell the server
"hey, I know how to gunzip", the server will automatically gzip, the
client gunzip, and you know nothing about it except that you have text
onscreen.
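
A toy model of that negotiation, as a sketch only: `respond` and
`receive` are hypothetical stand-ins for a web server and a client, and
the parsing of the client's Accept-Encoding value is deliberately
simplified (it ignores q-values). In HTTP the header that marks the
compressed stream is named Content-Encoding.

```python
import gzip

def respond(body: bytes, accept_encoding: str):
    """Hypothetical server: gzip the stored .txt if the client offers gzip."""
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    offered = [tok.split(";")[0].strip() for tok in accept_encoding.split(",")]
    if "gzip" in offered:
        headers["Content-Encoding"] = "gzip"
        return headers, gzip.compress(body)
    return headers, body

def receive(headers, payload: bytes) -> str:
    """Hypothetical client: undo the transport compression, then decode."""
    if headers.get("Content-Encoding") == "gzip":
        payload = gzip.decompress(payload)
    return payload.decode("utf-8")

body = "François wrote to the list\n".encode("utf-8")
headers, payload = respond(body, "gzip, deflate")
text = receive(headers, payload)

print(headers.get("Content-Encoding"))  # gzip
print(text == body.decode("utf-8"))     # True
```

The admin only ever stored the plain .txt; the compression existed on
the wire and nowhere else.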

It's unclear what the system will do if offered a .txt.gz file.  If
the server is smart, it might say

Content-Type: text/plain; name=whatever.txt  <-- note: no .gz
Content-Encoding: gzip

and the end result is as above.  But it's not obviously a good idea
for the server to second-guess the admin that way.

It's plausible that if the server just sends it as a binary, the
client will say, "oh, they gzipped it on purpose, I should treat it as
a file and save it", or it might say, "I know what a .txt is", and go
ahead and transparently ungzip it.  Clients are reliably unreliable as
a class -- some users want Do What I Mean, some want Do What I Say,
and different clients will cater to different users.

Bottom line: if you're sure you want your .txt files treated as plain
text and displayed as conveniently as possible, ungzip them!  Very
likely you won't use any more bandwidth (and by the way, modern
servers tend to cache that gzipped blob in case somebody asks for it
again, so on-the-fly compression doesn't necessarily waste hours of
CPU).

If for some reason you'd prefer that they be gzipped at both ends,
that's probably more work to guarantee.

Steve



[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-20 Thread Mark Dale via Mailman-Users



-------- Original Message --------
From: Mark Sapiro [mailto:m...@msapiro.net]
Sent: Wednesday, April 21, 2021, 01:53 UTC

> Slightly off topic, but after the cron/nightly_gzip job runs, the
> .txt.gz file will be updated with the contents from the .txt file.
> 
> However, the point of this post is to point out that the .txt.gz files
> are an anachronism from the days when the bit of bandwidth saved by
> delivering a compressed version was important to more than a few ancient
> curmudgeons like me.
> 
> These days, the bandwidth savings is unimportant and is probably offset
> by the redundant storage and processing for the .txt.gz files.
> 
> If you want to get rid of these files, see
> .


Thank you Mark, that information is appreciated and I've made the change.




[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-20 Thread Mark Sapiro
On 4/20/21 5:20 PM, Mark Dale via Mailman-Users wrote:
> 
> Just to clarify: there are two .txt files ...
> 
> (A) an archive .txt.gz file before I made the change to the mm_cfg.py file; 
> and 
> (B) an archive .txt.gz file after I made the change.


Slightly off topic, but after the cron/nightly_gzip job runs, the
.txt.gz file will be updated with the contents from the .txt file.

However, the point of this post is to point out that the .txt.gz files
are an anachronism from the days when the bit of bandwidth saved by
delivering a compressed version was important to more than a few ancient
curmudgeons like me.

These days, the bandwidth savings is unimportant and is probably offset
by the redundant storage and processing for the .txt.gz files.

If you want to get rid of these files, see
.


-- 
Mark Sapiro                          The highway is for gamblers,
San Francisco Bay Area, California   better use your sense - B. Dylan


[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-20 Thread Mark Dale via Mailman-Users


-------- Original Message --------
From: Mark Sapiro [mailto:m...@msapiro.net]
Sent: Tuesday, April 20, 2021, 18:55 UTC

> On 4/19/21 10:43 PM, Mark Dale via Mailman-Users wrote:
>>
>> FranÃ§ois -- as seen in the mm_cfg modified download txt: the cedilla
>> replaced by odd characters.
> 
> How are you viewing the .txt file? The two bytes C3 A7 are the utf-8
> representation of the c-cedilla character. If you view that file as
> iso-8859-1 (latin-1 or western) encoding, you will see those two bytes
> as Ã§, but if you view it as utf-8 encoding, you will see the c-cedilla.
> 
> In short, the file contains just what it should, but there is a
> Content-Transfer-Encoding issue. If you are viewing it in a browser, the
> issue is the default content character set in your web server. For
> example with Apache something like
> 
> AddCharset utf-8 .txt
> 
> will do what you want, or perhaps your browser has a selection. E.g.,
> Firefox has a text encoding selection in the View menu and you want
> Unicode, not Western.
> 
> If you are actually downloading the file and viewing it with something
> else, the issue is with whatever you are viewing it with.
> 


... Uh-oh ...  ... you're right that the issue is what I'm 
viewing it with. 

Just to clarify: there are two .txt files ...

(A) an archive .txt.gz file before I made the change to the mm_cfg.py file; and 
(B) an archive .txt.gz file after I made the change.

I followed the txt.gz link on the Pipermail page and got the options to 
"Download" or "Open file".

FAIL -- On choosing the "Open file" with ArchiveManager/JEdit, File-A showed 
the c-cedilla replaced by the question-mark; and File-B showed it replaced by 
the Ã§ characters.

SUCCESS -- However, choosing "Download", then gunzip and then open with JEdit I 
get a better result: File A showed the c-cedilla replaced by the question-mark 
as expected; but File-B shows the c-cedilla (happy days!). 
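
For anyone who would rather script the "download, gunzip, open" route
than depend on a viewer's defaults, a minimal sketch follows. The file
name is hypothetical, and the first half only fabricates a stand-in for
a downloaded archive so the example is self-contained.

```python
import gzip
import os
import tempfile

# Stand-in for a downloaded archive file; the name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "2021-April.txt.gz")
with gzip.open(path, "wb") as f:
    f.write("From: François\n".encode("utf-8"))

# Gunzip and decode explicitly as utf-8 instead of trusting whatever
# encoding the viewing program would guess.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text = f.read()

print(text == "From: François\n")  # True
```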


So in short, Mark Sapiro's recommended fix -- https://wiki.list.org/x/15958250 
-- has cracked this little chestnut.

Many thanks once again Mark.






[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-20 Thread Mark Sapiro
On 4/19/21 10:43 PM, Mark Dale via Mailman-Users wrote:
> 
> FranÃ§ois -- as seen in the mm_cfg modified download txt: the cedilla replaced
> by odd characters.

How are you viewing the .txt file? The two bytes C3 A7 are the utf-8
representation of the c-cedilla character. If you view that file as
iso-8859-1 (latin-1 or western) encoding, you will see those two bytes
as Ã§, but if you view it as utf-8 encoding, you will see the c-cedilla.
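
A minimal Python sketch of the same point (the variable names are only
illustrative): the two bytes are identical in both cases, and only the
charset assumed by the viewer differs.

```python
# The two bytes C3 A7 as stored in the archive .txt file.
raw = bytes([0xC3, 0xA7])

# Read as utf-8, the pair is one character: c-cedilla (U+00E7).
as_utf8 = raw.decode("utf-8")

# Read as iso-8859-1, each byte is its own character: U+00C3 then U+00A7.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))  # 1 2
```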

In short, the file contains just what it should, but there is a
Content-Transfer-Encoding issue. If you are viewing it in a browser, the
issue is the default content character set in your web server. For
example with Apache something like

AddCharset utf-8 .txt

will do what you want, or perhaps your browser has a selection. E.g.,
Firefox has a text encoding selection in the View menu and you want
Unicode, not Western.

If you are actually downloading the file and viewing it with something
else, the issue is with whatever you are viewing it with.

> In short, no joy.
> 
> So I'm thinking that if the part of HyperArch.py that does the email address 
> obfuscation (and back again) is removed, would that be a step in the 
> direction I want to go?
> 
> My Python foo is way less than zero but I'm looking at lines 563 -- 600. Or 
> is my thinking completely bonkers? 


That won't help. As I said, the file is now correct and no unrecognized
characters have been replaced, so modifying that code by say deleting
lines 587-599 won't change anything.


-- 
Mark Sapiro                          The highway is for gamblers,
San Francisco Bay Area, California   better use your sense - B. Dylan


[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-19 Thread Mark Dale via Mailman-Users


-------- Original Message --------
From: Mark Sapiro [mailto:m...@msapiro.net]
Sent: Friday, April 9, 2021, 19:07 UTC


> On 4/9/21 5:55 AM, Mark Dale via Mailman-Users wrote:
>>
>> In the archive's downloaded .txt (and also .gz) file, the non-ascii 
>> characters are missing and displayed as "?".
> ...
>> Any advice on getting the non-ascii characters written into the archive .txt 
>> file would be gratefully received.
> 
> 
> The message is prepared for the .txt file by the Article.as_text()
> method in HyperArch.py
> .
> In order to do the email address obfuscation in the message body,
> whether or not ARCHIVER_OBSCURES_EMAILADDRS is True, the method first
> converts the body to unicode using the charset of the list's language
> and then after possible obfuscation, converts it back, again using the
> charset of the list's language. Both these conversions use
> `errors=replace` which replaces any characters not in the charset with,
> in the case of ascii, `?`.
> 
> One way to avoid this replacement would be to change the charset for
> English from ascii to utf-8. See .
> 
> This isn't a complete solution in the case where the non-ascii
> characters are encoded other than `utf-8`, e.g., `iso-8859-1`, in the
> original message, but will probably handle most cases.
> 
> 

Hi Mark,

Thank you for the comprehensive explanation of the process.

I haven't made any headway with the suggested solution of modifying the 
mm_cfg.py file. 

The author says: "The one known downside of doing this is that Python's email 
library which is used by Mailman will base64 encode charset=utf-8 message 
bodies which makes the raw message body impossible to read by eye or search 
with simple tools like grep." -- which, on reading, had me thinking I will be 
jumping from the frying pan into the fire.
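
The base64 behaviour the FAQ author describes can be seen with the
standard library alone. This is a sketch in Python 3 syntax (Mailman 2.1
itself runs on Python 2, so treat it as an illustration of the email
package's default, not of Mailman's own code):

```python
import base64
from email.charset import BASE64, Charset
from email.mime.text import MIMEText

# The email package's default body encoding for utf-8 is base64.
utf8_uses_base64 = Charset("utf-8").body_encoding == BASE64

msg = MIMEText("François", _charset="utf-8")
cte = msg["Content-Transfer-Encoding"]
payload = msg.get_payload()

# The raw payload is not readable by eye or by grep, but it decodes
# back to the original text without loss.
decoded = base64.b64decode(payload).decode("utf-8")

print(utf8_uses_base64)       # True
print(cte)                    # base64
print(decoded == "François")  # True
```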

However, in the spirit of things, I made the addition to the mm_cfg.py and ...

As an example, using a subscriber's name that appears in the archive.

François -- as seen in the mbox and Pipermail web archive: the cedilla is 
displayed correctly.

Fran?ois -- as seen in the normal downloaded txt: the cedilla is replaced by 
question mark (as expected).

FranÃ§ois -- as seen in the mm_cfg modified download txt: the cedilla replaced 
by odd characters.


In short, no joy.

So I'm thinking that if the part of HyperArch.py that does the email address 
obfuscation (and back again) is removed, would that be a step in the direction 
I want to go?

My Python foo is way less than zero but I'm looking at lines 563 -- 600. Or is 
my thinking completely bonkers? 

Regards,
Mark






[Mailman-Users] Re: Non-ascii characters missing from Pipermail archive txt and gz downloads

2021-04-09 Thread Mark Sapiro
On 4/9/21 5:55 AM, Mark Dale via Mailman-Users wrote:
> 
> In the archive's downloaded .txt (and also .gz) file, the non-ascii 
> characters are missing and displayed as "?".
...
> Any advice on getting the non-ascii characters written into the archive .txt 
> file would be gratefully received.


The message is prepared for the .txt file by the Article.as_text()
method in HyperArch.py
.
In order to do the email address obfuscation in the message body,
whether or not ARCHIVER_OBSCURES_EMAILADDRS is True, the method first
converts the body to unicode using the charset of the list's language
and then after possible obfuscation, converts it back, again using the
charset of the list's language. Both these conversions use
`errors=replace` which replaces any characters not in the charset with,
in the case of ascii, `?`.
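
The "converts it back" step can be illustrated with a short sketch. This
is Python 3 syntax with illustrative names, not the actual HyperArch.py
code, and it starts from text that has already been converted to
unicode:

```python
# Body text after the first conversion to unicode.
text = "François"

# Converting back with the ascii charset: c-cedilla is not in ascii,
# so errors="replace" substitutes a question mark.
as_ascii = text.encode("ascii", errors="replace")

# Converting back with utf-8: every character is representable.
as_utf8 = text.encode("utf-8", errors="replace")

print(as_ascii)  # b'Fran?ois'
print(as_utf8)   # b'Fran\xc3\xa7ois'
```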

One way to avoid this replacement would be to change the charset for
English from ascii to utf-8. See .

This isn't a complete solution in the case where the non-ascii
characters are encoded other than `utf-8`, e.g., `iso-8859-1`, in the
original message, but will probably handle most cases.
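
A sketch of the remaining gap, again in Python 3 syntax and assuming the
body bytes really are iso-8859-1: once the list charset is utf-8, a
lone E7 byte is not valid utf-8, so the first conversion replaces it.

```python
# A body whose bytes are iso-8859-1: c-cedilla is the single byte E7.
latin1_body = "François".encode("iso-8859-1")

# Decoding with utf-8: E7 followed by "o" is not a valid utf-8
# sequence, so errors="replace" substitutes U+FFFD.
decoded = latin1_body.decode("utf-8", errors="replace")

print(decoded == "Fran\ufffdois")  # True
```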


-- 
Mark Sapiro                          The highway is for gamblers,
San Francisco Bay Area, California   better use your sense - B. Dylan