Re: [Mailman-Developers] Patch for HyperArch

2016-03-12 Thread Stephen J. Turnbull
Mark Sapiro writes:
 > On 03/12/2016 08:23 AM, Stephen J. Turnbull wrote:
 > > Mark Sapiro writes:
 > > 
 > >  > The Received: header check is important. For an "imported" mbox, the
 > >  > From_ separators may reflect when the mbox was exported from its source
 > >  > rather than the message date. If the messages have Received: headers,
 > >  > the later ones at least tend to have good dates.
 > > 
 > > Overengineering (seems to be becoming a habit?) perhaps, but if you're
 > > going to parse one Received field, why not do them all, sort, and take
 > > the latest reasonable one?  Leaving the sorted list on msg_data might
 > > also be useful to spam filters (although we don't really want to
 > > recommend spam filtering in Mailman...).
 > 
 > 
 > I see your point,

About "overengineering"?  :-)

Gotcha on the rest, but "overengineered, yes" was all you needed to say.
(I guess the "header field contents are kept in a list, ordered as you
would expect" datum is generally useful, though.  Thanks for explaining
that, even if it's not part of the spec.)

___
Mailman-Developers mailing list
Mailman-Developers@python.org
https://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://wiki.list.org/x/AgA3
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
https://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: http://wiki.list.org/x/QIA9


Re: [Mailman-Developers] Patch for HyperArch

2016-03-12 Thread Mark Sapiro
On 03/12/2016 08:23 AM, Stephen J. Turnbull wrote:
> Mark Sapiro writes:
> 
>  > The Received: header check is important. For an "imported" mbox, the
>  > From_ separators may reflect when the mbox was exported from its source
>  > rather than the message date. If the messages have Received: headers,
>  > the later ones at least tend to have good dates.
> 
> Overengineering (seems to be becoming a habit?) perhaps, but if you're
> going to parse one Received field, why not do them all, sort, and take
> the latest reasonable one?  Leaving the sorted list on msg_data might
> also be useful to spam filters (although we don't really want to
> recommend spam filtering in Mailman...).


I see your point, but my feeling is that bad dates tend to come from the
original poster's machine so that if the Date: header is bad, maybe the
first (bottom-most in the message headers) Received: header also has a
bad date, but subsequent ones are likely good.

I think the likelihood that the last (top-most) Received: date is also
bad but an intermediate one is good is vanishingly small.

I also note that the docs say that in the case of multiple 'Xxx:'
headers, the one returned by email.message.get('xxx') is indeterminate,
but I've looked at the code and in the message object, the headers are
kept in a list (not a dictionary) in the order parsed from the original
text, so get() which returns the first found will reliably return the
top-most one.
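
The ordering behaviour described above can be checked with a short sketch. It uses Python 3's `email` package rather than the Python 2 one that Mailman 2.1 runs on, and the sample header values are made up for illustration. That `get()` returns the top-most field is an observed implementation detail, not a documented guarantee.

```python
import email

# Hypothetical message with two Received: fields. The top-most field is
# the one added by the last hop.
RAW = (
    "Received: by last.example.org; Mon, 18 Dec 2000 02:34:21 +0100\n"
    "Received: by first.example.org; Mon, 18 Dec 2000 03:33:39 +0200\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(RAW)

# get() returns the first matching field in parse order (the top-most).
top_most = msg.get('received')

# get_all() returns every matching field, in the order they appear.
all_received = msg.get_all('received')

print(top_most)
print(len(all_received))
```

If the order ever mattered enough to rely on, `get_all()` makes the choice explicit: index 0 is the top-most field and index -1 the bottom-most.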

Also note that this change really only affects processing of imported
mailboxes with bin/arch. For posts to a list being archived, ArchRunner
has already fixed bad dates and even if it hasn't because the site set
ARCHIVER_CLOBBER_DATE_POLICY = 0, ArchRunner still added an
X-List-Received-Date: header and pipermail._set_date() will look at that
before looking at any Received: headers.

So we're really only dealing with defective messages from imported
mailboxes, and they often won't even have Received: headers.

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Developers] Patch for HyperArch

2016-03-12 Thread Stephen J. Turnbull
Mark Sapiro writes:

 > The Received: header check is important. For an "imported" mbox, the
 > From_ separators may reflect when the mbox was exported from its source
 > rather than the message date. If the messages have Received: headers,
 > the later ones at least tend to have good dates.

Overengineering (seems to be becoming a habit?) perhaps, but if you're
going to parse one Received field, why not do them all, sort, and take
the latest reasonable one?  Leaving the sorted list on msg_data might
also be useful to spam filters (although we don't really want to
recommend spam filtering in Mailman...).
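
A minimal sketch of the "parse them all, sort, take the latest reasonable one" idea follows. It is written for Python 3's `email` package, not Mailman's Python 2 code base. The function names are hypothetical, and the 15-day window is the default the thread gives for `mm_cfg.ARCHIVER_ALLOWABLE_SANE_DATE_SKEW`.

```python
import email
import time
from email.utils import mktime_tz, parsedate_tz

# Assumed sanity window: 15 days, in seconds.
SANE_SKEW = 15 * 24 * 3600


def received_timestamps(msg, now=None):
    """Return sorted, sane epoch timestamps from all Received: fields."""
    now = time.time() if now is None else now
    stamps = []
    for field in msg.get_all('received', []):
        # The date-time follows the last ';' in a Received: field.
        _, sep, datestr = field.rpartition(';')
        if not sep:
            continue
        parsed = parsedate_tz(datestr.strip())
        if parsed is None:
            continue
        try:
            stamp = mktime_tz(parsed)
        except (TypeError, ValueError, OverflowError):
            continue
        # Keep only dates in the current epoch and not too far ahead.
        if 0 <= stamp <= now + SANE_SKEW:
            stamps.append(stamp)
    return sorted(stamps)


def latest_reasonable(msg, now=None):
    """Return the latest sane Received: timestamp, or None."""
    stamps = received_timestamps(msg, now)
    return stamps[-1] if stamps else None


# Made-up sample: two plausible hops and one with a bogus year.
sample = email.message_from_string(
    "Received: by a.example.org; Mon, 18 Dec 2000 02:34:21 +0100 (MET)\n"
    "Received: by b.example.org; Mon, 18 Dec 2000 03:33:39 +0200 (IST)\n"
    "Received: by c.example.org; 01 Jul 0102 04:46:55 -0900\n"
    "\n"
    "body\n"
)
print(latest_reasonable(sample))
```

The sorted list returned by `received_timestamps()` is what could be kept on `msg_data` if other handlers wanted it.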



Re: [Mailman-Developers] Patch for HyperArch

2016-03-11 Thread Mark Sapiro
On 03/11/2016 10:14 AM, Sebastian Hagedorn wrote:
> 
> As far as I'm concerned, the problem is fixed. Thank you very much for
> all your help!


And thank you for reporting and testing.

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Developers] Patch for HyperArch

2016-03-11 Thread Sebastian Hagedorn

> I have fixed this and pushed rev 1634. The entire patch combining revs
> 1633 and 1634 is attached.


Thanks again. This time it ran through, and all mails with broken date 
headers are archived as expected: the ones with only the "From foo@bar" 
line have the date of the migration from Majordomo to Mailman, all others 
according to their received headers.


As far as I'm concerned, the problem is fixed. Thank you very much for all 
your help!

--
Sebastian Hagedorn - Weyertal 121, Zimmer 2.02
Regionales Rechenzentrum (RRZK)
Universität zu Köln / Cologne University - Tel. +49-221-470-89578

Re: [Mailman-Developers] Patch for HyperArch

2016-03-11 Thread Mark Sapiro
On 03/11/2016 06:39 AM, Sebastian Hagedorn wrote:
> --On 10 March 2016 at 17:00:18 -0800, Mark Sapiro wrote:
> 
>> I have reported this bug and fixed it. The bug is
>>  and the fix is
>> 
...
> 
> Thanks. I applied the patch and tried to re-archive the list in
> question. Now I get this:
> 
> Schreibe Archivzustand in Datei
> /var/lib/mailman/archives/private/linux-users/pipermail.pck
> Traceback (most recent call last):
>   File "./bin/arch", line 201, in
>     main()
>   File "./bin/arch", line 189, in main
>     archiver.processUnixMailbox(fp, start, end)
>   File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 597, in processUnixMailbox
>     a = self._makeArticle(m, self.sequence)
>   File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 688, in _makeArticle
>     mlist=self.maillist)
>   File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 264, in __init__
>     self.__super_init(message, sequence, keepHeaders)
>   File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 187, in __init__
>     self._set_date(message)
>   File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 600, in _set_date
>     self.__super_set_date(message)
>   File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 256, in _set_date
>     message.get('received'), flags=re.S))
>   File "/usr/lib/python2.7/re.py", line 155, in sub
>     return _compile(pattern, flags).sub(repl, string, count)
> TypeError: expected string or buffer
> 
> For the time being, I have reverted the patch.


Thanks for testing and thanks for the report.

I have fixed this and pushed rev 1634. The entire patch combining revs
1633 and 1634 is attached.

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
=== modified file 'Mailman/Archiver/pipermail.py'
--- Mailman/Archiver/pipermail.py   2013-12-14 00:53:13 +
+++ Mailman/Archiver/pipermail.py   2016-03-11 17:30:47 +
@@ -16,6 +16,7 @@
 VERSION = __version__
 CACHESIZE = 100    # Number of slots in the cache
 
+from Mailman import mm_cfg
 from Mailman import Errors
 from Mailman.Mailbox import ArchiverMailbox
 from Mailman.Logging.Syslog import syslog
@@ -230,21 +231,30 @@
         self.body = s.readlines()
 
     def _set_date(self, message):
-        def floatdate(header):
-            missing = []
-            datestr = message.get(header, missing)
-            if datestr is missing:
+        def floatdate(datestr):
+            if not datestr:
                 return None
             date = parsedate_tz(datestr)
             try:
-                return mktime_tz(date)
+                date = mktime_tz(date)
+                if (date < 0 or
+                    date - time.time() >
+                        mm_cfg.ARCHIVER_ALLOWABLE_SANE_DATE_SKEW
+                   ):
+                    return None
+                return date
             except (TypeError, ValueError, OverflowError):
                 return None
-        date = floatdate('date')
-        if date is None:
-            date = floatdate('x-list-received-date')
-        if date is None:
-            # What's left to try?
+        date = floatdate(message.get('date'))
+        if date is None:
+            date = floatdate(message.get('x-list-received-date'))
+        if date is None:
+            date = floatdate(re.sub(r'^.*;\s*', '',
+                             message.get('received', ''), flags=re.S))
+        if date is None:
+            date = floatdate(re.sub(r'From \s*\S+\s+', '',
+                             message.get_unixfrom() or '' ))
+        if date is None:
             date = self._last_article_time + 1
         self._last_article_time = date
         self.date = '%011i' % date
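
The two details of the patch that matter for the reported TypeError can be tried in isolation. The sketch below is Python 3 and uses a made-up Received: value. `received_date_part` is a hypothetical helper name; only the regular expression and the `''` default come from the patch.

```python
import re


def received_date_part(received):
    """Strip everything up to and including the last ';' of a field."""
    # re.S lets '.' match newlines, so folded fields are handled. The
    # greedy '.*' means only the text after the *last* ';' remains.
    return re.sub(r'^.*;\s*', '', received, flags=re.S)


folded = ("from a.example.org by b.example.org with ESMTP id X1;\n"
          "   Mon, 18 Dec 2000 02:34:18 +0100 (MET)")
print(received_date_part(folded))

# A message without Received: fields makes get('received') return None.
# Passing None to re.sub raises TypeError, hence the '' default in the
# patched call message.get('received', '').
try:
    re.sub(r'^.*;\s*', '', None, flags=re.S)
except TypeError as exc:
    print('TypeError:', exc)

# With the '' default the substitution is a no-op and yields '', which
# floatdate() then rejects through its "if not datestr" check.
print(repr(received_date_part('')))
```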


Re: [Mailman-Developers] Patch for HyperArch

2016-03-11 Thread Sebastian Hagedorn

--On 10 March 2016 at 17:00:18 -0800, Mark Sapiro wrote:


> I have reported this bug and fixed it. The bug is
>  and the fix is
> 
>
> The fix looks at message timestamps in the following order:
>
> The Date: header if any
> An X-List-Received-Date header if any
> The last Received: header if any
> The Unix From_ line
>
> The first parseable date which is in the current epoch (>= 1970) and not
> more than mm_cfg.ARCHIVER_ALLOWABLE_SANE_DATE_SKEW (default 15 days) in
> the future is accepted. If none of those produce an acceptable date, the
> current time is used.
>
> This differs from past behavior by the addition of range checks on the
> date and the addition of Received: and Unix From_ date checks.
>
> The Received: header check is important. For an "imported" mbox, the
> From_ separators may reflect when the mbox was exported from its source
> rather than the message date. If the messages have Received: headers,
> the later ones at least tend to have good dates.


Thanks. I applied the patch and tried to re-archive the list in question.
Now I get this:


Schreibe Archivzustand in Datei 
/var/lib/mailman/archives/private/linux-users/pipermail.pck

Traceback (most recent call last):
  File "./bin/arch", line 201, in
    main()
  File "./bin/arch", line 189, in main
    archiver.processUnixMailbox(fp, start, end)
  File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 597, in processUnixMailbox
    a = self._makeArticle(m, self.sequence)
  File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 688, in _makeArticle
    mlist=self.maillist)
  File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 264, in __init__
    self.__super_init(message, sequence, keepHeaders)
  File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 187, in __init__
    self._set_date(message)
  File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 600, in _set_date
    self.__super_set_date(message)
  File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 256, in _set_date
    message.get('received'), flags=re.S))
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

For the time being, I have reverted the patch.
--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

Re: [Mailman-Developers] Patch for HyperArch

2016-03-10 Thread Mark Sapiro
On 03/10/2016 10:10 AM, Mark Sapiro wrote:
> 
> For the actual "fix", my inclination is to modify the _set_date method
> in pipermail.py (this is called from HyperArch.py as
> self.__super_set_date(message) just before it does self.fromdate =
> time.ctime(int(self.date))).
> 
> I would have this check the date and if it's not within say 50 years of
> now, replace the date with something reasonable. My question at this
> point is what's that something reasonable. I think it comes down to a
> choice between the From_ date if that's reasonable or the current date,
> but I don't know which is better.


I have reported this bug and fixed it. The bug is
 and the fix is


The fix looks at message timestamps in the following order:

The Date: header if any
An X-List-Received-Date header if any
The last Received: header if any
The Unix From_ line

The first parseable date which is in the current epoch (>= 1970) and not
more than mm_cfg.ARCHIVER_ALLOWABLE_SANE_DATE_SKEW (default 15 days) in
the future is accepted. If none of those produce an acceptable date, the
current time is used.

This differs from past behavior by the addition of range checks on the
date and the addition of Received: and Unix From_ date checks.

The Received: header check is important. For an "imported" mbox, the
From_ separators may reflect when the mbox was exported from its source
rather than the message date. If the messages have Received: headers,
the later ones at least tend to have good dates.
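
The lookup order and range checks described above can be sketched as a standalone function. This is Python 3, not the actual Mailman code: `pick_date`, `candidate_dates` and the module-level `SANE_SKEW` are hypothetical names, and the caller supplies the fallback value used when no source yields an acceptable date.

```python
import email
import re
import time
from email.utils import mktime_tz, parsedate_tz

SANE_SKEW = 15 * 24 * 3600  # assumed default of 15 days, in seconds


def floatdate(datestr, now):
    """Return an epoch timestamp, or None if unparseable or out of range."""
    if not datestr:
        return None
    parsed = parsedate_tz(datestr)
    if parsed is None:
        return None
    try:
        stamp = mktime_tz(parsed)
    except (TypeError, ValueError, OverflowError):
        return None
    # Reject dates before 1970 or too far in the future.
    if stamp < 0 or stamp - now > SANE_SKEW:
        return None
    return stamp


def candidate_dates(msg):
    """Yield date strings in the order the fix consults them."""
    yield msg.get('date')
    yield msg.get('x-list-received-date')
    # Top-most Received: field, reduced to the part after the last ';'.
    yield re.sub(r'^.*;\s*', '', msg.get('received', ''), flags=re.S)
    # Unix From_ line with the "From sender " prefix removed.
    yield re.sub(r'From \s*\S+\s+', '', msg.get_unixfrom() or '')


def pick_date(msg, fallback, now=None):
    """Return the first acceptable timestamp, else the fallback."""
    now = time.time() if now is None else now
    for datestr in candidate_dates(msg):
        stamp = floatdate(datestr, now)
        if stamp is not None:
            return stamp
    return fallback


# Made-up message modelled on the year-100 Date: header from the thread.
sample = email.message_from_string(
    "From owner@example.org  Mon Nov  7 14:11:54 2005\n"
    "Received: (from daemon@localhost)\n"
    "   by mail.example.org id AAA14125;\n"
    "   Fri, 4 Feb 2000 00:50:08 +0100 (MET)\n"
    "Date: Fri, 4 Feb 100 00:51:42 +0100 (MET)\n"
    "\n"
    "body\n"
)
print(pick_date(sample, fallback=time.time()))
```

For the sample, the Date: header is rejected by the range check and the Received: date is used, so the From_ line is never consulted.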

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Developers] Patch for HyperArch

2016-03-10 Thread Mark Sapiro
On 03/10/2016 03:19 AM, Sebastian Hagedorn wrote:
> 
> Unless you're really interested in the other differences you referred to
> in your other message, I won't bother to analyze them further. It seems
> clear to me that you have identified the main issue.


I understand the issue, and I know how to "fix" it.

I'm a bit uncertain about what to change a bad date to. Normally,
messages in the cumulative .mbox have at least three sources of date.
There is a Date: header, the mbox From_ separator line and, at least if
the message originally came via Mailman, an X-List-Received-Date: header
that was added by Mailman's ArchRunner when the message was archived.

Also, depending to an extent on site configuration, if the message was
originally archived by Mailman, its archived Date: header will normally
be "close" to the time it was received by Mailman. See the code in the
_dispose() method in Mailman/Queue/ArchRunner.py.

So what this says is if a message in the mbox has a bad Date:, it is
probably from an imported mbox, and it's not clear that the From_ date
will be any better.

In the messages and excerpts you posted earlier, the From_ dates were
all within a few minutes of "Mon Nov  7 14:08:46 2005" which is probably
the time that portion of the mbox was built from a majordomo archive.

I have made a script at 
(mirrored at ) which
augments the standard bin/cleanarch script to also replace Date: headers
with the date from From_ if they differ by more than
mm_cfg.ARCHIVER_ALLOWABLE_SANE_DATE_SKEW (default = 15 days).

This may be sufficient. If you run it with the -n option against your
mbox, it will report the line #s of the bad dates, what they are and
what they would be changed to.
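
The comparison that script is described as making can be illustrated with a short sketch. This is not the script itself, which is not shown in the thread: `date_needs_fixing` is a hypothetical name, the code is Python 3, and the From_ date carries no time zone, so it is interpreted as local time here.

```python
import time
from email.utils import mktime_tz, parsedate_tz

SANE_SKEW = 15 * 24 * 3600  # assumed default of 15 days, in seconds


def date_needs_fixing(from_line, date_header, skew=SANE_SKEW):
    """Return True if Date: is unparseable or too far from the From_ date."""
    # The From_ date is everything after "From <sender> ".
    parts = from_line.split(None, 2)
    if len(parts) < 3:
        return False  # no From_ date to compare against
    from_parsed = parsedate_tz(parts[2])
    if from_parsed is None:
        return False
    date_parsed = parsedate_tz(date_header)
    if date_parsed is None:
        return True
    try:
        delta = mktime_tz(date_parsed) - mktime_tz(from_parsed)
    except (TypeError, ValueError, OverflowError):
        return True
    return abs(delta) > skew


FROM_LINE = "From owner@example.org  Mon Nov  7 14:11:54 2005"
print(date_needs_fixing(FROM_LINE, "Fri, 4 Feb 100 00:51:42 +0100 (MET)"))
print(date_needs_fixing(FROM_LINE, "Mon, 7 Nov 2005 14:00:00 +0100"))
```

Note that for an imported mbox whose From_ lines all carry the import time, a check like this flags every message older than the skew, which is the concern raised above about whether the From_ date is any better.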

For the actual "fix", my inclination is to modify the _set_date method
in pipermail.py (this is called from HyperArch.py as
self.__super_set_date(message) just before it does self.fromdate =
time.ctime(int(self.date))).

I would have this check the date and if it's not within say 50 years of
now, replace the date with something reasonable. My question at this
point is what's that something reasonable. I think it comes down to a
choice between the From_ date if that's reasonable or the current date,
but I don't know which is better.

Does anyone have an idea?

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Developers] Patch for HyperArch

2016-03-10 Thread Sebastian Hagedorn

--On 9 March 2016 at 15:15:20 -0800, Mark Sapiro wrote:


>> We played around and found that the error is related to our version of
>> Python. Here's a minimal test script that shows the issue:
>>
>> from email.Utils import parseaddr, parsedate_tz, mktime_tz, formatdate
>> print mktime_tz(parsedate_tz("Fri, 4 Feb 100 00:51:42 +0100 (MET)"));
>>
>> That's the Date header from the single piece of legitimate mail. Python
>> 2.4 throws the same exception you were seeing: "ValueError: year out of
>> range". However, our Python 2.7 (which we use for Mailman) does this:
>>
>> -59008522098
>>
>> When that value is then passed to time.ctime(), you get "ValueError:
>> timestamp out of range for platform time_t". We're on RHEL 5, and our
>> version of Python 2.7 is from the IUSCommunity repo:
>> python27-2.7.10-1.ius.el5. Which version of Python were you using?
>
> Thinking about this a bit more, I think what you say is the crux of the
> difference between yours and mine. In your Python,
> time.ctime(-59008522098) throws the ValueError, and in mine it returns a
> date string which may cause problems later on in the processing.
>
> I think the difference is not with a Python version per se, but rather
> with the underlying C environment and C library 'time' functions that
> Python was compiled with.


That makes sense.


> In any case, I think I now have enough understanding of the issue to
> work up some kind of fix that will work in both your situation and mine.
>
> Note that your original suggested patch won't solve the problem for me
> because my time.ctime(-59008522098) does not throw a ValueError.


Right. Of course I didn't know that at the time :-)

Unless you're really interested in the other differences you referred to in 
your other message, I won't bother to analyze them further. It seems clear 
to me that you have identified the main issue.


Thanks for your help!
--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

Re: [Mailman-Developers] Patch for HyperArch

2016-03-09 Thread Mark Sapiro
> 
> We played around and found that the error is related to our version of
> Python. Here's a minimal test script that shows the issue:
> 
> from email.Utils import parseaddr, parsedate_tz, mktime_tz, formatdate
> print mktime_tz(parsedate_tz("Fri, 4 Feb 100 00:51:42 +0100 (MET)"));
> 
> That's the Date header from the single piece of legitimate mail. Python
> 2.4 throws the same exception you were seeing: "ValueError: year out of
> range". However, our Python 2.7 (which we use for Mailman) does this:
> 
> -59008522098
> 
> When that value is then passed to time.ctime(), you get "ValueError:
> timestamp out of range for platform time_t". We're on RHEL 5, and our
> version of Python 2.7 is from the IUSCommunity repo:
> python27-2.7.10-1.ius.el5. Which version of Python were you using?


Thinking about this a bit more, I think what you say is the crux of the
difference between yours and mine. In your Python,
time.ctime(-59008522098) throws the ValueError, and in mine it returns a
date string which may cause problems later on in the processing.

I think the difference is not with a Python version per se, but rather
with the underlying C environment and C library 'time' functions that
Python was compiled with.

In any case, I think I now have enough understanding of the issue to
work up some kind of fix that will work in both your situation and mine.

Note that your original suggested patch won't solve the problem for me
because my time.ctime(-59008522098) does not throw a ValueError.
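
The platform dependence can be made explicit in a few lines. The sketch below is Python 3, and `safe_ctime` and `plausible` are hypothetical helpers. It shows why catching the exception alone is not a portable fix: on some platforms `time.ctime()` succeeds for the same value, so an explicit range check is needed as well.

```python
import time


def safe_ctime(stamp):
    """Return time.ctime(stamp), or None if the platform rejects it."""
    try:
        return time.ctime(stamp)
    except (ValueError, OverflowError, OSError):
        return None


def plausible(stamp, now=None, skew=15 * 24 * 3600):
    """Platform-independent check: 1970 or later, not far in the future."""
    now = time.time() if now is None else now
    return 0 <= stamp <= now + skew


BAD = -59008522098  # value reported in the thread for the year-100 Date:

# Platform-dependent: either a date string or None.
print(safe_ctime(BAD))

# Platform-independent: always False for this value.
print(plausible(BAD))
```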

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Developers] Patch for HyperArch

2016-03-09 Thread Mark Sapiro
On 03/09/2016 05:26 AM, Sebastian Hagedorn wrote:
> 
> --On 8 March 2016 at 15:43:11 -0800, Mark Sapiro wrote:
> 
>> I'm still having difficulty duplicating what you saw.
>>
>> For the above three messages, the first one just gets detected as an
>> invalid date and archived under the current date by the current bin/arch.
>>
>> The other two throw "ValueError: year out of range" at
>>
>>   File "/var/MM/21/Mailman/Archiver/HyperArch.py", line 984, in
>> dateToVolName
>> return time.strftime("%Y-%B",datetuple)
>>
>> which is a problem, but not the one you saw. I would like to see
>> messages that cause this error
>>
>>   File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 601, in
>> _set_date
>> self.fromdate = time.ctime(int(self.date))
>> ValueError: timestamp out of range for platform time_t
>>
>> that you reported. Can you find those bad messages in the .mbox input
>> file you used and send them to me?
> 
> We played around and found that the error is related to our version of
> Python. Here's a minimal test script that shows the issue:
> 
> from email.Utils import parseaddr, parsedate_tz, mktime_tz, formatdate
> print mktime_tz(parsedate_tz("Fri, 4 Feb 100 00:51:42 +0100 (MET)"));
> 
> That's the Date header from the single piece of legitimate mail. Python
> 2.4 throws the same exception you were seeing: "ValueError: year out of
> range". However, our Python 2.7 (which we use for Mailman) does this:
> 
> -59008522098
> 
> When that value is then passed to time.ctime(), you get "ValueError:
> timestamp out of range for platform time_t". We're on RHEL 5, and our
> version of Python 2.7 is from the IUSCommunity repo:
> python27-2.7.10-1.ius.el5. Which version of Python were you using?


The particular system on which I'm testing is Ubuntu 15.10 and has
python2.7 2.7.10-4ubuntu installed via apt-get. It produces the same

-59008522098

result from your test script, but when I run bin/arch --wipe, I get

Traceback (most recent call last):
  File "../../../bin/arch", line 201, in
    main()
  File "../../../bin/arch", line 189, in main
    archiver.processUnixMailbox(fp, start, end)
  File "/var/MM/21/Mailman/Archiver/pipermail.py", line 586, in processUnixMailbox
    self.add_article(a)
  File "/var/MM/21/Mailman/Archiver/pipermail.py", line 611, in add_article
    archives = self.get_archives(article)
  File "/var/MM/21/Mailman/Archiver/HyperArch.py", line 914, in get_archives
    res = self.dateToVolName(float(article.date))
  File "/var/MM/21/Mailman/Archiver/HyperArch.py", line 984, in dateToVolName
    return time.strftime("%Y-%B",datetuple)
ValueError: year out of range

I notice a few things here. First, your error comes in processing

a = self._makeArticle(m, self.sequence)

called from pipermail.processUnixMailbox

Mine comes from

self.add_article(a)

which is called after _makeArticle has already made the article, so I
don't see an exception in _makeArticle. In fact, after running
_makeArticle, I see

>>> a.fromdate
'Wed Feb  3 15:58:44 100\n'

which is exactly what is returned by time.ctime(-59008522098)

The other curious thing is there are no differences between the 2.1.18
pipermail.py and mine yet your

a = self._makeArticle(m, self.sequence)

is at line 587 in your traceback, and in my pipermail.py it is at line 584.

In any case, the Date: that threw the original ValueError: timestamp out
of range for platform time_t is apparently not the "Fri, 4 Feb 100
00:51:42 +0100 (MET)" one, and I'd still like to see it. If you can find
all four messages in the .mbox file that produced the "Kein Betreff"
("no subject") messages in the current archive, I'd like to see them.

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Re: [Mailman-Developers] Patch for HyperArch

2016-03-08 Thread Mark Sapiro
On 03/08/2016 04:42 AM, Sebastian Hagedorn wrote:
> 
> Here are the culprits. They are now easy to find, because they are all
> new in the archive :-) (which is public, btw:
> )
> 
...
> 
> Then there are a few broken spam messages:
> 
> --
> 
>> From owner-linux-us...@rrz.uni-koeln.de  Mon Nov  7 14:13:18 2005
...
> Date: 18 ãöîáø 2000
> 
>> From owner-linux-us...@rrz.uni-koeln.de  Mon Nov  7 14:08:09 2005
...
> Date: Sun, 30 Jun 0102 17:30:47 -0600
> 
>> From owner-linux-us...@rrz.uni-koeln.de  Mon Nov  7 14:11:54 2005
...
> Date: Fri, 4 Feb 100 00:51:42 +0100 (MET)


I'm still having difficulty duplicating what you saw.

For the above three messages, the first one just gets detected as an
invalid date and archived under the current date by the current bin/arch.

The other two throw "ValueError: year out of range" at

  File "/var/MM/21/Mailman/Archiver/HyperArch.py", line 984, in dateToVolName
    return time.strftime("%Y-%B",datetuple)

which is a problem, but not the one you saw. I would like to see
messages that cause this error

  File "/usr/lib/mailman/Mailman/Archiver/HyperArch.py", line 601, in _set_date
    self.fromdate = time.ctime(int(self.date))
ValueError: timestamp out of range for platform time_t

that you reported. Can you find those bad messages in the .mbox input
file you used and send them to me?

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Re: [Mailman-Developers] Patch for HyperArch

2016-03-08 Thread Sebastian Hagedorn

--On 7 March 2016 at 14:35:47 -0800, Mark Sapiro wrote:


>> If an exception is caught, the date is simply set to the current time.
>
> I understand the patch, but I'm not sure if setting the current time is
> appropriate. In particular, the self.__super_set_date(message) method,
> if it doesn't find a valid date: or x-list-received-date: header in the
> message will set the time to that of the previous article + 1 second.


I suppose that didn't work, because there were multiple such messages in a 
row, if you can even call them that ;-)
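
The "previous article + 1 second" fallback mentioned above is easy to picture with a toy model. `DateAssigner` is a hypothetical class in Python 3, and only the `_last_article_time + 1` step and the `'%011i'` formatting mirror pipermail's `_set_date`. It shows that a run of messages without usable dates still gets distinct, ordered keys.

```python
class DateAssigner:
    """Hand out archive date keys, falling back to previous + 1 second."""

    def __init__(self, start=0):
        self._last_article_time = start

    def assign(self, parsed_date):
        # parsed_date is an epoch timestamp, or None if no usable date.
        date = parsed_date
        if date is None:
            date = self._last_article_time + 1
        self._last_article_time = date
        # Zero-padded so that string order matches numeric order.
        return '%011i' % date


assigner = DateAssigner(start=1131372526)
keys = [assigner.assign(None) for _ in range(3)]
print(keys)
```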



> In any case, I'd like to understand more about why/how the issue occurs.
> To that end, I'd like to see a copy of the offending message from the
> mbox file. Also, I wonder if bypassing the error and setting a date
> which will almost certainly archive the message in the wrong period is
> better than fixing the message in the mbox.


The thing is that initially I had no way of finding out which messages 
caused the problem. That's why I primarily looked for a way to just 
complete the job of rebuilding the archive. I agree that there may be 
better solutions. I could imagine skipping such broken messages, for 
example.


Here are the culprits. They are now easy to find, because they are all new 
in the archive :-) (which is public, btw: 
)


-

From foo@bar  Mon Nov  7 14:08:46 2005

169562

From foo@bar  Mon Nov  7 14:08:46 2005

27203

From foo@bar  Mon Nov  7 14:08:46 2005

108420

From foo@bar  Mon Nov  7 14:08:46 2005

35662

--

Don't ask me how they ended up in that .mbox file in the first place ;-)
I assume they are an artefact from the time when we moved from Majordomo to 
Mailman.


Then there are a few broken spam messages:

--

From owner-linux-us...@rrz.uni-koeln.de  Mon Nov  7 14:13:18 2005
Received: (from daemon@localhost)
   by mail1.rrz.Uni-Koeln.DE (8.9.3/8.9.3) id CAA05687
   for linux-users-out; Mon, 18 Dec 2000 02:34:21 +0100 (MET)
Received: from horizon.barak-online.net (horizon.barak.net.il [206.49.94.218])
   by mail1.rrz.Uni-Koeln.DE (8.9.3/8.9.3) with ESMTP id CAA05681
   for ; Mon, 18 Dec 2000 02:34:18 +0100 (MET)
Received: from rrz.uni-koeln.de (pop09-1-ras1-p146.barak.net.il [212.150.107.146])
   by horizon.barak-online.net (8.9.3/8.9.1) with SMTP id DAA28082
   for linux-us...@rrz.uni-koeln.de; Mon, 18 Dec 2000 03:33:39 +0200 (IST)
Message-Id: <200012180133.daa28...@horizon.barak-online.net>
From: mnadiv
REPLY-TO: mna...@barak-online.net
X-Mailer: EzyMassMailer V2.xx
Date: 18 ãöîáø 2000

From owner-linux-us...@rrz.uni-koeln.de  Mon Nov  7 14:08:09 2005
Received: from mail1.rrz.Uni-Koeln.DE (localhost [127.0.0.1])
   by mail1.rrz.Uni-Koeln.DE (8.12.3/8.12.2) with ESMTP id g5UBbvtD027896
   (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NOT)
   for ; Sun, 30 Jun 2002 13:37:57 +0200 (MEST)
Received: (from daemon@localhost)
   by mail1.rrz.Uni-Koeln.DE (8.12.3/8.12.3/Submit) id g5UBbvU7027895
   for linux-users-out; Sun, 30 Jun 2002 13:37:57 +0200 (MEST)
Received: from yahoo.com ([213.201.170.67])
   by mail1.rrz.Uni-Koeln.DE (8.12.3/8.12.2) with SMTP id g5UBbstC027887
   for ; Sun, 30 Jun 2002 13:37:55 +0200 (MEST)
Received: from [181.21.240.177] by f64.law4.hottestmale.com with asmtp;
   01 Jul 0102 04:46:55 -0900
Received: from 15.9.163.146 ([15.9.163.146]) by m10.grp.snv.yahui.com with QMQP;
   Sun, 30 Jun 0102 19:38:13 -0800
Reply-To: 
Message-ID: <025b58a45d4d$4771c5b1$6dd17ce3@warwgu>
From: 
To: susann...@yahoo.com
Subject: Hello
Date: Sun, 30 Jun 0102 17:30:47 -0600

--

There is a single legitimate message with a broken Date header:

From owner-linux-us...@rrz.uni-koeln.de  Mon Nov  7 14:11:54 2005
Received: (from daemon@localhost)
   by mail1.rrz.Uni-Koeln.DE (8.9.3/8.9.3) id AAA14125
   for linux-users-out; Fri, 4 Feb 2000 00:50:08 +0100 (MET)
Received: from mailhost.informatik.uni-bonn.de (olymp.informatik.uni-bonn.de [131.220.4.1])
   by mail1.rrz.Uni-Koeln.DE (8.9.3/8.9.3) with ESMTP id AAA14116
   for ; Fri, 4 Feb 2000 00:50:06 +0100 (MET)
Received: from zeus.informatik.uni-bonn.de (zeus.informatik.uni-bonn.de [131.220.5.25])
   by mailhost.informatik.uni-bonn.de (Postfix) with ESMTP
   id 15D5562E9; Fri,  4 Feb 2000 00:51:32 +0100 (MET)
Received: (from guertler@localhost)
   by zeus.informatik.uni-bonn.de (8.8.8+Sun/8.8.8) id AAA19251;
   Fri, 4 Feb 2000 00:51:43 +0100 (MET)
From: Michael Guertler 
Message-Id: <22032351.aaa19...@zeus.informatik.uni-bonn.de>
Subject: Re: AVI -> MPEG
To: ocor...@astro.uni-bonn.de (Oliver Cordes)
Date: 
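
The Received: headers in these messages mostly carry usable dates even where Date: does not. A rough sketch of taking the latest plausible one follows. It is not Mailman's code: latest_reasonable_received and its earliest parameter are names assumed for this example, and the sample headers are abbreviated from the ones above.

```python
import time
from email.utils import mktime_tz, parsedate_tz


def latest_reasonable_received(received_values, earliest, now=None):
    """Return the newest Received: timestamp within earliest..now,
    or None if no header yields a plausible date."""
    now = time.time() if now is None else now
    stamps = []
    for value in received_values:
        # The date follows the last ';' in a Received: header.
        _, sep, date_part = value.rpartition(";")
        if not sep:
            continue
        parsed = parsedate_tz(date_part.strip())
        if parsed is None:
            continue
        try:
            stamp = mktime_tz(parsed)
        except (ValueError, OverflowError):
            continue
        if earliest <= stamp <= now:
            stamps.append(stamp)
    return max(stamps) if stamps else None


received = [
    "from [181.21.240.177] by f64.law4.hottestmale.com with asmtp; "
    "01 Jul 0102 04:46:55 -0900",
    "from yahoo.com ([213.201.170.67]) by mail1.rrz.Uni-Koeln.DE; "
    "Sun, 30 Jun 2002 13:37:55 +0200 (MEST)",
    "(from daemon@localhost) by mail1.rrz.Uni-Koeln.DE "
    "for linux-users-out; Sun, 30 Jun 2002 13:37:57 +0200 (MEST)",
]
best = latest_reasonable_received(received, earliest=0)
```

The forged header with year 0102 parses to a timestamp far below the window and is dropped, so the result comes from the relay's own 2002 headers.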

Re: [Mailman-Developers] Patch for HyperArch

2016-03-07 Thread Mark Sapiro
On 03/07/2016 04:00 AM, Sebastian Hagedorn wrote:
> Hi,
> 
> we recently needed to rebuild a rather old list archive. The oldest
> mails are from 2001, and as far as I could tell the last complete
> rebuild happened in 2005. When we ran "arch --wipe" now, it failed:
> 
...
> ValueError: timestamp out of range for platform time_t
> 
> Obviously the mails that caused this error were broken, but a previous
> version of arch was able to build the archive regardless. I wrote the
> following patch to work around the problem (I wrote it for 2.1.18, but I
> checked that the code looks the same in 2.1.21):
> 
> --- /service/HyperArch.py 2014-07-16 13:01:11.0 +0200
> +++ HyperArch.py 2016-03-07 11:25:34.0 +0100
> @@ -598,7 +598,14 @@
> 
>      def _set_date(self, message):
>          self.__super_set_date(message)
> -        self.fromdate = time.ctime(int(self.date))
> +        try:
> +            self.fromdate = time.ctime(int(self.date))
> +        except ValueError:
> +            syslog('error',
> +                   'Archive error. Date %s is invalid.',
> +                   int(self.date))
> +            self.date = str(int(time.time()))
> +            self.fromdate = time.ctime(int(self.date))
> 
>      def loadbody_fromHTML(self,fileobj):
>          self.body = []
> 
> If an exception is caught, the date is simply set to the current time.

I understand the patch, but I'm not sure if setting the current time is
appropriate. In particular, the self.__super_set_date(message) method,
if it doesn't find a valid date: or x-list-received-date: header in the
message, will set the time to that of the previous article + 1 second.
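
As a rough illustration of that alternative (not the actual HyperArch code; archive_timestamp and its arguments are hypothetical names), the same "previous article + 1 second" fallback could be applied when time.ctime() rejects the value. Note that on Python 3 the failure may surface as OverflowError or OSError rather than ValueError, so the sketch catches all of them.

```python
import time


def archive_timestamp(candidate, previous):
    """Return candidate as an int if time.ctime() accepts it,
    otherwise fall back to the previous article's time plus one second."""
    try:
        stamp = int(candidate)
        time.ctime(stamp)  # raises if the platform cannot represent it
    except (TypeError, ValueError, OverflowError, OSError):
        return previous + 1
    return stamp
```

This keeps a broken message next to its neighbours in the archive instead of filing it under the date of the rebuild.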

In any case, I'd like to understand more about why/how the issue occurs.
To that end, I'd like to see a copy of the offending message from the
mbox file. Also, I wonder if bypassing the error and setting a date
which will almost certainly archive the message in the wrong period is
better than fixing the message in the mbox.

-- 
Mark Sapiro                           The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan




Re: [Mailman-Developers] [PATCH] Port HyperArch/pipermail to mimelib

2001-10-12 Thread Barry A. Warsaw


>>>>> "BG" == Ben Gertzfield [EMAIL PROTECTED] writes:

    BG> Here's a port of HyperArch and pipermail to mimelib.  This
    BG> allows proper parsing of multipart messages, and will make
    BG> i18n handling much easier.  This is a big step forward, I
    BG> think, because now we no longer have two very different
    BG> Message classes in Mailman.

I'm still looking at this patch.  I have some qualms about it.  If I
commit this patch, we'll need to further do the mimelib->email
conversion, but that shouldn't be hard.

First...

    BG> This also patches pythonlib/mailbox.py to use mimelib instead
    BG> of rfc822.  This is the last use of rfc822 in Mailman, so we
    BG> can now remove pythonlib/rfc822.py completely from the
    BG> archives -- now we use mimelib entirely!

It also modifies pythonlib/cgi.py to use mimelib.  Neither is a good
idea, because it means our copies get farther out of sync with
Python's and we'll always have to carry around our copies.

The purpose of the Mailman/pythonlib directory is to allow us to defer
requiring newer versions of Python.  Right now, Mailman should work
with Python 2.0, but some of the modules that have been patched since
then have useful stuff we need now.  So I put copies of the latest
standard library files in Mailman/pythonlib as a form of forward
compatibility.  Eventually, I can remove these once I require a version
of Python that has these patches in them.

An example is Cookie.py.  When MM required only Py1.5.2, I had to
provide a Cookie.py, but because Py2.0 has its own Cookie.py, we can
use that and forget about our own copy.  Similarly with cgi.py,
rfc822.py, and others (I do need to do a bit of cleaning up here
though).
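
The rule described here can be sketched roughly as below. This is only an illustration of the idea, not how Mailman wires up its imports: choose_module, the dotted path "Mailman.pythonlib.Cookie", and the version tuples are assumptions made for the example.

```python
import sys


def choose_module(stdlib_name, bundled_name, first_good_version, running=None):
    """Pick the stdlib module once the running Python is new enough,
    otherwise the bundled forward-compatibility copy."""
    running = tuple(sys.version_info[:3]) if running is None else running
    if running >= first_good_version:
        return stdlib_name
    return bundled_name


# Under Python 1.5.2 the bundled copy is needed; from 2.0 on it is not.
old = choose_module("Cookie", "Mailman.pythonlib.Cookie", (2, 0), running=(1, 5, 2))
new = choose_module("Cookie", "Mailman.pythonlib.Cookie", (2, 0), running=(2, 0, 0))
```

Once the minimum supported Python reaches first_good_version, the bundled branch is dead code and the copy can be deleted.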

Fortunately, I think your changes to cgi.py aren't necessary, and we
can accomplish your mailbox.py changes by changing Mailman/Mailbox.py
instead.  We do still need rfc822.py (I think) because the email/mimelib
package in some cases just wraps rfc822.py code instead of
reimplementing or cutting-and-pasting the source.

    BG> This patch depends on the mimelib patch I just sent; it uses
    BG> the get_decoded_payload() function I added to get a nice text
    BG> representation of even a multi-part message.  This will let us
    BG> even display a message for non-text parts of a message, and
    BG> eventually will let HyperArch display attachments inline.  And
    BG> of course, as I mentioned in my previous mail, this will
    BG> prevent base64 gobbledygook from showing up in the archives.

    BG> This patch even deals with multiple text/* attachments to a
    BG> message, and will include them all in the archive even if
    BG> they're base64 or quoted-printable encoded.
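
For readers without the mimelib patch at hand, the behaviour described can be approximated with today's standard email package. The sketch below is not Ben's get_decoded_payload(); collect_text_parts is a hypothetical helper that walks the message and decodes every text/* part, whatever its transfer encoding.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def collect_text_parts(msg):
    """Return the decoded body of every text/* part, in message order."""
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if raw is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            texts.append(raw.decode(charset, "replace"))
        except LookupError:  # unknown charset name
            texts.append(raw.decode("latin-1"))
    return texts


demo = MIMEMultipart()
demo.attach(MIMEText("plain body", "plain", "us-ascii"))
demo.attach(MIMEText("Grüße aus Köln", "plain", "utf-8"))  # sent as base64
parts = collect_text_parts(demo)
```

Both parts come back as readable text, so no base64 ends up in the archive page.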

I think this is a decent patch, and I'm probably going to commit
these, after I rewrite them for the email package.

    BG> It currently does not deal with replacing high-ASCII
    BG> characters with HTML entities in HyperArch; I'm going to deal
    BG> with that next by taking the htmlentitydefs module's hash,
    BG> inverting it, and using that as a big global
    BG> search-and-replace, if the charset is undefined or iso-8859-1.
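
A minimal sketch of that inversion idea, assuming a current Python where htmlentitydefs lives at html.entities. The names CODEPOINT_TO_NAME and entity_escape are made up for this example, and only characters above ASCII are replaced, so existing markup is left alone.

```python
from html.entities import name2codepoint

# Invert the name -> code point table into code point -> name.
CODEPOINT_TO_NAME = {cp: name for name, cp in name2codepoint.items()}


def entity_escape(text):
    """Replace non-ASCII characters that have a named HTML entity."""
    out = []
    for ch in text:
        name = CODEPOINT_TO_NAME.get(ord(ch)) if ord(ch) > 127 else None
        out.append("&%s;" % name if name else ch)
    return "".join(out)
```

Characters without a named entity pass through unchanged, so this only fully covers text that really is iso-8859-1.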

My biggest question here is why you took most of the code out of
Article._get_body() in HyperArch.py.  IIRC, Jeremy added all this
stuff so that charset handling would be saner.  The idea is that if
there is a single charset for the message, that would be the charset
used for the web archive page.  But if the page had multiple charsets,
then it would pick the most common one.  AFAIK, there's no way to
represent multiple charsets in a single HTML page.  An example of the
latter is an index page for a list that has Subject: fields with many
different charsets.  Which one do you pick?
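
The "most common charset" rule described above might look like the sketch below. It is not the code Jeremy added to Article._get_body(); pick_page_charset and its default are assumptions for illustration.

```python
from collections import Counter


def pick_page_charset(charsets, default="us-ascii"):
    """Return the charset that occurs most often, ignoring missing
    values and case; fall back to default if none is declared."""
    counts = Counter(c.lower() for c in charsets if c)
    if not counts:
        return default
    return counts.most_common(1)[0][0]
```

For an index page, the input would be the charsets of all Subject: fields on that page; text in a losing charset still has to be converted or escaped somehow.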

In your patch, it seems like everything comes out iso-8859-1, and that
doesn't seem right.

-Barry
