Re: Javamail address parsing (again).

2006-01-13 Thread Rick McGuire

Dain Sundstrom wrote:


On Jan 12, 2006, at 3:24 PM, Rick McGuire wrote:


Dain Sundstrom wrote:

On Jan 11, 2006, at 1:17 PM, Bruce Snyder wrote:


Is it possible to look at the Sun implementation's source code to
distinguish enforced vs. ignored rules?


That would make the code not clean room.

I propose we ask Sun for a formal definition of the parser for this 
class, and in a parallel track make an effort to try to match their 
bugs.  The code from the second track doesn't have to be perfect, 
but just good enough.  We simply let our users know that our goal 
with the "implement.sun.javamail.bugs=true" code is to emulate the 
sun bugs, and if they find something that produces different results 
for the same text, we consider it a bug.

I'm becoming less and less convinced this is a good idea.  So far, 
I've found many, many sun bugs in this code where they produce 
results that are in conflict with RFC822.  The API documentation 
refers to relaxed parsing rules, which says to me there are addresses 
that would not be valid under RFC822, but javamail will accept them 
based on the type of parsing requested.  I can accept that.


However, the great majority of the problems I've found have 
involved internet addresses that RFC822 says ARE valid, but the 
javamail code does not handle them properly.  And there are a few 
situations where it appears the authors just chose to punt and say 
"yeah, whatever".

It appears that the solution is either to write hacked code that 
mostly, sorta, kinda does what it claims to do, or to write a good 
parser and then triple the size of the code trying to get all of the 
Sun bugs to work properly.

Working strictly from the RFC822 spec, I had a fairly nice parser 
written that gave very good RFC822 compliance, but things turned 
nightmarish when I discovered the sorts of Sun behaviors I had to 
insert back in.  I think I've completely rewritten this code about 5 
times now, and am getting pretty close to the Sun "relaxed rules".  
Inserting some of the real bugs back into the parsing might pose 
similar problems.

It really appears that this code somewhat "lost its way somewhere".  
It's serving two purposes that are really at odds with each other.  
The first purpose is to parse any internet address that might appear 
in a received message.  For that purpose, the code needs to accept 
any valid internet address as defined by RFC822.  The Sun code does 
not currently do that, and making the new version "bug compatible" 
would also not achieve that.


The other purpose of the InternetAddress parser is to process email 
addresses entered into applications and perform some validation on 
the addresses.  This is where the "relaxed rules" come into play, 
and basically allow internet addresses that are not strictly RFC822 
compatible to pass.  Now for those, I'm relatively comfortable that 
this can be made compatible.  It is very difficult though, when the 
requirement becomes one of being both more and less restrictive at 
the same time, with no good definition of what rules are being used.



Ok, how about we say, in "sun bug mode" we will parse all addresses 
that are valid RFC822 addresses or are successfully parsed by sun's 
javamail implementation?  This means that valid RFC822 addresses that 
sun's implementation rejects will be accepted by ours.  We would 
further consider it a bug to reject addresses accepted by sun's 
implementation when in "sun bug mode".

That sounds like a more reasonable goal.



-dain





Re: Javamail address parsing (again).

2006-01-12 Thread Dain Sundstrom


On Jan 12, 2006, at 3:24 PM, Rick McGuire wrote:


Dain Sundstrom wrote:

On Jan 11, 2006, at 1:17 PM, Bruce Snyder wrote:


Is it possible to look at the Sun implementation's source code to
distinguish enforced vs. ignored rules?


That would make the code not clean room.

I propose we ask Sun for a formal definition of the parser for  
this class, and in a parallel track make an effort to try to match  
their bugs.  The code from the second track doesn't have to be  
perfect, but just good enough.  We simply let our users know that  
our goal with the "implement.sun.javamail.bugs=true" code is to  
emulate the sun bugs, and if they find something that produces  
different results for the same text, we consider it a bug.

I'm becoming less and less convinced this is a good idea.  So far, 
I've found many, many sun bugs in this code where they produce 
results that are in conflict with RFC822.  The API 
documentation refers to relaxed parsing rules, which says to me there 
are addresses that would not be valid under RFC822, but javamail 
will accept them based on the type of parsing requested.  I can 
accept that.


However, the great majority of the problems I've found have 
involved internet addresses that RFC822 says ARE valid, but 
the javamail code does not handle them properly.  And there are a 
few situations where it appears the authors just chose to punt and 
say "yeah, whatever".

It appears that the solution is either to write hacked code that 
mostly, sorta, kinda does what it claims to do, or to write a good 
parser and then triple the size of the code trying to get all of the 
Sun bugs to work properly.

Working strictly from the RFC822 spec, I had a fairly nice parser 
written that gave very good RFC822 compliance, but things turned 
nightmarish when I discovered the sorts of Sun behaviors I had to 
insert back in.  I think I've completely rewritten this code about 
5 times now, and am getting pretty close to the Sun "relaxed 
rules".  Inserting some of the real bugs back into the parsing 
might pose similar problems.

It really appears that this code somewhat "lost its way 
somewhere".  It's serving two purposes that are really at odds with 
each other.  The first purpose is to parse any internet address 
that might appear in a received message.  For that purpose, the 
code needs to accept any valid internet address as defined by 
RFC822.  The Sun code does not currently do that, and making the 
new version "bug compatible" would also not achieve that.


The other purpose of the InternetAddress parser is to process email 
addresses entered into applications and perform some validation on 
the addresses.  This is where the "relaxed rules" come into play, 
and basically allow internet addresses that are not strictly 
RFC822 compatible to pass.  Now for those, I'm relatively 
comfortable that this can be made compatible.  It is very difficult 
though, when the requirement becomes one of being both more and 
less restrictive at the same time, with no good definition of 
what rules are being used.



Ok, how about we say, in "sun bug mode" we will parse all addresses 
that are valid RFC822 addresses or are successfully parsed by sun's 
javamail implementation?  This means that valid RFC822 addresses that 
sun's implementation rejects will be accepted by ours.  We would 
further consider it a bug to reject addresses accepted by sun's 
implementation when in "sun bug mode".
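
To make that acceptance rule concrete, here is a minimal sketch of it as the union of two checks. The class name and both predicates are hypothetical placeholders, not anything from the Sun code or from an existing implementation; the only thing taken from the proposal is the "valid RFC822 OR accepted by Sun" rule itself.

```java
import java.util.function.Predicate;

// Hypothetical sketch: the class and parameter names are illustrative only.
final class SunBugModeAcceptance {
    private SunBugModeAcceptance() {}

    /**
     * Builds the "sun bug mode" acceptance rule: an address passes when it is
     * valid under RFC822 OR when Sun's implementation would accept it.
     * Both predicates are stand-ins supplied by the caller.
     */
    static Predicate<String> accepts(Predicate<String> validRfc822,
                                     Predicate<String> acceptedBySun) {
        return validRfc822.or(acceptedBySun);
    }
}
```

Under this rule, rejecting anything that either predicate accepts would count as a bug, which is a much easier target to test against than exact bug-for-bug behavior.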


-dain


Re: Javamail address parsing (again).

2006-01-12 Thread Rick McGuire

Dain Sundstrom wrote:

On Jan 11, 2006, at 1:17 PM, Bruce Snyder wrote:


Is it possible to look at the Sun implementation's source code to
distinguish enforced vs. ignored rules?


That would make the code not clean room.

I propose we ask Sun for a formal definition of the parser for this 
class, and in a parallel track make an effort to try to match their 
bugs.  The code from the second track doesn't have to be perfect, but 
just good enough.  We simply let our users know that our goal with the 
"implement.sun.javamail.bugs=true" code is to emulate the sun bugs, 
and if they find something that produces different results for the 
same text, we consider it a bug.

I'm becoming less and less convinced this is a good idea.  So far, I've 
found many, many sun bugs in this code where they produce results that 
are in conflict with RFC822.  The API documentation refers to relaxed 
parsing rules, which says to me there are addresses that would not be 
valid under RFC822, but javamail will accept them based on the type of 
parsing requested.  I can accept that.


However, the great majority of the problems I've found have 
involved internet addresses that RFC822 says ARE valid, but the 
javamail code does not handle them properly.  And there are a few 
situations where it appears the authors just chose to punt and say 
"yeah, whatever". 

It appears that the solution is either to write hacked code that mostly, 
sorta, kinda does what it claims to do, or to write a good parser and 
then triple the size of the code trying to get all of the Sun bugs to 
work properly. 

Working strictly from the RFC822 spec, I had a fairly nice parser 
written that gave very good RFC822 compliance, but things turned 
nightmarish when I discovered the sorts of Sun behaviors I had to insert 
back in.  I think I've completely rewritten this code about 5 times now, 
and am getting pretty close to the Sun "relaxed rules".  Inserting some 
of the real bugs back into the parsing might pose similar problems. 

It really appears that this code somewhat "lost its way somewhere".  
It's serving two purposes that are really at odds with each other.  The 
first purpose is to parse any internet address that might appear in a 
received message.  For that purpose, the code needs to accept any valid 
internet address as defined by RFC822.  The Sun code does not currently 
do that, and making the new version "bug compatible" would also not 
achieve that.


The other purpose of the InternetAddress parser is to process email 
addresses entered into applications and perform some validation on the 
addresses.  This is where the "relaxed rules" come into play, and 
basically allow internet addresses that are not strictly RFC822 
compatible to pass.  Now for those, I'm relatively comfortable that this 
can be made compatible.  It is very difficult though, when the 
requirement becomes one of being both more and less restrictive at the 
same time, with no good definition of what rules are being used.




-dain





Re: Javamail address parsing (again).

2006-01-12 Thread Dain Sundstrom

On Jan 11, 2006, at 1:17 PM, Bruce Snyder wrote:


Is it possible to look at the Sun implementation's source code to
distinguish enforced vs. ignored rules?


That would make the code not clean room.

I propose we ask Sun for a formal definition of the parser for this  
class, and in a parallel track make an effort to try to match their  
bugs.  The code from the second track doesn't have to be perfect, but  
just good enough.  We simply let our users know that our goal with  
the "implement.sun.javamail.bugs=true" code is to emulate the sun  
bugs, and if they find something that produces different results for  
the same text, we consider it a bug.
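
As a rough illustration of the switch, the flag could be read as an ordinary JVM system property. Only the property name comes from the proposal above; the SunBugMode class is a hypothetical helper, and a real implementation might just as well read the setting from session properties instead.

```java
// Hypothetical helper: only the property name is taken from this thread.
final class SunBugMode {
    /** Property name proposed in this thread. */
    static final String PROPERTY = "implement.sun.javamail.bugs";

    private SunBugMode() {}

    /**
     * Returns true only when the system property is set to "true"
     * (case-insensitive); an unset property means the mode is off.
     */
    static boolean isEnabled() {
        return Boolean.getBoolean(PROPERTY);
    }
}
```

Started with -Dimplement.sun.javamail.bugs=true, the parser would take the emulation path; otherwise it would stay on the plain RFC822 path.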


-dain


Re: Javamail address parsing (again).

2006-01-11 Thread Bruce Snyder
On 1/11/06, Rick McGuire <[EMAIL PROTECTED]> wrote:
> This is starting to drive me nuts.  Writing an address parsing method
> that conforms to RFC822 is fairly easy.  Writing one that conforms to
> the javamail spec seems to be a hopeless task.  This is the complete API
> spec for the InternetAddress.parseHeader() method:
>
> Parse the given sequence of addresses into InternetAddress objects.
> If |strict| is false, the full syntax rules for individual addresses
> are not enforced. If |strict| is true, many (but not all) of the
> RFC822 syntax rules are enforced.
>
> To better support the range of "invalid" addresses seen in real
> messages, this method enforces fewer syntax rules than the |parse|
> method when the strict flag is false and enforces more rules when
> the strict flag is true. If the strict flag is false and the parse
> is successful in separating out an email address or addresses, the
> syntax of the addresses themselves is not checked.
>
> There is absolutely no definition I can find of:
>
> * What syntax rules are not enforced if strict is false.
> * What syntax rules are not enforced if strict is true.
> * What is the difference in syntax rule enforcement between
>   parseHeader() and parse().  parse() seems to use a rule set that lies
>   between parseHeader() with strict false and parseHeader() with
>   strict true.
> * What does it mean to be "successful in separating out an email
>   address or addresses" without checking the syntax?  How do you
>   recognize it as an email address without having syntax rules?
>
> There don't appear to be any other sources of information available out
> there that further define this behavior.  I've been running lots of
> little test cases against the Sun version to try to figure out the
> rules, and frankly, the results have been pretty random.  The Sun
> version both allows forms that RFC822 says are invalid and rejects forms
> that RFC822 explicitly says are valid (which does not sound like a
> relaxed rule to me).  Rather tough to distinguish between bugs and
> intentional behavior.
>
> Any suggestions on additional information sources on this or suggestions
> on how to decide which behaviors to support?

Is it possible to look at the Sun implementation's source code to
distinguish enforced vs. ignored rules?

Bruce
--
Apache Geronimo (http://geronimo.apache.org/)

Castor (http://castor.org/)


Javamail address parsing (again).

2006-01-11 Thread Rick McGuire
This is starting to drive me nuts.  Writing an address parsing method 
that conforms to RFC822 is fairly easy.  Writing one that conforms to 
the javamail spec seems to be a hopeless task.  This is the complete API 
spec for the InternetAddress.parseHeader() method:


   Parse the given sequence of addresses into InternetAddress objects.
   If |strict| is false, the full syntax rules for individual addresses
   are not enforced. If |strict| is true, many (but not all) of the
   RFC822 syntax rules are enforced.

   To better support the range of "invalid" addresses seen in real
   messages, this method enforces fewer syntax rules than the |parse|
   method when the strict flag is false and enforces more rules when
   the strict flag is true. If the strict flag is false and the parse
   is successful in separating out an email address or addresses, the
   syntax of the addresses themselves is not checked.

There is absolutely no definition I can find of:

   * What syntax rules are not enforced if strict is false.
   * What syntax rules are not enforced if strict is true.
   * What is the difference in syntax rule enforcement between
     parseHeader() and parse().  parse() seems to use a rule set that lies
     between parseHeader() with strict false and parseHeader() with
     strict true.
   * What does it mean to be "successful in separating out an email
     address or addresses" without checking the syntax?  How do you
     recognize it as an email address without having syntax rules?

There don't appear to be any other sources of information available out 
there that further define this behavior.  I've been running lots of 
little test cases against the Sun version to try to figure out the 
rules, and frankly, the results have been pretty random.  The Sun 
version both allows forms that RFC822 says are invalid and rejects forms 
that RFC822 explicitly says are valid (which does not sound like a 
relaxed rule to me).  Rather tough to distinguish between bugs and 
intentional behavior.
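
For what it's worth, the kind of probing described here can be organized as a small differential check. This is only a sketch under assumptions: AddressProbe and its parameters are hypothetical names, and each implementation is reduced to an "accepts this text" predicate rather than a real parser call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical harness: names and structure are illustrative only.
final class AddressProbe {
    private AddressProbe() {}

    /**
     * Returns the inputs, in their original order, on which the two
     * implementations disagree about acceptance.
     */
    static List<String> disagreements(List<String> inputs,
                                      Predicate<String> reference,
                                      Predicate<String> candidate) {
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            // A disagreement is any input that one side accepts and the other rejects.
            if (reference.test(input) != candidate.test(input)) {
                result.add(input);
            }
        }
        return result;
    }
}
```

In practice each predicate would presumably wrap a parse call and report whether it threw an exception. That wiring is left out here because it needs a mail implementation on the classpath.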


Any suggestions on additional information sources on this or suggestions 
on how to decide which behaviors to support?


Rick