[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-25 Thread Aaron Ecay
Presently, the code which finds the parent of a message as it is being
added to the database assumes that the first Message-ID-like substring
of the In-Reply-To header is the parent Message ID.  Some mail clients,
however, put stuff other than the Message-ID of the parent in the
In-Reply-To header, such as the email address of the sender of the
parent.  This can fool notmuch.

The updated algorithm prefers the last Message ID in the References
header.  The References header lists messages oldest-first, so the last
Message ID is the parent (RFC2822, p. 24).  The References header is
also less likely to be in a non-standard
syntax (http://cr.yp.to/immhf/thread.html,
http://www.jwz.org/doc/threading.html).  In case the References header
is not to be found, fall back to the old behavior.
---

I especially notice this problem on public mailing lists, where
certain people's messages always cause an "out-dent" of the threading,
instead of being nested under whichever message they are replies to.

Technically, putting non-Message-ID crud in the In-Reply-To field is a
violation of RFC2822, but it appears that in practice the References
header is respected more often than the In-Reply-To one.

 lib/database.cc | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 91d4329..cbf33ae 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
const char **next)
  * 'message_id' in the result (to avoid mass confusion when a single
  * message references itself cyclically---and yes, mail messages are
  * not infrequent in the wild that do this---don't ask me why).
+ *
+ * Return the last reference parsed.
 */
-static void
+static char *
 parse_references (void *ctx,
  const char *message_id,
  GHashTable *hash,
@@ -511,7 +513,7 @@ parse_references (void *ctx,
 char *ref;

 if (refs == NULL || *refs == '\0')
-   return;
+   return NULL;

 while (*refs) {
ref = _parse_message_id (ctx, refs, &refs);
@@ -519,6 +521,8 @@ parse_references (void *ctx,
if (ref && strcmp (ref, message_id))
g_hash_table_insert (hash, ref, NULL);
 }
+
+return ref;
 }

 notmuch_status_t
@@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
*notmuch)
 notmuch->last_doc_id++;

 if (notmuch->last_doc_id == 0)
-   INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
+   INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");

 return notmuch->last_doc_id;
 }
@@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
(notmuch_database_t *notmuch,
   const char **thread_id)
 {
 GHashTable *parents = NULL;
-const char *refs, *in_reply_to, *in_reply_to_message_id;
+const char *refs, *in_reply_to, *in_reply_to_message_id, 
*last_ref_message_id;
 GList *l, *keys = NULL;
 notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;

@@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
(notmuch_database_t *notmuch,
 _my_talloc_free_for_g_hash, NULL);

 refs = notmuch_message_file_get_header (message_file, "references");
-parse_references (message, notmuch_message_get_message_id (message),
- parents, refs);
+last_ref_message_id = parse_references (message,
+   notmuch_message_get_message_id 
(message),
+   parents, refs);

 in_reply_to = notmuch_message_file_get_header (message_file, 
"in-reply-to");
 parse_references (message, notmuch_message_get_message_id (message),
  parents, in_reply_to);

-/* Carefully avoid adding any self-referential in-reply-to term. */
 in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
+/* If the parent message ID from the Reply-To and References
+ * headers are different, use the References one.  This is because
+ * the Reply-To header is more likely to be in an non-standard
+ * format. */
+if (in_reply_to_message_id &&
+   last_ref_message_id &&
+   strcmp (last_ref_message_id, in_reply_to_message_id)) {
+   in_reply_to_message_id = last_ref_message_id;
+}
+/* Carefully avoid adding any self-referential in-reply-to term. */
 if (in_reply_to_message_id &&
strcmp (in_reply_to_message_id,
notmuch_message_get_message_id (message)))
 {
_notmuch_message_add_term (message, "replyto",
-_parse_message_id (message, in_reply_to, NULL));
+in_reply_to_message_id);
 }

 keys = g_hash_table_get_keys (parents);
--
1.8.1.4


[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-26 Thread Jani Nikula
On Tue, 26 Feb 2013, Aaron Ecay  wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.

Hi Aaron, please provide references to a few messages like this. If
available on the notmuch list an id: reference would be best, but
otherwise some archive that allows viewing full message headers or
downloading the full message would be best.

Thanks,
Jani.

>
> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
>  lib/database.cc | 30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
> const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
>  */
> -static void
> +static char *
>  parse_references (void *ctx,
> const char *message_id,
> GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>  char *ref;
>
>  if (refs == NULL || *refs == '\0')
> - return;
> + return NULL;
>
>  while (*refs) {
>   ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>   if (ref && strcmp (ref, message_id))
>   g_hash_table_insert (hash, ref, NULL);
>  }
> +
> +return ref;
>  }
>
>  notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
> *notmuch)
>  notmuch->last_doc_id++;
>
>  if (notmuch->last_doc_id == 0)
> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
>  return notmuch->last_doc_id;
>  }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>  const char **thread_id)
>  {
>  GHashTable *parents = NULL;
> -const char *refs, *in_reply_to, *in_reply_to_message_id;
> +const char *refs, *in_reply_to, *in_reply_to_message_id, 
> *last_ref_message_id;
>  GList *l, *keys = NULL;
>  notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>_my_talloc_free_for_g_hash, NULL);
>
>  refs = notmuch_message_file_get_header (message_file, "references");
> -parse_references (message, notmuch_message_get_message_id (message),
> -   parents, refs);
> +last_ref_message_id = parse_references (message,
> + notmuch_message_get_message_id 
> (message),
> + parents, refs);
>
>  in_reply_to = notmuch_message_file_get_header (message_file, 
> "in-reply-to");
>  parse_references (message, notmuch_message_get_message_id (message),
> parents, in_reply_to);
>
> -/* Carefully avoid adding any self-referential in-reply-to term. */
>  in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
> +/* If the parent message ID from the Reply-To and References
> + * headers are different, use the References one.  This is because
> + * the Reply-To header is more likely to be in an non-standard
> + * format. */
> +if (in_reply_to_message_id &&
> + last_ref_message_id &&
> + strcmp (last_ref_message_id, in_reply_to_message_id)) {
> + in_reply_to_message_id = last_ref_message_id;
> +}
> +/* Carefully avoid adding any self-referential in-reply-to term. */
>  if (in_reply_to_message_id &&
>   strcmp (i

[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-26 Thread Aaron Ecay
2013ko otsailak 26an, Jani Nikula-ek idatzi zuen:
> Hi Aaron, please provide references to a few messages like this. If
> available on the notmuch list an id: reference would be best, but
> otherwise some archive that allows viewing full message headers or
> downloading the full message would be best.

Here?s a message from the notmuch list that has passed through gmane,
which inserts References but not In-Reply-To headers:
id:877h0sa207.fsf at fester.com .  Here?s a link to one with a borked
In-Reply-To header:
http://article.gmane.org/gmane.emacs.orgmode/66616/raw

-- 
Aaron Ecay


[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-28 Thread Jani Nikula

Hi Aaron -

On Tue, 26 Feb 2013, Aaron Ecay  wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.

I think the background is that RFC 822 defines In-Reply-To (and
References too for that matter) as *(phrase / msg-id), while RFC 2822
defines them as 1*msg-id. I'd like something about RFC 822 being
mentioned in the commit message.

The problem in the gmane message you link to in
id:87liaa3luc.fsf at gmail.com is likely related to the FAQ item 05.26 "How
do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
http://www.newt.com/faq/mh.html.

> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
>  lib/database.cc | 30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
> const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
>  */
> -static void
> +static char *
>  parse_references (void *ctx,
> const char *message_id,
> GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>  char *ref;
>
>  if (refs == NULL || *refs == '\0')
> - return;
> + return NULL;
>
>  while (*refs) {
>   ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>   if (ref && strcmp (ref, message_id))
>   g_hash_table_insert (hash, ref, NULL);
>  }
> +
> +return ref;

As the comment for the function says, we explicitly avoid including
self-references. I think I'd err on the safe side and return NULL if the
last ref equals message-id.

>  }
>
>  notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
> *notmuch)
>  notmuch->last_doc_id++;
>
>  if (notmuch->last_doc_id == 0)
> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");

I don't know how you got this non-change hunk here, but please remove
it. :)

>
>  return notmuch->last_doc_id;
>  }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>  const char **thread_id)
>  {
>  GHashTable *parents = NULL;
> -const char *refs, *in_reply_to, *in_reply_to_message_id;
> +const char *refs, *in_reply_to, *in_reply_to_message_id, 
> *last_ref_message_id;
>  GList *l, *keys = NULL;
>  notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>_my_talloc_free_for_g_hash, NULL);
>
>  refs = notmuch_message_file_get_header (message_file, "references");
> -parse_references (message, notmuch_message_get_message_id (message),
> -   parents, refs);
> +last_ref_message_id = parse_references (message,
> + notmuch_message_get_message_id 
> (message),
> + parents, refs);
>
>  in_reply_to = notmuch_message_file_get_header (message_file, 
> "in-reply-to");
>  parse_references (message, notmuch_message_get_message_id (message),
> parents, in_reply_to);
>
> -/* Carefully avoid adding any self-referential in-reply-to term. */
>  in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);

I wonder if you should reuse your parse_references() change here, so
you'd set in_reply_to_m

[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-01 Thread Jani Nikula
On Thu, 28 Feb 2013, Jani Nikula  wrote:
> Hi Aaron -
>
> On Tue, 26 Feb 2013, Aaron Ecay  wrote:
>> Presently, the code which finds the parent of a message as it is being
>> added to the database assumes that the first Message-ID-like substring
>> of the In-Reply-To header is the parent Message ID.  Some mail clients,
>> however, put stuff other than the Message-ID of the parent in the
>> In-Reply-To header, such as the email address of the sender of the
>> parent.  This can fool notmuch.
>
> I think the background is that RFC 822 defines In-Reply-To (and
> References too for that matter) as *(phrase / msg-id), while RFC 2822
> defines them as 1*msg-id. I'd like something about RFC 822 being
> mentioned in the commit message.
>
> The problem in the gmane message you link to in
> id:87liaa3luc.fsf at gmail.com is likely related to the FAQ item 05.26 "How
> do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
> http://www.newt.com/faq/mh.html.
>
>> The updated algorithm prefers the last Message ID in the References
>> header.  The References header lists messages oldest-first, so the last
>> Message ID is the parent (RFC2822, p. 24).  The References header is
>> also less likely to be in a non-standard
>> syntax (http://cr.yp.to/immhf/thread.html,
>> http://www.jwz.org/doc/threading.html).  In case the References header
>> is not to be found, fall back to the old behavior.
>> ---
>>
>> I especially notice this problem on public mailing lists, where
>> certain people's messages always cause an "out-dent" of the threading,
>> instead of being nested under whichever message they are replies to.
>>
>> Technically, putting non-Message-ID crud in the In-Reply-To field is a
>> violation of RFC2822, but it appears that in practice the References
>> header is respected more often than the In-Reply-To one.
>>
>>  lib/database.cc | 30 ++
>>  1 file changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/lib/database.cc b/lib/database.cc
>> index 91d4329..cbf33ae 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
>> const char **next)
>>   * 'message_id' in the result (to avoid mass confusion when a single
>>   * message references itself cyclically---and yes, mail messages are
>>   * not infrequent in the wild that do this---don't ask me why).
>> + *
>> + * Return the last reference parsed.
>>  */
>> -static void
>> +static char *
>>  parse_references (void *ctx,
>>const char *message_id,
>>GHashTable *hash,
>> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>>  char *ref;
>>
>>  if (refs == NULL || *refs == '\0')
>> -return;
>> +return NULL;
>>
>>  while (*refs) {
>>  ref = _parse_message_id (ctx, refs, &refs);
>> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>>  if (ref && strcmp (ref, message_id))
>>  g_hash_table_insert (hash, ref, NULL);
>>  }
>> +
>> +return ref;
>
> As the comment for the function says, we explicitly avoid including
> self-references. I think I'd err on the safe side and return NULL if the
> last ref equals message-id.
>
>>  }
>>
>>  notmuch_status_t
>> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
>> *notmuch)
>>  notmuch->last_doc_id++;
>>
>>  if (notmuch->last_doc_id == 0)
>> -INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>> +INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
> I don't know how you got this non-change hunk here, but please remove
> it. :)
>
>>
>>  return notmuch->last_doc_id;
>>  }
>> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
>> (notmuch_database_t *notmuch,
>> const char **thread_id)
>>  {
>>  GHashTable *parents = NULL;
>> -const char *refs, *in_reply_to, *in_reply_to_message_id;
>> +const char *refs, *in_reply_to, *in_reply_to_message_id, 
>> *last_ref_message_id;
>>  GList *l, *keys = NULL;
>>  notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>>
>> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
>> (notmuch_database_t *notmuch,
>>   _my_talloc_free_for_g_hash, NULL);
>>
>>  refs = notmuch_message_file_get_header (message_file, "references");
>> -parse_references (message, notmuch_message_get_message_id (message),
>> -  parents, refs);
>> +last_ref_message_id = parse_references (message,
>> +notmuch_message_get_message_id 
>> (message),
>> +parents, refs);
>>
>>  in_reply_to = notmuch_message_file_get_header (message_file, 
>> "in-reply-to");
>>  parse_references (message, notmuch_message_get_message_id (message),
>>parents, in_reply_to);
>>
>> -/* Carefully avoid adding any self-referential in-reply-to term. */
>>  in

[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread Aaron Ecay
Hi Jani,

Thanks to you and Austin for the comments.

2013ko martxoak 1an, Jani Nikula-ek idatzi zuen:
>> I think the background is that RFC 822 defines In-Reply-To (and
>> References too for that matter) as *(phrase / msg-id), while RFC 2822
>> defines them as 1*msg-id. I'd like something about RFC 822 being
>> mentioned in the commit message.
>> 
>> The problem in the gmane message you link to in
>> id:87liaa3luc.fsf at gmail.com is likely related to the FAQ item 05.26
>> "How do I fix a bogus In-Reply-To or missing References field?" in
>> the MH FAQ http://www.newt.com/faq/mh.html.

Likely yes.  But I think notmuch should handle these messages, since
they are seen in the wild (and I don?t think you disagree with me on
this point?)


>> 
>> As the comment for the function says, we explicitly avoid including
>> self-references. I think I'd err on the safe side and return NULL if
>> the last ref equals message-id.

Done.

>> 
>> I don't know how you got this non-change hunk here, but please remove
>> it. :)

That?s what I get for setting my editor to delete trailing whitespace on
save (then not reading outgoing patches carefully).  Fixed.

>> I wonder if you should reuse your parse_references() change here, so
>> you'd set in_reply_to_message_id to the last message-id in
>> In-Reply-To. This might tackle some of the problematic cases
>> directly, but should still be all right per RFC 2822. I didn't verify
>> how the parser handles an RFC 2822 violating free form header though.
> 
> Strike that based on http://www.jwz.org/doc/threading.html:
> 
> "If there are multiple things in In-Reply-To that look like
> Message-IDs, only use the first one of them: odds are that the later
> ones are actually email addresses, not IDs."

Hmm.  I think it?s a toss-up which of multiple quasi-message-ids is the
real one.  In the email message example I linked upthread, it was the
last one that was real.  I decided to use the last one, because it
allows the self-reference checking to be pushed entirely into
parse_references.  If you feel strongly that we should use the first
one, I can change it back.

> I talked to Austin (CC) about the patch on IRC, and his comment was,
> perceptive as always:
> 
>  23:38 amdragon Is the logic in that patch equivalent to always using
> the last message ID in references unless there is no references
> header?  Seems like it is, but in a convoluted way.
> 
> And that's actually the case, isn't it? To make the code reflect that,
> you should use last_ref_message_id, and if that's NULL, fallback to
> in_reply_to_message_id.

Yes.  Fixed.

> 
>> I suggest adding an else if branch (or revamp the above if condition)
>> to tackle the missing In-Reply-To header:
>> 
>> else if (!in_reply_to_message_id && last_ref_message_id) {
>> in_reply_to_message_id = last_ref_message_id; }
> 
> Strike that, it should be the other way round.

Now that the self-reference check is in parse_references, the
conditional is much simpler.

One additional change I made in this version was to factor out 3 calls
to ?notmuch_message_get_message_id (message)? into a variable inside the
_notmuch_database_link_message_to_parents function, for a small boost to
readability (and perhaps speed, depending on how clever the compiler is
I guess).

I also added tests ? those are the first of two patches that will follow
this email, the second being the code to make them pass.

-- 
Aaron Ecay


[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread Aaron Ecay
git send-email is mad about lines >998 characters in the test patch, so
I?m sending the patches as attachments to this email.  (Is there a
better way to include the expected output of a notmuch command which
outputs long lines in a test script?)

-- next part --
A non-text attachment was scrubbed...
Name: 0001-test-add-tests-for-the-handling-of-References-and-In.patch
Type: text/x-patch
Size: 7358 bytes
Desc: not available
URL: 

-- next part --
A non-text attachment was scrubbed...
Name: 0002-lib-database.cc-change-how-the-parent-of-a-message-i.patch
Type: text/x-patch
Size: 9138 bytes
Desc: not available
URL: 

-- next part --


-- 
Aaron Ecay


[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread David Bremner
Aaron Ecay  writes:

> git send-email is mad about lines >998 characters in the test patch, so
> I?m sending the patches as attachments to this email.  (Is there a
> better way to include the expected output of a notmuch command which
> outputs long lines in a test script?)

What about pretty printing the json in the source?
test_expect_equal_json canonicalizes both arguments before comparing, so
it should be ok.

d


[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-04 Thread Tomi Ollila
On Mon, Mar 04 2013, Aaron Ecay  wrote:

> git send-email is mad about lines >998 characters in the test patch, so
> I?m sending the patches as attachments to this email.  (Is there a
> better way to include the expected output of a notmuch command which
> outputs long lines in a test script?)

Simplest thing is to split the expected string to multiple lines
and use test_expect_equal_json() for checking.

Tomi



[RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-25 Thread Aaron Ecay
Presently, the code which finds the parent of a message as it is being
added to the database assumes that the first Message-ID-like substring
of the In-Reply-To header is the parent Message ID.  Some mail clients,
however, put stuff other than the Message-ID of the parent in the
In-Reply-To header, such as the email address of the sender of the
parent.  This can fool notmuch.

The updated algorithm prefers the last Message ID in the References
header.  The References header lists messages oldest-first, so the last
Message ID is the parent (RFC2822, p. 24).  The References header is
also less likely to be in a non-standard
syntax (http://cr.yp.to/immhf/thread.html,
http://www.jwz.org/doc/threading.html).  In case the References header
is not to be found, fall back to the old behavior.
---

I especially notice this problem on public mailing lists, where
certain people's messages always cause an "out-dent" of the threading,
instead of being nested under whichever message they are replies to.

Technically, putting non-Message-ID crud in the In-Reply-To field is a
violation of RFC2822, but it appears that in practice the References
header is respected more often than the In-Reply-To one.

 lib/database.cc | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 91d4329..cbf33ae 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
const char **next)
  * 'message_id' in the result (to avoid mass confusion when a single
  * message references itself cyclically---and yes, mail messages are
  * not infrequent in the wild that do this---don't ask me why).
+ *
+ * Return the last reference parsed.
 */
-static void
+static char *
 parse_references (void *ctx,
  const char *message_id,
  GHashTable *hash,
@@ -511,7 +513,7 @@ parse_references (void *ctx,
 char *ref;

 if (refs == NULL || *refs == '\0')
-   return;
+   return NULL;

 while (*refs) {
ref = _parse_message_id (ctx, refs, &refs);
@@ -519,6 +521,8 @@ parse_references (void *ctx,
if (ref && strcmp (ref, message_id))
g_hash_table_insert (hash, ref, NULL);
 }
+
+return ref;
 }

 notmuch_status_t
@@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
*notmuch)
 notmuch->last_doc_id++;

 if (notmuch->last_doc_id == 0)
-   INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
+   INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");

 return notmuch->last_doc_id;
 }
@@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
(notmuch_database_t *notmuch,
   const char **thread_id)
 {
 GHashTable *parents = NULL;
-const char *refs, *in_reply_to, *in_reply_to_message_id;
+const char *refs, *in_reply_to, *in_reply_to_message_id, 
*last_ref_message_id;
 GList *l, *keys = NULL;
 notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;

@@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
(notmuch_database_t *notmuch,
 _my_talloc_free_for_g_hash, NULL);

 refs = notmuch_message_file_get_header (message_file, "references");
-parse_references (message, notmuch_message_get_message_id (message),
- parents, refs);
+last_ref_message_id = parse_references (message,
+   notmuch_message_get_message_id 
(message),
+   parents, refs);

 in_reply_to = notmuch_message_file_get_header (message_file, 
"in-reply-to");
 parse_references (message, notmuch_message_get_message_id (message),
  parents, in_reply_to);

-/* Carefully avoid adding any self-referential in-reply-to term. */
 in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
+/* If the parent message ID from the Reply-To and References
+ * headers are different, use the References one.  This is because
+ * the Reply-To header is more likely to be in an non-standard
+ * format. */
+if (in_reply_to_message_id &&
+   last_ref_message_id &&
+   strcmp (last_ref_message_id, in_reply_to_message_id)) {
+   in_reply_to_message_id = last_ref_message_id;
+}
+/* Carefully avoid adding any self-referential in-reply-to term. */
 if (in_reply_to_message_id &&
strcmp (in_reply_to_message_id,
notmuch_message_get_message_id (message)))
 {
_notmuch_message_add_term (message, "replyto",
-_parse_message_id (message, in_reply_to, NULL));
+in_reply_to_message_id);
 }

 keys = g_hash_table_get_keys (parents);
--
1.8.1.4
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmu

Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-26 Thread Jani Nikula
On Tue, 26 Feb 2013, Aaron Ecay  wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.

Hi Aaron, please provide references to a few messages like this. If
available on the notmuch list an id: reference would be best, but
otherwise some archive that allows viewing full message headers or
downloading the full message would be best.

Thanks,
Jani.

>
> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
>  lib/database.cc | 30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
> const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
>  */
> -static void
> +static char *
>  parse_references (void *ctx,
> const char *message_id,
> GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>  char *ref;
>
>  if (refs == NULL || *refs == '\0')
> - return;
> + return NULL;
>
>  while (*refs) {
>   ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>   if (ref && strcmp (ref, message_id))
>   g_hash_table_insert (hash, ref, NULL);
>  }
> +
> +return ref;
>  }
>
>  notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
> *notmuch)
>  notmuch->last_doc_id++;
>
>  if (notmuch->last_doc_id == 0)
> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
>  return notmuch->last_doc_id;
>  }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>  const char **thread_id)
>  {
>  GHashTable *parents = NULL;
> -const char *refs, *in_reply_to, *in_reply_to_message_id;
> +const char *refs, *in_reply_to, *in_reply_to_message_id, 
> *last_ref_message_id;
>  GList *l, *keys = NULL;
>  notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>_my_talloc_free_for_g_hash, NULL);
>
>  refs = notmuch_message_file_get_header (message_file, "references");
> -parse_references (message, notmuch_message_get_message_id (message),
> -   parents, refs);
> +last_ref_message_id = parse_references (message,
> + notmuch_message_get_message_id 
> (message),
> + parents, refs);
>
>  in_reply_to = notmuch_message_file_get_header (message_file, 
> "in-reply-to");
>  parse_references (message, notmuch_message_get_message_id (message),
> parents, in_reply_to);
>
> -/* Carefully avoid adding any self-referential in-reply-to term. */
>  in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
> +/* If the parent message ID from the Reply-To and References
> + * headers are different, use the References one.  This is because
> + * the Reply-To header is more likely to be in an non-standard
> + * format. */
> +if (in_reply_to_message_id &&
> + last_ref_message_id &&
> + strcmp (last_ref_message_id, in_reply_to_message_id)) {
> + in_reply_to_message_id = last_ref_message_id;
> +}
> +/* Carefully avoid adding any self-referential in-reply-to term. */
>  if (in_reply_to_message_id &&
>   strcmp (i

Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-26 Thread Aaron Ecay
2013ko otsailak 26an, Jani Nikula-ek idatzi zuen:
> Hi Aaron, please provide references to a few messages like this. If
> available on the notmuch list an id: reference would be best, but
> otherwise some archive that allows viewing full message headers or
> downloading the full message would be best.

Here’s a message from the notmuch list that has passed through gmane,
which inserts References but not In-Reply-To headers:
id:877h0sa207@fester.com .  Here’s a link to one with a borked
In-Reply-To header:
http://article.gmane.org/gmane.emacs.orgmode/66616/raw

-- 
Aaron Ecay
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-02-28 Thread Jani Nikula

Hi Aaron -

On Tue, 26 Feb 2013, Aaron Ecay  wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.

I think the background is that RFC 822 defines In-Reply-To (and
References too for that matter) as *(phrase / msg-id), while RFC 2822
defines them as 1*msg-id. I'd like something about RFC 822 being
mentioned in the commit message.

The problem in the gmane message you link to in
id:87liaa3luc@gmail.com is likely related to the FAQ item 05.26 "How
do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
http://www.newt.com/faq/mh.html.

> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
>  lib/database.cc | 30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
> const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
>  */
> -static void
> +static char *
>  parse_references (void *ctx,
> const char *message_id,
> GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>  char *ref;
>
>  if (refs == NULL || *refs == '\0')
> - return;
> + return NULL;
>
>  while (*refs) {
>   ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>   if (ref && strcmp (ref, message_id))
>   g_hash_table_insert (hash, ref, NULL);
>  }
> +
> +return ref;

As the comment for the function says, we explicitly avoid including
self-references. I think I'd err on the safe side and return NULL if the
last ref equals message-id.

>  }
>
>  notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
> *notmuch)
>  notmuch->last_doc_id++;
>
>  if (notmuch->last_doc_id == 0)
> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");

I don't know how you got this non-change hunk here, but please remove
it. :)

>
>  return notmuch->last_doc_id;
>  }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>  const char **thread_id)
>  {
>  GHashTable *parents = NULL;
> -const char *refs, *in_reply_to, *in_reply_to_message_id;
> +const char *refs, *in_reply_to, *in_reply_to_message_id, 
> *last_ref_message_id;
>  GList *l, *keys = NULL;
>  notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
> (notmuch_database_t *notmuch,
>_my_talloc_free_for_g_hash, NULL);
>
>  refs = notmuch_message_file_get_header (message_file, "references");
> -parse_references (message, notmuch_message_get_message_id (message),
> -   parents, refs);
> +last_ref_message_id = parse_references (message,
> + notmuch_message_get_message_id 
> (message),
> + parents, refs);
>
>  in_reply_to = notmuch_message_file_get_header (message_file, 
> "in-reply-to");
>  parse_references (message, notmuch_message_get_message_id (message),
> parents, in_reply_to);
>
> -/* Carefully avoid adding any self-referential in-reply-to term. */
>  in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);

I wonder if you should reuse your parse_references() change here, so
you'd set in_reply_to_mess

Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-01 Thread Jani Nikula
On Thu, 28 Feb 2013, Jani Nikula  wrote:
> Hi Aaron -
>
> On Tue, 26 Feb 2013, Aaron Ecay  wrote:
>> Presently, the code which finds the parent of a message as it is being
>> added to the database assumes that the first Message-ID-like substring
>> of the In-Reply-To header is the parent Message ID.  Some mail clients,
>> however, put stuff other than the Message-ID of the parent in the
>> In-Reply-To header, such as the email address of the sender of the
>> parent.  This can fool notmuch.
>
> I think the background is that RFC 822 defines In-Reply-To (and
> References too for that matter) as *(phrase / msg-id), while RFC 2822
> defines them as 1*msg-id. I'd like something about RFC 822 being
> mentioned in the commit message.
>
> The problem in the gmane message you link to in
> id:87liaa3luc@gmail.com is likely related to the FAQ item 05.26 "How
> do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
> http://www.newt.com/faq/mh.html.
>
>> The updated algorithm prefers the last Message ID in the References
>> header.  The References header lists messages oldest-first, so the last
>> Message ID is the parent (RFC2822, p. 24).  The References header is
>> also less likely to be in a non-standard
>> syntax (http://cr.yp.to/immhf/thread.html,
>> http://www.jwz.org/doc/threading.html).  In case the References header
>> is not to be found, fall back to the old behavior.
>> ---
>>
>> I especially notice this problem on public mailing lists, where
>> certain people's messages always cause an "out-dent" of the threading,
>> instead of being nested under whichever message they are replies to.
>>
>> Technically, putting non-Message-ID crud in the In-Reply-To field is a
>> violation of RFC2822, but it appears that in practice the References
>> header is respected more often than the In-Reply-To one.
>>
>>  lib/database.cc | 30 ++
>>  1 file changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/lib/database.cc b/lib/database.cc
>> index 91d4329..cbf33ae 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, 
>> const char **next)
>>   * 'message_id' in the result (to avoid mass confusion when a single
>>   * message references itself cyclically---and yes, mail messages are
>>   * not infrequent in the wild that do this---don't ask me why).
>> + *
>> + * Return the last reference parsed.
>>  */
>> -static void
>> +static char *
>>  parse_references (void *ctx,
>>const char *message_id,
>>GHashTable *hash,
>> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>>  char *ref;
>>
>>  if (refs == NULL || *refs == '\0')
>> -return;
>> +return NULL;
>>
>>  while (*refs) {
>>  ref = _parse_message_id (ctx, refs, &refs);
>> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>>  if (ref && strcmp (ref, message_id))
>>  g_hash_table_insert (hash, ref, NULL);
>>  }
>> +
>> +return ref;
>
> As the comment for the function says, we explicitly avoid including
> self-references. I think I'd err on the safe side and return NULL if the
> last ref equals message-id.
>
>>  }
>>
>>  notmuch_status_t
>> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t 
>> *notmuch)
>>  notmuch->last_doc_id++;
>>
>>  if (notmuch->last_doc_id == 0)
>> -INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>> +INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
> I don't know how you got this non-change hunk here, but please remove
> it. :)
>
>>
>>  return notmuch->last_doc_id;
>>  }
>> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents 
>> (notmuch_database_t *notmuch,
>> const char **thread_id)
>>  {
>>  GHashTable *parents = NULL;
>> -const char *refs, *in_reply_to, *in_reply_to_message_id;
>> +const char *refs, *in_reply_to, *in_reply_to_message_id, 
>> *last_ref_message_id;
>>  GList *l, *keys = NULL;
>>  notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>>
>> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents 
>> (notmuch_database_t *notmuch,
>>   _my_talloc_free_for_g_hash, NULL);
>>
>>  refs = notmuch_message_file_get_header (message_file, "references");
>> -parse_references (message, notmuch_message_get_message_id (message),
>> -  parents, refs);
>> +last_ref_message_id = parse_references (message,
>> +notmuch_message_get_message_id 
>> (message),
>> +parents, refs);
>>
>>  in_reply_to = notmuch_message_file_get_header (message_file, 
>> "in-reply-to");
>>  parse_references (message, notmuch_message_get_message_id (message),
>>parents, in_reply_to);
>>
>> -/* Carefully avoid adding any self-referential in-reply-to term. */
>>  in_re

Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread Aaron Ecay
Hi Jani,

Thanks to you and Austin for the comments.

2013ko martxoak 1an, Jani Nikula-ek idatzi zuen:
>> I think the background is that RFC 822 defines In-Reply-To (and
>> References too for that matter) as *(phrase / msg-id), while RFC 2822
>> defines them as 1*msg-id. I'd like something about RFC 822 being
>> mentioned in the commit message.
>> 
>> The problem in the gmane message you link to in
>> id:87liaa3luc@gmail.com is likely related to the FAQ item 05.26
>> "How do I fix a bogus In-Reply-To or missing References field?" in
>> the MH FAQ http://www.newt.com/faq/mh.html.

Likely yes.  But I think notmuch should handle these messages, since
they are seen in the wild (and I don’t think you disagree with me on
this point?)


>> 
>> As the comment for the function says, we explicitly avoid including
>> self-references. I think I'd err on the safe side and return NULL if
>> the last ref equals message-id.

Done.

>> 
>> I don't know how you got this non-change hunk here, but please remove
>> it. :)

That’s what I get for setting my editor to delete trailing whitespace on
save (then not reading outgoing patches carefully).  Fixed.

>> I wonder if you should reuse your parse_references() change here, so
>> you'd set in_reply_to_message_id to the last message-id in
>> In-Reply-To. This might tackle some of the problematic cases
>> directly, but should still be all right per RFC 2822. I didn't verify
>> how the parser handles an RFC 2822 violating free form header though.
> 
> Strike that based on http://www.jwz.org/doc/threading.html:
> 
> "If there are multiple things in In-Reply-To that look like
> Message-IDs, only use the first one of them: odds are that the later
> ones are actually email addresses, not IDs."

Hmm.  I think it’s a toss-up which of multiple quasi-message-ids is the
real one.  In the email message example I linked upthread, it was the
last one that was real.  I decided to use the last one, because it
allows the self-reference checking to be pushed entirely into
parse_references.  If you feel strongly that we should use the first
one, I can change it back.

> I talked to Austin (CC) about the patch on IRC, and his comment was,
> perceptive as always:
> 
>  23:38 amdragon Is the logic in that patch equivalent to always using
> the last message ID in references unless there is no references
> header?  Seems like it is, but in a convoluted way.
> 
> And that's actually the case, isn't it? To make the code reflect that,
> you should use last_ref_message_id, and if that's NULL, fallback to
> in_reply_to_message_id.

Yes.  Fixed.

> 
>> I suggest adding an else if branch (or revamp the above if condition)
>> to tackle the missing In-Reply-To header:
>> 
>> else if (!in_reply_to_message_id && last_ref_message_id) {
>> in_reply_to_message_id = last_ref_message_id; }
> 
> Strike that, it should be the other way round.

Now that the self-reference check is in parse_references, the
conditional is much simpler.

One additional change I made in this version was to factor out 3 calls
to “notmuch_message_get_message_id (message)” into a variable inside the
_notmuch_database_link_message_to_parents function, for a small boost to
readability (and perhaps speed, depending on how clever the compiler is
I guess).

I also added tests – those are the first of two patches that will follow
this email, the second being the code to make them pass.

-- 
Aaron Ecay
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread Aaron Ecay
git send-email is mad about lines >998 characters in the test patch, so
I’m sending the patches as attachments to this email.  (Is there a
better way to include the expected output of a notmuch command which
outputs long lines in a test script?)

>From 23836241dd304b98f2a05803fbb5a5a94f563050 Mon Sep 17 00:00:00 2001
From: Aaron Ecay 
Date: Sun, 3 Mar 2013 18:14:07 -0500
Subject: [PATCH 1/2] test: add tests for the handling of References and
 In-Reply-To headers

These tests are known_broken, the following commit fixes them.
---
 test/thread-replies | 55 +
 1 file changed, 55 insertions(+)
 create mode 100755 test/thread-replies

diff --git a/test/thread-replies b/test/thread-replies
new file mode 100755
index 000..fd11a09
--- /dev/null
+++ b/test/thread-replies
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+#
+# Copyright (c) 2013 Aaron Ecay
+#
+
+test_description='test of proper handling of in-reply-to and references headers
+
+This test makes sure that the thread structure in the notmuch database is
+constructed properly, even in the presence of non-RFC-compliant headers'
+
+. ./test-lib.sh
+
+test_begin_subtest "Use References when In-Reply-To is broken"
+test_subtest_known_broken
+add_message '[id]="f...@one.com"' \
+'[subject]=one'
+add_message '[in-reply-to]="mumble"' \
+'[references]=""' \
+'[subject]="Re: one"'
+output=$(notmuch show --format=json 'subject:one')
+test_expect_equal "$output" '[[[{"id": "f...@one.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-001", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "one", "From": "Notmuch Test Suite ", "To": "Notmuch Test Suite ", "Date": "Fri, 05 Jan 2001 15:43:57 +"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#1)\n"}]}, [[{"id": "msg-002@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-002", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: one", "From": "Notmuch Test Suite ", "To": "Notmuch Test Suite ", "Date": "Fri, 05 Jan 2001 15:43:57 +"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#2)\n"}]}, []]'
+
+test_begin_subtest "Prefer References to In-Reply-To"
+test_subtest_known_broken
+add_message '[id]="f...@two.com"' \
+'[subject]=two'
+add_message '[in-reply-to]=""' \
+'[references]=""' \
+'[subject]="Re: two"'
+output=$(notmuch show --format=json 'subject:two')
+test_expect_equal "$output" '[[[{"id": "f...@two.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-003", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "two", "From": "Notmuch Test Suite ", "To": "Notmuch Test Suite ", "Date": "Fri, 05 Jan 2001 15:43:57 +"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#3)\n"}]}, [[{"id": "msg-004@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-004", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: two", "From": "Notmuch Test Suite ", "To": "Notmuch Test Suite ", "Date": "Fri, 05 Jan 2001 15:43:57 +"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#4)\n"}]}, []]'
+
+test_begin_subtest "Use In-Reply-To when no References"
+test_subtest_known_broken
+add_message '[id]="f...@three.com"' \
+'[subject]="three"'
+add_message '[in-reply-to]=""' \
+'[subject]="Re: three"'
+output=$(notmuch show --format=json 'subject:three')
+test_expect_equal "$output" '[[[{"id": "f...@three.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-005", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "three", "From": "Notmuch Test Suite ", "To": "Notmuch Test Suite ", "Date": "Fri, 05 Jan 2001 15:43:57 +"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#5)\n"}]}, [[{"id": "msg-006@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-006", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: three", "From": "Notmuch Test Suite ", "To": "Notmuch Test Suite ", "Date": "Fri, 05 Jan 2001 15:43:57 +"}, "body": [{"i

Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread David Bremner
Aaron Ecay  writes:

> git send-email is mad about lines >998 characters in the test patch, so
> I’m sending the patches as attachments to this email.  (Is there a
> better way to include the expected output of a notmuch command which
> outputs long lines in a test script?)

What about pretty printing the json in the source?
test_expect_equal_json canonicalizes both arguments before comparing, so
it should be ok.

d
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated

2013-03-03 Thread Tomi Ollila
On Mon, Mar 04 2013, Aaron Ecay  wrote:

> git send-email is mad about lines >998 characters in the test patch, so
> I’m sending the patches as attachments to this email.  (Is there a
> better way to include the expected output of a notmuch command which
> outputs long lines in a test script?)

Simplest thing is to split the expected string to multiple lines
and use test_expect_equal_json() for checking.

Tomi

___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch