Re: Cheap way to check for new messages in a thread

2023-06-16 Thread Konstantin Ryabitsev
On Thu, Mar 30, 2023 at 11:29:51AM +, Eric Wong wrote:
> This implements the mbox.gz retrieval.  I didn't want to deal
> with HTML nor figuring out how to expose more <form> elements,
> yet; but I figure mbox.gz is the most important.
> 
> Now deployed on 80x24.org/lore:
> 
> MSGID=20230327080502.GA570847@ziqianlu-desk2
> curl -d '' -sSf \
>    https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
>    zcat | grep -i ^Message-ID:

Eric:

Reviving this old thread for some clarification. I noticed that this only
works for /all/, but not for individual inboxes. E.g.:

$ curl -d '' -sSf \
  https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
  | zgrep -i ^Message-ID:
Message-ID: 

but with /lkml/ I get a 404:

$ curl -d '' -sSf \
  https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
  | zgrep -i ^Message-ID:
curl: (22) The requested URL returned error: 404

Is that intentionally restricted to just extindex?

-K



Re: Cheap way to check for new messages in a thread

2023-04-11 Thread Eric Wong
Eric Wong  wrote:
> Konstantin Ryabitsev  wrote:
> > I can't easily test this, because lore is currently mostly on 1.9 and the
> > > patch doesn't cleanly apply to that tree. However, I will be happy to test it
> > out once 2.0 is out and we've updated to it on our systems.
> 
> Fwiw, master is good on Linux for mail.

Erm, almost :x   The --batch-command support added a difficult-to-trigger bug
that went undetected for a few months:

  https://public-inbox.org/meta/2023042350.297099-...@80x24.org/
  ("git: fix cat_async_retry")



Re: Cheap way to check for new messages in a thread

2023-03-30 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> I can't easily test this, because lore is currently mostly on 1.9 and the
> patch doesn't cleanly apply to that tree. However, I will be happy to test it
> out once 2.0 is out and we've updated to it on our systems.

Fwiw, master is good on Linux for mail.  codesearch still
needs work, and lei on FreeBSD gets stuck sometimes.



Re: Cheap way to check for new messages in a thread

2023-03-30 Thread Konstantin Ryabitsev
On Thu, Mar 30, 2023 at 11:29:51AM +, Eric Wong wrote:
> > Per-thread search is something I've wanted for a while, anyways,
> > so I think I'll do /$MSGID/?q= in between ongoing work for
> 
> This implements the mbox.gz retrieval.  I didn't want to deal
> with HTML nor figuring out how to expose more <form> elements,
> yet; but I figure mbox.gz is the most important.

Nice, thanks!

I can't easily test this, because lore is currently mostly on 1.9 and the
patch doesn't cleanly apply to that tree. However, I will be happy to test it
out once 2.0 is out and we've updated to it on our systems.

Cheers,
-K



Re: Cheap way to check for new messages in a thread

2023-03-30 Thread Eric Wong
Eric Wong  wrote:
> Konstantin Ryabitsev  wrote:
> > However, if you do want to add ability to cheaply do a "give me just the
> > newest messages in this thread since this datetime", that would be great for
> > my needs. :)
> 
> Per-thread search is something I've wanted for a while, anyways,
> so I think I'll do /$MSGID/?q= in between ongoing work for

This implements the mbox.gz retrieval.  I didn't want to deal
with HTML nor figuring out how to expose more <form> elements,
yet; but I figure mbox.gz is the most important.

Now deployed on 80x24.org/lore:

MSGID=20230327080502.GA570847@ziqianlu-desk2
curl -d '' -sSf \
   https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
   zcat | grep -i ^Message-ID:

shows the expected messages.
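
For the "which Message-IDs don't we know about yet" use case, the output
of that pipeline can be compared against a local list.  A minimal
sketch, not part of the patch below: `new_msgids' and the known-IDs
file (one Message-ID per line, angle brackets included) are made-up
names, and the function only reads a gzipped mbox on stdin, so the curl
invocation stays exactly as above.

```shell
#!/bin/sh
# new_msgids KNOWN_FILE   (hypothetical helper, for illustration only)
# Reads a gzipped mbox on stdin and prints every Message-ID value that
# is not already listed (one per line) in KNOWN_FILE.
# The match is naive: folded headers are missed, and a body line that
# starts with "Message-ID:" would be counted as well.
# Exit status is that of the last grep: 0 if something new was printed,
# 1 if nothing was.
new_msgids() {
	gzip -dc |
		grep -i '^Message-ID:' |
		sed 's/^[^:]*:[[:space:]]*//' |
		sort -u |
		grep -Fxv -f "$1"
}
```

Usage would be along the lines of `curl ... | new_msgids known-ids',
appending whatever it prints to the known-IDs file afterwards.
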
---8<---
Subject: [PATCH] www: support POST /$INBOX/$MSGID/?x=m&q=

This allows filtering the contents of any existing thread using
a search query.  It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.

This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion.  It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.

The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.

I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was
allowing extra characters.  With an added \z, it's now as strict
as originally intended, and AFAIK nothing was generating invalid
URLs for it.

Reported-by: Konstantin Ryabitsev 
Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/
---
 lib/PublicInbox/Mbox.pm   |  5 
 lib/PublicInbox/Over.pm   | 24 ++-
 lib/PublicInbox/Search.pm |  6 +
 lib/PublicInbox/WWW.pm    |  4 +++-
 t/psgi_v2.t               | 50 ++-
 5 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 18db9d38..e1abf7ec 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -229,6 +229,11 @@ sub mbox_all {
return PublicInbox::WWW::need($ctx, 'Overview');
 
my $qopts = $ctx->{qopts} = { relevance => -2 }; # ORDER BY docid DESC
+
+   # {threadid} limits results to a given thread
+   # {threads} collapses results from messages in the same thread,
+   # allowing us to use ->expand_thread w/o duplicates in our own code
+   $qopts->{threadid} = $over->mid2tid($ctx->{mid}) if defined($ctx->{mid});
$qopts->{threads} = 1 if $q->{t};
$srch->query_approxidate($ctx->{ibx}->git, $q_string);
my $mset = $srch->mset($q_string, $qopts);
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 271e2246..6ba27118 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -283,13 +283,35 @@ SELECT eidx_key FROM inboxes WHERE ibx_id = ?
$rows;
 }
 
+sub mid2tid {
+   my ($self, $mid) = @_;
+   my $dbh = dbh($self);
+
+   my $sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT id FROM msgid WHERE mid = ? LIMIT 1
+
+   $sth->execute($mid);
+   my $id = $sth->fetchrow_array or return;
+   $sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT num FROM id2num WHERE id = ? AND num > ?
+ORDER BY num ASC LIMIT 1
+
+   $sth->execute($id, 0);
+   my $num = $sth->fetchrow_array or return;
+   $sth = $dbh->prepare(<<'');
+SELECT tid FROM over WHERE num = ? LIMIT 1
+
+   $sth->execute($num);
+   $sth->fetchrow_array;
+}
+
 sub next_by_mid {
my ($self, $mid, $id, $prev) = @_;
my $dbh = dbh($self);
 
unless (defined $$id) {
my $sth = $dbh->prepare_cached(<<'', undef, 1);
-   SELECT id FROM msgid WHERE mid = ? LIMIT 1
+SELECT id FROM msgid WHERE mid = ? LIMIT 1
 
$sth->execute($mid);
$$id = $sth->fetchrow_array;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 5133a3b7..6c3d9f93 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -386,6 +386,12 @@ sub mset {
sortable_serialise($uid_range->[1]));
$query = $X{Query}->new(OP_FILTER(), $query, $range);
}
+   if (defined(my $tid = $opt->{threadid})) {
+   $tid = sortable_serialise($tid);
+   $query = $X{Query}->new(OP_FILTER(), $query,
+   $X{Query}->new(OP_VALUE_RANGE(), THREADID, $tid, $tid));
+   }
+
my $xdb = xdb($self);
my $enq = $X{Enquire}->new($xdb);
$enq->set_query($query);
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 9ffcb879..a8f1ad17 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -68,7 +68,9 @@ sub call {
my ($idx, $fn) = ($3, $4);
return 

Re: Cheap way to check for new messages in a thread

2023-03-29 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> I'm fine with either of these, and just to stress, it's not really blocking
> anything I'm working on -- bugbot is in initial rollout stages, so while the
> number of tracked bugs/threads remains low, even if we re-download a hundred
> threads every 10 minutes, it's just internal churn between two adjacent VMs.
> If it becomes heavy, I can always look into switching to lei and performing
> local queries instead of doing external polling.

Alright.

> However, if you do want to add ability to cheaply do a "give me just the
> newest messages in this thread since this datetime", that would be great for
> my needs. :)

Per-thread search is something I've wanted for a while, anyways,
so I think I'll do /$MSGID/?q= in between ongoing work for
codesearch and chasing down FreeBSD issues.

I may not expose /$MSGID/?q= via HTML just yet since I find
<form> elements confusing as a user :x

Indexing References/IRT would be a waste of space and I/O due to
MUA truncations; so I'm hesitant to do it since we already index
THREADID.

thread:{sub-query} will be nice, but I'll get to it after I deal
with the lei FUSE stuff since I've already done most of the C
work from another project.  Normal FSes are so inefficient
for storing Maildir outputs.



Re: Cheap way to check for new messages in a thread

2023-03-28 Thread Konstantin Ryabitsev
On Tue, Mar 28, 2023 at 10:08:30PM +, Eric Wong wrote:
> > I think this is a workable approach, but would require a reindex, right?
> 
> Yes, it requires a reindex to take effect, which takes ~2 days
> on my lore mirror.  The biggest problem is MUAs are likely to
> cull References: when threads get too long; so accuracy gets
> lost.
> 
> Supporting /$MSGID/?q=... doesn't seem like the worst idea,
> actually; since I've seen some web forums (phpBB maybe?) have a
> "search in thread" function.
> 
> thread:{sub-query} is ideal; and I wouldn't rule out doing any
> combination of the three (I don't like separating before/after).

I'm fine with either of these, and just to stress, it's not really blocking
anything I'm working on -- bugbot is in initial rollout stages, so while the
number of tracked bugs/threads remains low, even if we re-download a hundred
threads every 10 minutes, it's just internal churn between two adjacent VMs.
If it becomes heavy, I can always look into switching to lei and performing
local queries instead of doing external polling.

However, if you do want to add ability to cheaply do a "give me just the
newest messages in this thread since this datetime", that would be great for
my needs. :)

-K



Re: Cheap way to check for new messages in a thread

2023-03-28 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Tue, Mar 28, 2023 at 07:45:49PM +, Eric Wong wrote:
> > C) index References:/In-Reply-To: so searching `ref:$MSGID'
> >    can work.  This doesn't work for some MUAs and deep
> >    threads, though.
> 
> I think this is a workable approach, but would require a reindex, right?

Yes, it requires a reindex to take effect, which takes ~2 days
on my lore mirror.  The biggest problem is MUAs are likely to
cull References: when threads get too long; so accuracy gets
lost.

Supporting /$MSGID/?q=... doesn't seem like the worst idea,
actually; since I've seen some web forums (phpBB maybe?) have a
"search in thread" function.

thread:{sub-query} is ideal; and I wouldn't rule out doing any
combination of the three (I don't like separating before/after).



Re: Cheap way to check for new messages in a thread

2023-03-28 Thread Konstantin Ryabitsev
On Tue, Mar 28, 2023 at 07:45:49PM +, Eric Wong wrote:
> C) index References:/In-Reply-To: so searching `ref:$MSGID'
>    can work.  This doesn't work for some MUAs and deep
>    threads, though.

I think this is a workable approach, but would require a reindex, right?

-K



Re: Cheap way to check for new messages in a thread

2023-03-28 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Mon, Mar 27, 2023 at 09:38:49PM +, Eric Wong wrote:
> > I thought about that, too; but I'm worried about having one-off
> > stuff that ends up needing to be supported indefinitely.
> > 
> > JMAP for this would take more time, but I'd be more comfortable
> > carrying it long-term.
> > 
> > I don't expect trimming after the first paragraph to be a huge
> > improvement.  Retrieving any part of the message from git and
> > dealing with MIME is expensive, anyways.  I wouldn't expect it
> > to be a big (if any) improvement compared to POST-ing for the
> > mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..
> 
> Hmm... This didn't seem to do the right thing for me. For example, this
> thread:
> 
> https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2
> 
> If I ask for any new messages in that thread since 2023032712, I get
> nothing:
> 
> curl -Sf -d '' \
>   'https://lore.kernel.org/all/?x=m&t=1&q=mid%3A20230327080502.GA570847@ziqianlu-desk2+AND+dt%3A2023032812..'

Ugh, that's because the thread expansion (t=1) happens after
Xapian handles dt:/rt:/d:

I don't know if there's a good way to do that entirely within
Xapian via high-level Perl bindings.

Some options:

A) grab MSGID first, lookup THREADID for a given MSGID,
   use remaining query

   The problem is figuring out which parts of the query to
   handle, first.  Maybe a solution below...

B) add explicit before= and after= parameters which allow us
   to do filtering ourselves in the thread expansion phase

C) index References:/In-Reply-To: so searching `ref:$MSGID'
   can work.  This doesn't work for some MUAs and deep
   threads, though.

D) Support `thread:{subquery}' like notmuch.
   Thus `thread:{mid:$MSGID} AND dt:$START..' would communicate
   to Xapian what we want for A).

   I'm not sure this is doable unless using Xapian via C++,
   but I've been considering providing the option to use C++
   anyways to support less hacky approxidate query parsing.
   According to notmuch docs, it's expensive, though :<

I think it's possible to support /$INBOX/$MSGID/t.mbox.gz?q=...
for A) without too much difficulty.  I'll have to think
about it a bit...

D) is good for long-term consideration if proper timeouts can
be implemented.

> > The mbox.gz endpoints should be a bit more efficient for the
> > server than Atom feeds; decoding MIME and HTML escaping takes up
> > considerable CPU time.
> 
> Good to know. I'm really looking for a way to ask the remote system "hey, is
> there anything new in this thread?" so that I can quickly ignore threads
> without any updates.

All the mbox.gz endpoints will 404 if there's no results, and
the `-f' flag of curl will ensure nothing's emitted to stdout
in that case.
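
So a yes/no check can be driven by curl's exit status alone.  A rough
sketch, with a made-up function name (`thread_updated') that takes the
full POST URL as its only argument; note that curl's exit code 22 covers
every HTTP status >= 400, so a server error is indistinguishable from
the 404 here:

```shell
#!/bin/sh
# thread_updated URL   (hypothetical helper, for illustration only)
# POSTs an empty body to an mbox.gz search URL and discards the reply.
# Returns 0 if the server answered with results, 1 if curl -f reported
# an HTTP error (which includes the 404 for "no results"), and 2 for
# anything else (DNS, TLS, timeouts, ...).
thread_updated() {
	rc=0
	curl -d '' -sSf -o /dev/null "$1" || rc=$?
	case $rc in
	0) return 0 ;;
	22) return 1 ;;
	*) return 2 ;;
	esac
}
```

Something like `thread_updated "$URL" && fetch_thread' would then skip
quiet threads; fetch_thread is a placeholder for whatever does the real
download.
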



Re: Cheap way to check for new messages in a thread

2023-03-28 Thread Konstantin Ryabitsev
On Mon, Mar 27, 2023 at 09:38:49PM +, Eric Wong wrote:
> I thought about that, too; but I'm worried about having one-off
> stuff that ends up needing to be supported indefinitely.
> 
> JMAP for this would take more time, but I'd be more comfortable
> carrying it long-term.
> 
> I don't expect trimming after the first paragraph to be a huge
> improvement.  Retrieving any part of the message from git and
> dealing with MIME is expensive, anyways.  I wouldn't expect it
> to be a big (if any) improvement compared to POST-ing for the
> mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..

Hmm... This didn't seem to do the right thing for me. For example, this
thread:

https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2

If I ask for any new messages in that thread since 2023032712, I get
nothing:

curl -Sf -d '' \
  'https://lore.kernel.org/all/?x=m&t=1&q=mid%3A20230327080502.GA570847@ziqianlu-desk2+AND+dt%3A2023032812..'

> The mbox.gz endpoints should be a bit more efficient for the
> server than Atom feeds; decoding MIME and HTML escaping takes up
> considerable CPU time.

Good to know. I'm really looking for a way to ask the remote system "hey, is
there anything new in this thread?" so that I can quickly ignore threads
without any updates.

-K



Re: Cheap way to check for new messages in a thread

2023-03-27 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Mon, Mar 27, 2023 at 07:10:49PM +, Eric Wong wrote:
> > > > For the bugzilla integration work I'm doing, I need a way to check if there
> > > > were any updates to a thread since the last check. Right now, I'm just
> > > > grabbing the full thread, parsing it and seeing if there are any new
> > > > message-IDs that we don't know about, but it's very wasteful. Any way to just
> > > > issue something like "how many messages are in a thread with this message-id"
> > > or "are there any updates to a thread with this message-id since
> > > MMDDHHMMSS?
> > 
> >   lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..
> > 
> > Returns JSON and won't retrieve message bodies from git.
> 
> Ah, I was hoping to have a fully remote way of doing this.
> 
> > I wouldn't query down to the second due to propagation delays,
> > clock skew, etc, though.
> > 
> > There might be a JMAP endpoint I can implement for WWW which
> > only retrieves that info, but getting backreferences (required
> > by the JMAP spec) to work properly seemed painful.
> 
> What about a "bodiless" atom feed? It's already available per thread, so
> perhaps there could be a mode that skips the bodies or trims them after the
> first paragraph?

I thought about that, too; but I'm worried about having one-off
stuff that ends up needing to be supported indefinitely.

JMAP for this would take more time, but I'd be more comfortable
carrying it long-term.

I don't expect trimming after the first paragraph to be a huge
improvement.  Retrieving any part of the message from git and
dealing with MIME is expensive, anyways.  I wouldn't expect it
to be a big (if any) improvement compared to POST-ing for the
mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..

The mbox.gz endpoints should be a bit more efficient for the
server than Atom feeds; decoding MIME and HTML escaping takes up
considerable CPU time.



Re: Cheap way to check for new messages in a thread

2023-03-27 Thread Konstantin Ryabitsev
On Mon, Mar 27, 2023 at 07:10:49PM +, Eric Wong wrote:
> > For the bugzilla integration work I'm doing, I need a way to check if there
> > were any updates to a thread since the last check. Right now, I'm just
> > grabbing the full thread, parsing it and seeing if there are any new
> > message-IDs that we don't know about, but it's very wasteful. Any way to just
> > issue something like "how many messages are in a thread with this message-id"
> > or "are there any updates to a thread with this message-id since
> > MMDDHHMMSS?
> 
>   lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..
> 
> Returns JSON and won't retrieve message bodies from git.

Ah, I was hoping to have a fully remote way of doing this.

> I wouldn't query down to the second due to propagation delays,
> clock skew, etc, though.
> 
> There might be a JMAP endpoint I can implement for WWW which
> only retrieves that info, but getting backreferences (required
> by the JMAP spec) to work properly seemed painful.

What about a "bodiless" atom feed? It's already available per thread, so
perhaps there could be a mode that skips the bodies or trims them after the
first paragraph?

-K



Re: Cheap way to check for new messages in a thread

2023-03-27 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> Hello:
> 
> For the bugzilla integration work I'm doing, I need a way to check if there
> were any updates to a thread since the last check. Right now, I'm just
> grabbing the full thread, parsing it and seeing if there are any new
> message-IDs that we don't know about, but it's very wasteful. Any way to just
> issue something like "how many messages are in a thread with this message-id"
> or "are there any updates to a thread with this message-id since
> MMDDHHMMSS?

  lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..

Returns JSON and won't retrieve message bodies from git.

I wouldn't query down to the second due to propagation delays,
clock skew, etc, though.
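
One way to leave that slack is to back the lower bound off by a margin
and drop the seconds.  A small sketch, assuming GNU date and a stamp in
the YYYYMMDDhhmmss form that dt: takes, read as UTC; the function name
and the 10-minute default margin are arbitrary choices for illustration:

```shell
#!/bin/sh
# since_stamp LAST_CHECK_EPOCH [MARGIN_SECONDS]   (hypothetical helper)
# Prints a UTC YYYYMMDDhhmmss stamp lying MARGIN_SECONDS (default 600)
# before LAST_CHECK_EPOCH, rounded down to a whole minute.
# Relies on GNU date's "-d @EPOCH"; BSD date would need "-r EPOCH".
since_stamp() {
	last=$1
	margin=${2:-600}
	t=$(( (last - margin) / 60 * 60 ))
	date -u -d "@$t" +%Y%m%d%H%M%S
}
```

The result would go into the query as dt:$(since_stamp "$last")..
Overlapping windows mean some messages show up twice, so the caller
still has to dedupe by Message-ID.
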


There might be a JMAP endpoint I can implement for WWW which
only retrieves that info, but getting backreferences (required
by the JMAP spec) to work properly seemed painful.