[PATCH] TODO: add some Xapian-related stuff

2022-08-26 Thread Eric Wong
Just to more clearly spell out what needs to be done on
the search side.
---
 TODO | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/TODO b/TODO
index 36055911..14dcfe72 100644
--- a/TODO
+++ b/TODO
@@ -111,6 +111,13 @@ all need to be considered for everything we introduce)
 * improve performance and avoid head-of-line blocking on slow storage
   (done for most git blob retrievals, Xapian needs work)
 
+* allow optional use of separate Xapian worker process to implement
+  timeouts and avoid head-of-line blocking problems.  Consider
+  just-ahead-of-time builds to take advantage of custom date parsers
+  (approxidate) and other features not available to Perl bindings.
+
+* integrate git approxidate parsing into Xapian w/o spawning git
+
 * HTTP(S) search API (likely JMAP, but GraphQL could be an option)
   It should support git-specific prefixes (dfpre:, dfpost:, dfn:, etc)
   as extensions.  If JMAP, it should have HTTP(S) analogues to



Re: extindex for git? [was: an even bigger git show than before...]

2022-08-26 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Thu, Aug 25, 2022 at 09:34:42PM +, Eric Wong wrote:
> > > I wanted to add search to git repos ages ago, but it was silly
> > > expensive in terms of space.  That was before extindex...
> > > 
> > > extindex ought to be able to offer space savings across forks
> > > and similar documents (commits vs patch mails).
> > > 
> > > At least dfpre/dfpost/dfn/subject may be enough, even...
> > 
> > And I'm also thinking extindexing coderepos can make
> > auto-association with inboxes possible.
> > 
> > Right now, configuring coderepos on a large scale is a huge PITA
> > given the M:N associations between inboxes and coderepos.
> > 
> > Being able to do fuzzy JOIN-ish operations based on
> > blobs/filenames/subjects would allow extindex to automatically
> > associate coderepos with inboxes and vice-versa.
> 
> I wonder how well this would work in the presence of many forks? E.g. most of
> the content on git.kernel.org are thin forks of linux.git, so matching by
> blobs/filenames/subjects across all of them would return too many hits and
> some kind of priority ordering would be required, I think.

Auto-grouping of coderepos should be possible by common root commit(s).
Config file ordering will be taken into account, of course;
and that's at the discretion of whoever controls $PI_CONFIG.
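
A rough sketch of the grouping idea, not an actual implementation: the
helper name group_by_root, the repo names, and the root commit IDs are
all made up for illustration.  It assumes the root commits of each
coderepo were collected beforehand (e.g. via
`git rev-list --max-parents=0 HEAD`), and it lets config file ordering
decide which repo leads a group.

```perl
use strict;
use warnings;

# Hypothetical helper: group coderepos sharing at least one root commit.
# %$roots maps a repo name to an array ref of its root commit IDs.
# @$order is the config file ordering; earlier repos lead their group.
# Limitation: a repo bridging two existing groups does not merge them.
sub group_by_root {
	my ($roots, $order) = @_;
	my (%group_of, @groups); # root commit ID => index into @groups
	for my $repo (@$order) {
		my @r = @{$roots->{$repo} // []};
		# reuse the first existing group sharing any root commit
		my ($idx) = grep { defined } map { $group_of{$_} } @r;
		if (!defined $idx) {
			push @groups, [];
			$idx = $#groups;
		}
		push @{$groups[$idx]}, $repo;
		$group_of{$_} //= $idx for @r;
	}
	\@groups;
}

# placeholder data, not real commit IDs
my $roots = {
	'linux.git' => [ 'root-a' ],
	'some-fork.git' => [ 'root-a' ],
	'other.git' => [ 'root-b' ],
};
my $groups = group_by_root($roots,
		[ qw(linux.git other.git some-fork.git) ]);
print join(' ', @$_), "\n" for @$groups;
```

Thin forks land in the group of whichever repo sharing their root
commit comes first in the config, so priority stays under the
control of whoever maintains the config file.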

> Overall, though, I do agree that this would be really handy.

Yes, it's something I've wanted for years; but couldn't figure
out how to do it efficiently until extindex.



Re: extindex for git? [was: an even bigger git show than before...]

2022-08-26 Thread Konstantin Ryabitsev
On Thu, Aug 25, 2022 at 09:34:42PM +, Eric Wong wrote:
> > I wanted to add search to git repos ages ago, but it was silly
> > expensive in terms of space.  That was before extindex...
> > 
> > extindex ought to be able to offer space savings across forks
> > and similar documents (commits vs patch mails).
> > 
> > At least dfpre/dfpost/dfn/subject may be enough, even...
> 
> And I'm also thinking extindexing coderepos can make
> auto-association with inboxes possible.
> 
> Right now, configuring coderepos on a large scale is a huge PITA
> given the M:N associations between inboxes and coderepos.
> 
> Being able to do fuzzy JOIN-ish operations based on
> blobs/filenames/subjects would allow extindex to automatically
> associate coderepos with inboxes and vice-versa.

I wonder how well this would work in the presence of many forks? E.g. most of
the content on git.kernel.org are thin forks of linux.git, so matching by
blobs/filenames/subjects across all of them would return too many hits and
some kind of priority ordering would be required, I think.

Overall, though, I do agree that this would be really handy.

-K



[PATCH] www: fix unindexed v1 inboxes w/ public-inbox-httpd

2022-08-26 Thread Eric Wong
Unindexed v1 inboxes were leaving $smsg objects unpopulated when
using public-inbox-httpd (but not generic PSGI servers) and
causing missing HTML content and uninitialized value warnings.

Our existing tests for unindexed v1 inboxes only assumed generic
PSGI servers and synchronous blob retrieval.  Due to changes
several years ago to make git blob retrieval async for slow
storage using public-inbox-httpd, our tests were insufficient to
detect this regression.

So ensure $smsg->populate runs in a few places and rewrite
t/plack.t to test against both generic PSGI and -httpd
implementations.
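
A minimal sketch of the fallback pattern, using a hypothetical
DemoSmsg class and a plain hash ref in place of the real smsg and
eml objects (neither DemoSmsg nor ensure_populated exists in the
codebase).  It assumes an indexed inbox leaves {mid} already set,
while an unindexed one leaves it undefined until populate runs.

```perl
use strict;
use warnings;

package DemoSmsg; # hypothetical stand-in, not PublicInbox::Smsg

sub new { bless {}, shift }

# fill in fields from an already-retrieved message (a hash ref here)
sub populate {
	my ($self, $eml) = @_;
	$self->{mid} = $eml->{'Message-ID'};
	$self->{subject} = $eml->{Subject};
	$self;
}

package main;

# Populate only when the index did not already provide the fields;
# defined-or (//) skips the call whenever {mid} is already set.
sub ensure_populated {
	my ($smsg, $eml) = @_;
	$smsg->{mid} // $smsg->populate($eml);
	$smsg;
}

my $eml = { 'Message-ID' => 'a@example.com', Subject => 'hello' };
my $unindexed = ensure_populated(DemoSmsg->new, $eml);
print "$unindexed->{mid}: $unindexed->{subject}\n";
```

Because the check happens in the callback receiving the blob, it
works the same whether retrieval is synchronous or async.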

Fortunately, unindexed v1 inboxes are uncommon, and this
bug was only (finally) discovered while developing other
features.

For ensuring we can test (and not blindly follow) redirects with
-httpd, we now provide our own LWP::UserAgent (used internally
by Plack::Test::ExternalServer) with redirect following
disabled to P:T:ES::test_psgi.
---
 lib/PublicInbox/Feed.pm          |   5 +-
 lib/PublicInbox/TestCommon.pm    |   8 +-
 lib/PublicInbox/WwwAtomStream.pm |   1 +
 t/plack.t                        | 177 ---
 4 files changed, 80 insertions(+), 111 deletions(-)

diff --git a/lib/PublicInbox/Feed.pm b/lib/PublicInbox/Feed.pm
index ee579f6d..e0810420 100644
--- a/lib/PublicInbox/Feed.pm
+++ b/lib/PublicInbox/Feed.pm
@@ -51,7 +51,10 @@ sub new_html_i {
my ($ctx, $eml) = @_;
$ctx->zmore($ctx->html_top) if exists $ctx->{-html_tip};
 
-   $eml and return PublicInbox::View::eml_entry($ctx, $eml);
+   if ($eml) {
+   $ctx->{smsg}->populate($eml) if !$ctx->{ibx}->{over};
+   return PublicInbox::View::eml_entry($ctx, $eml);
+   }
my $smsg = shift @{$ctx->{msgs}} or
$ctx->zmore(PublicInbox::View::pagination_footer(
$ctx, './new.html'));
diff --git a/lib/PublicInbox/TestCommon.pm b/lib/PublicInbox/TestCommon.pm
index 04adede0..55d82fc0 100644
--- a/lib/PublicInbox/TestCommon.pm
+++ b/lib/PublicInbox/TestCommon.pm
@@ -740,14 +740,18 @@ sub test_httpd ($$;$) {
$env->{$_} or BAIL_OUT "$_ unset";
}
SKIP: {
-   require_mods(qw(Plack::Test::ExternalServer), $skip // 1);
+   require_mods(qw(Plack::Test::ExternalServer LWP::UserAgent),
+   $skip // 1);
my $sock = tcp_server() or die;
my ($out, $err) = map { "$env->{TMPDIR}/std$_.log" } qw(out err);
my $cmd = [ qw(-httpd -W0), "--stdout=$out", "--stderr=$err" ];
my $td = start_script($cmd, $env, { 3 => $sock });
my ($h, $p) = tcp_host_port($sock);
local $ENV{PLACK_TEST_EXTERNALSERVER_URI} = "http://$h:$p";
-   Plack::Test::ExternalServer::test_psgi(client => $client);
+   my $ua = LWP::UserAgent->new;
+   $ua->max_redirect(0);
+   Plack::Test::ExternalServer::test_psgi(client => $client,
+   ua => $ua);
$td->join('TERM');
open my $fh, '<', $err or BAIL_OUT $!;
my $e = do { local $/; <$fh> };
diff --git a/lib/PublicInbox/WwwAtomStream.pm b/lib/PublicInbox/WwwAtomStream.pm
index 82895db6..7b7047ac 100644
--- a/lib/PublicInbox/WwwAtomStream.pm
+++ b/lib/PublicInbox/WwwAtomStream.pm
@@ -38,6 +38,7 @@ sub async_next ($) {
 sub async_eml { # for async_blob_cb
my ($ctx, $eml) = @_;
my $smsg = delete $ctx->{smsg};
+   $smsg->{mid} // $smsg->populate($eml);
$ctx->write(feed_entry($ctx, $smsg, $eml));
 }
 
diff --git a/t/plack.t b/t/plack.t
index a5fd54c9..20f5d8d5 100644
--- a/t/plack.t
+++ b/t/plack.t
@@ -9,6 +9,7 @@ my @mods = qw(HTTP::Request::Common Plack::Test URI::Escape);
 require_mods(@mods);
 foreach my $mod (@mods) { use_ok $mod; }
 ok(-f $psgi, "psgi example file found");
+my ($tmpdir, $for_destroy) = tmpdir();
 my $pfx = 'http://example.com/test';
 my $eml = eml_load('t/iso-2202-jp.eml');
 # ensure successful message deliveries
@@ -71,91 +72,74 @@ EOF
close $fh or BAIL_OUT "close: $!";
 });
 
-local $ENV{PI_CONFIG} = "$ibx->{inboxdir}/pi_config";
-my $app = require $psgi;
-test_psgi($app, sub {
+my $env = { PI_CONFIG => "$ibx->{inboxdir}/pi_config", TMPDIR => $tmpdir };
+local @ENV{keys %$env} = values %$env;
+my $c1 = sub {
my ($cb) = @_;
+   my $uri = $ENV{PLACK_TEST_EXTERNALSERVER_URI} // 'http://example.com';
+   $pfx = "$uri/test";
+
foreach my $u (qw(robots.txt favicon.ico .well-known/foo)) {
-   my $res = $cb->(GET("http://example.com/$u"));
+   my $res = $cb->(GET("$uri/$u"));
is($res->code, 404, "$u is missing");
}
-});
 
-test_psgi($app, sub {
-   my ($cb) = @_;
-   my $res = $cb->(GET('http://example.com/test/c...@example.com/'));
+   my $res = $cb->(GET("$uri/test/crlf\@example.com/")