Re: Proof of concept for counting messages in thread

2023-02-18 Thread Michael J Gruber
On Tue, 14 Feb 2023 at 02:47, David Bremner wrote:
>
> Michael J Gruber  writes:
>
> > That is really weird:
> > ```
> > xapian-delve -t G00021229 .
> > Posting List for term 'G00021229' (termfreq 115, collfreq 0,
> > wdf_max 0): 146259 ...
> > ```
> > with 115 record numbers, all different.
> > Doing `xapian-delve -1r` for each of them and grepping for the G-lines
> > gives 115 times that correct thread id.
> > Grepping for the Q-lines and notmuch-searching for the message ids
> > gives only 5 results (the expected ones). Apparently, there are bogus
> > mail records which that thread points to.
>
> 1) Do those "bogus" records have a "Tghost" term? That would be for
> messages that are known via references, but not actually in the local
> database. This is a bug / feature of the current implementation, it
> counts all messages known, whether or not local copies exist.

Yes, the extra ones are all ghosts, and I vaguely remember that they
have scared me before ...

These ghosts appear to be pretty common. I am often added to an
existing discussion thread for which I do not have all the referenced
messages. I'd go as far as to say that counting ghosts as thread
members makes this useless for me. On the other hand, notmuch's own
count gets this right, and getting two different counts is even more
confusing.
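
To make the two counts concrete, here is a minimal sketch, not notmuch code. It assumes each database record is reduced to its set of terms and that ghost records carry the "Tghost" term mentioned above. The names TermSet, ThreadCounts and count_thread_members are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a database record: only its terms.
using TermSet = std::set<std::string>;

struct ThreadCounts {
    std::size_t all;    // every record carrying the thread term
    std::size_t local;  // the same, minus records marked as ghosts
};

// Count the records that carry thread_term, with and without ghosts.
// Assumption: a ghost is recognisable by the term "Tghost".
ThreadCounts
count_thread_members (const std::vector<TermSet> &docs,
                      const std::string &thread_term)
{
    ThreadCounts counts = { 0, 0 };

    for (const TermSet &terms : docs) {
        if (terms.count (thread_term) == 0)
            continue;               // record belongs to another thread
        ++counts.all;
        if (terms.count ("Tghost") == 0)
            ++counts.local;         // a message with a local copy
    }
    return counts;
}
```

With the numbers from this thread (5 local messages plus 110 ghosts under G00021229), `all` would be 115, matching the termfreq reported by xapian-delve, while `local` would be 5, matching what notmuch search returns.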

> 2) Do they have more than one G term? That suggests a bug somewhere. We
> actually have a test in the test suite [1] for that, but of course that is
> with a simple artificial database.

No, they all have one. But their sheer number looks suspicious: those
5 "real" e-mails have maybe 20 reference headers in total, and some of
them refer to some of those 5. Grepping the account store for those
references gives me around that number. Where do the 110 ghosts (90
extra) come from which this thread points to? Still scared by them ...
we need ghost busters!

Michael
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org


[PATCH 5/6] WIP/test: (count ...) tests

2023-02-18 Thread David Bremner
The most interesting case is probably tags with a small number of
uses, since these could be typos or similar errors.
---
 test/T083-sexpr-count.sh | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/test/T083-sexpr-count.sh b/test/T083-sexpr-count.sh
index 858aa8bf..1be3f62d 100755
--- a/test/T083-sexpr-count.sh
+++ b/test/T083-sexpr-count.sh
@@ -133,4 +133,12 @@ id:87iqd9rn3l.fsf@vertex.dottedmag
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "messages with tags used by 4 messages"
+output=$(notmuch count --output=messages --query=sexp '(tag (count 4))')
+test_expect_equal "${output}" "4"
+
+test_begin_subtest "no tag is used less than 4 times"
+output=$(notmuch count --output=messages --query=sexp '(tag (count * 3))')
+test_expect_equal "${output}" "0"
+
 test_done
-- 
2.39.1



[PATCH 4/6] WIP/test: pathname related tests

2023-02-18 Thread David Bremner
---
 test/T083-sexpr-count.sh | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/test/T083-sexpr-count.sh b/test/T083-sexpr-count.sh
index f3010d11..858aa8bf 100755
--- a/test/T083-sexpr-count.sh
+++ b/test/T083-sexpr-count.sh
@@ -112,4 +112,25 @@ notmuch@notmuchmail.org
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "attachment filenames with unique words"
+notmuch show --entire-thread=false --query=sexp '(attachment (count 1))' | \
+sed -n -e 's/, Content-type:.*$//' -e 's/.*Filename: //p' | sort > OUTPUT
+cat <<EOF > EXPECTED
+0001-Deal-with-situation-where-sysconf-_SC_GETPW_R_SIZE_M.patch
+0001-Error-out-if-no-query-is-supplied-to-search-instead-.patch
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest "messages in folders with several other messages"
+output=$(notmuch count --output=messages --query=sexp '(folder (count 28 *))')
+test_expect_equal "${output}" "28"
+
+test_begin_subtest "messages alone in a directory"
+notmuch search --output=messages --query=sexp '(path (count 1))' > OUTPUT
+cat <<EOF > EXPECTED
+id:87lji4lx9v@yoom.home.cworth.org
+id:87iqd9rn3l.fsf@vertex.dottedmag
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
-- 
2.39.1



[PATCH 2/6] WIP/lib: support count modifier in sexp queries

2023-02-18 Thread David Bremner
In this initial commit, support all term based fields, but only
document/test the thread size feature.
---
 lib/parse-sexp.cc| 65 ++--
 test/T083-sexpr-count.sh | 30 +++
 2 files changed, 79 insertions(+), 16 deletions(-)
 create mode 100755 test/T083-sexpr-count.sh

diff --git a/lib/parse-sexp.cc b/lib/parse-sexp.cc
index 9cadbc13..efe564c7 100644
--- a/lib/parse-sexp.cc
+++ b/lib/parse-sexp.cc
@@ -34,6 +34,8 @@ typedef enum {
 SEXP_FLAG_ORPHAN   = 1 << 8,
 SEXP_FLAG_RANGE= 1 << 9,
 SEXP_FLAG_PATHNAME = 1 << 10,
+SEXP_FLAG_COUNT= 1 << 11,
+SEXP_FLAG_MODIFIER = 1 << 12,
 } _sexp_flag_t;
 
 /*
@@ -65,24 +67,28 @@ static _sexp_prefix_t prefixes[] =
 { "and", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
   SEXP_FLAG_NONE },
 { "attachment", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
-  SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_EXPAND | SEXP_FLAG_COUNT},
 { "body", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
   SEXP_FLAG_FIELD },
 { "date", Xapian::Query::OP_INVALID, Xapian::Query::MatchAll,
-  SEXP_FLAG_RANGE },
+  SEXP_FLAG_FIELD | SEXP_FLAG_RANGE },
+{ "count", Xapian::Query::OP_INVALID, Xapian::Query::MatchAll,
+  SEXP_FLAG_MODIFIER | SEXP_FLAG_RANGE },
 { "from", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
-  SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND
+  | SEXP_FLAG_COUNT },
 { "folder", Xapian::Query::OP_OR, Xapian::Query::MatchNothing,
-  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND |
-  SEXP_FLAG_PATHNAME },
+  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX
+  | SEXP_FLAG_EXPAND | SEXP_FLAG_PATHNAME | SEXP_FLAG_COUNT },
 { "id", Xapian::Query::OP_OR, Xapian::Query::MatchNothing,
   SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX },
 { "infix", Xapian::Query::OP_INVALID, Xapian::Query::MatchAll,
   SEXP_FLAG_SINGLE | SEXP_FLAG_ORPHAN },
 { "is", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
-  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD |
+  SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND  | SEXP_FLAG_COUNT },
 { "lastmod", Xapian::Query::OP_INVALID, Xapian::Query::MatchAll,
-  SEXP_FLAG_RANGE },
+  SEXP_FLAG_FIELD | SEXP_FLAG_RANGE },
 { "matching", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
   SEXP_FLAG_DO_EXPAND },
 { "mid", Xapian::Query::OP_OR, Xapian::Query::MatchNothing,
@@ -97,9 +103,10 @@ static _sexp_prefix_t prefixes[] =
   SEXP_FLAG_NONE },
 { "path", Xapian::Query::OP_OR, Xapian::Query::MatchNothing,
   SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX |
-  SEXP_FLAG_PATHNAME },
+  SEXP_FLAG_PATHNAME | SEXP_FLAG_COUNT},
 { "property", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
-  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD |
+  SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND | SEXP_FLAG_COUNT },
 { "query", Xapian::Query::OP_INVALID, Xapian::Query::MatchNothing,
   SEXP_FLAG_SINGLE | SEXP_FLAG_ORPHAN },
 { "regex", Xapian::Query::OP_INVALID, Xapian::Query::MatchAll,
@@ -109,13 +116,16 @@ static _sexp_prefix_t prefixes[] =
 { "starts-with", Xapian::Query::OP_WILDCARD, Xapian::Query::MatchAll,
   SEXP_FLAG_SINGLE },
 { "subject", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
-  SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND
+  | SEXP_FLAG_COUNT },
 { "tag", Xapian::Query::OP_AND, Xapian::Query::MatchAll,
-  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX
+  | SEXP_FLAG_EXPAND | SEXP_FLAG_COUNT},
 { "thread", Xapian::Query::OP_OR, Xapian::Query::MatchNothing,
-  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+  SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX |
+

[PATCH 3/6] WIP/test: (count ...) tests for to / from

2023-02-18 Thread David Bremner
---
 test/T083-sexpr-count.sh | 85 
 1 file changed, 85 insertions(+)

diff --git a/test/T083-sexpr-count.sh b/test/T083-sexpr-count.sh
index e825ef3d..f3010d11 100755
--- a/test/T083-sexpr-count.sh
+++ b/test/T083-sexpr-count.sh
@@ -27,4 +27,89 @@ cat  OUTPUT
+cat  OUTPUT
+cat  OUTPUT
+cat OUTPUT
+cat  OUTPUT
+cat  OUTPUT
+cat 

[PATCH 6/6] WIP/tests: (count ...) tests for subject

2023-02-18 Thread David Bremner
---
 test/T083-sexpr-count.sh | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/test/T083-sexpr-count.sh b/test/T083-sexpr-count.sh
index 1be3f62d..b1c0a3ac 100755
--- a/test/T083-sexpr-count.sh
+++ b/test/T083-sexpr-count.sh
@@ -141,4 +141,12 @@ test_begin_subtest "no tag is used less than 4 times"
 output=$(notmuch count --output=messages --query=sexp '(tag (count * 3))')
 test_expect_equal "${output}" "0"
 
+test_begin_subtest "subjects with unique words"
+notmuch search --query=sexp '(and (from gusarov) (subject (count 1)))' | notmuch_search_sanitize > OUTPUT
+cat 

WIP: add a (count ...) modifier for sexp-queries

2023-02-18 Thread David Bremner
This updates and obsoletes the series at [1]. The backend that
constructs queries is unmodified from that series, but the parser now
allows the use of the count modifier in several more places. Basically
there is not much more code to maintain to do it for any field based
on terms (notably not date), but the utility is a bit unclear in some
cases.

In (probably) decreasing order of how compelling the use case is:

1) (and (from bob) (thread (count 1))) # find messages from bob nobody replied to yet.

2) (and (subject init-systems) (thread (count 200 *))) # find me a mega thread on some topic

3) (and (to bremner) (from (count 2 *))) # find people that sent me at least 2 messages.

4) (tag (count 1)) # find tags used only once

5) (path (count 1)) # find messages alone in a directory

6) (subject (count 1)) # find words used only once in subjects
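
The three forms used in this series, (count N), (count N *) and (count * N), can be read as inclusive ranges. The following is only a sketch of that reading, not the parser from the patch. CountRange, parse_count_atom and count_args_to_range are hypothetical names, and treating a lone `*` as "any count" is an assumption.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Inclusive range of occurrence counts.
struct CountRange {
    long lo;
    long hi;
    bool contains (long n) const { return lo <= n && n <= hi; }
};

// "*" means "open on this side" and yields the fallback; anything
// else must be a non-negative decimal integer with no trailing junk.
bool
parse_count_atom (const std::string &atom, long fallback, long &out)
{
    if (atom == "*") {
        out = fallback;
        return true;
    }
    try {
        std::size_t used = 0;
        out = std::stol (atom, &used);
        return used == atom.size () && out >= 0;
    } catch (const std::logic_error &) {
        return false;   // not a number, or out of range for long
    }
}

// args holds the atoms that follow "count" in the s-expression.
// One atom N means exactly N; two atoms give the lower and upper bound.
bool
count_args_to_range (const std::vector<std::string> &args, CountRange &range)
{
    if (args.size () == 1)
        return parse_count_atom (args[0], 0, range.lo) &&
               parse_count_atom (args[0], LONG_MAX, range.hi);
    if (args.size () == 2)
        return parse_count_atom (args[0], 0, range.lo) &&
               parse_count_atom (args[1], LONG_MAX, range.hi);
    return false;
}
```

Under this reading, (thread (count 200 *)) accepts any thread term found on 200 or more messages, and (tag (count * 3)) accepts tags used at most three times.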

[1]: id:20230213122631.2088558-1-da...@tethera.net




[PATCH 1/6] WIP/lib: add count query backend

2023-02-18 Thread David Bremner
---
 lib/Makefile.local |  3 +-
 lib/count-query.cc | 62 ++
 lib/database-private.h |  6 
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 lib/count-query.cc

diff --git a/lib/Makefile.local b/lib/Makefile.local
index 4e766305..cc646946 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -66,7 +66,8 @@ libnotmuch_cxx_srcs = \
$(dir)/init.cc  \
$(dir)/parse-sexp.cc\
$(dir)/sexp-fp.cc   \
-   $(dir)/lastmod-fp.cc
+   $(dir)/lastmod-fp.cc\
+   $(dir)/count-query.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
 
diff --git a/lib/count-query.cc b/lib/count-query.cc
new file mode 100644
index 00000000..5d258880
--- /dev/null
+++ b/lib/count-query.cc
@@ -0,0 +1,62 @@
+/* count-query.cc - generate queries for terms on few / many messages.
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2023 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: David Bremner 
+ */
+
+#include "database-private.h"
+
+notmuch_status_t
+_notmuch_count_strings_to_query (notmuch_database_t *notmuch, std::string field,
+const std::string &from, const std::string &to,
+Xapian::Query &output, std::string &msg)
+{
+
+long from_idx = 0, to_idx = LONG_MAX;
+std::string term_prefix = _find_prefix (field.c_str ());
+std::vector<std::string> terms;
+
+if (! from.empty ()) {
+   try {
+   from_idx = std::stol(from);
+   } catch (std::logic_error &e) {
+   msg = "bad 'from' count: '" + from + "'";
+   return NOTMUCH_STATUS_BAD_QUERY_SYNTAX;
+   }
+}
+
+if (! to.empty ()) {
+   try {
+   to_idx = std::stol(to);
+   } catch (std::logic_error &e) {
+   msg = "bad 'to' count: '" + to + "'";
+   return NOTMUCH_STATUS_BAD_QUERY_SYNTAX;
+   }
+}
+
+for (Xapian::TermIterator it = notmuch->xapian_db->allterms_begin (term_prefix);
+it != notmuch->xapian_db->allterms_end (); ++it) {
+   Xapian::doccount freq = it.get_termfreq();
+   if (from_idx <= freq && freq <= to_idx)
+   terms.push_back (*it);
+}
+
+output = Xapian::Query (Xapian::Query::OP_OR, terms.begin (), terms.end ());
+return NOTMUCH_STATUS_SUCCESS;
+}
diff --git a/lib/database-private.h b/lib/database-private.h
index b9be4e22..ba96a93c 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -387,5 +387,11 @@ notmuch_status_t
 _notmuch_lastmod_strings_to_query (notmuch_database_t *notmuch,
   const std::string &from, const std::string &to,
   Xapian::Query &output, std::string &msg);
+
+/* count-query.cc */
+notmuch_status_t
+_notmuch_count_strings_to_query (notmuch_database_t *notmuch, std::string field,
+const std::string &from, const std::string &to,
+Xapian::Query &output, std::string &msg);
 #endif
 #endif
-- 
2.39.1
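
The term selection in count-query.cc can be tried without a Xapian database. The sketch below is a stand-in under stated assumptions, not the patch itself: a std::map from term to document count replaces the Xapian term list, and TermFreqs, parse_bound and terms_in_count_range are hypothetical names. Unlike the patch, it also rejects bounds with trailing characters such as "3x".

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the Xapian term list:
// prefixed term -> number of documents containing it.
using TermFreqs = std::map<std::string, long>;

// An empty bound keeps the fallback; otherwise the whole string must
// parse as a decimal integer. Returns false on malformed input.
bool
parse_bound (const std::string &text, long fallback, long &out)
{
    out = fallback;
    if (text.empty ())
        return true;
    try {
        std::size_t used = 0;
        out = std::stol (text, &used);
        return used == text.size ();
    } catch (const std::logic_error &) {
        return false;   // std::invalid_argument or std::out_of_range
    }
}

// Collect every term starting with prefix whose frequency lies in
// [from, to]. An empty from means 0, an empty to means no upper limit.
bool
terms_in_count_range (const TermFreqs &freqs, const std::string &prefix,
                      const std::string &from, const std::string &to,
                      std::vector<std::string> &terms)
{
    long lo, hi;

    if (! parse_bound (from, 0, lo) || ! parse_bound (to, LONG_MAX, hi))
        return false;

    terms.clear ();
    // std::map is ordered, so terms sharing the prefix are contiguous.
    for (auto it = freqs.lower_bound (prefix);
         it != freqs.end () &&
         it->first.compare (0, prefix.size (), prefix) == 0;
         ++it) {
        if (lo <= it->second && it->second <= hi)
            terms.push_back (it->first);
    }
    return true;
}
```

In the real backend the collected terms are then combined with Xapian::Query::OP_OR, so a message matches if it carries any term whose frequency falls within the range.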
