On 8 January 2017 at 16:49, John D. Ament <[email protected]> wrote:
> Sebb,
>
> I'm not sure what additional setup has to be done.  Perhaps some of
> the changes to what's getting indexed and what mappings are applied have
> caused it to behave differently in 0.10?

No, in ES 5.x fielddata is disabled on text fields by default, so they
cannot be used for aggregations without a mapping change [1]

Changing this would require changing the subject mapping in mbox, and
also in notifications, which uses subject as well.
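
For reference, a minimal sketch of the kind of mapping change involved,
assuming the field stays a `text` field named `subject`. The surrounding
layout is illustrative only and is not the actual Pony Mail mapping:

```python
import json

# Hypothetical fragment of an ES 5.x mapping body; only "subject" is shown.
# Setting "fielddata" to True re-enables in-memory fielddata on a text field,
# which aggregations on that field need (at a cost in heap memory).
subject_mapping = {
    "properties": {
        "subject": {
            "type": "text",
            "fielddata": True,
        }
    }
}

print(json.dumps(subject_mapping, indent=2))
```

The same change would be needed in every mapping that defines subject.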

> Since Ponymail is a publicly consumed product, we shouldn't disable or
> break a feature just because it isn't used in the ASF hosted version.  I
> can see it being useful at some point in the ASF, so maybe it'll be
> reconsidered at a later point.

It's already disabled on lists.a.o; I assume that happened with the move
to ES 5.x, as I'm pretty sure it worked originally.

> Looking at the docs for significant_terms, its memory use is relative to
> the index size and the actual search being performed.  The notion of a
> large index isn't defined well, but I suspect it's going to be much larger
> than what is currently deployed.

That's not the only reason why I think significant terms is the wrong choice.

Significant terms works best when the background corpus is properly
representative of the emails being searched.

However the ASF index contains dev, user, commit, issue and other
lists which all have different basic subject characteristics.
It does not make sense to say that particular dev subject terms are
significant when there are more non-dev lists than dev lists.
Similarly with user, commits etc.
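
To illustrate the point, here is a toy sketch. It is not the scoring that
Elasticsearch uses; the `lift` function and the sample subjects are made up
for the example. It shows the same word scoring differently depending only
on what the background contains:

```python
def doc_share(word, subjects):
    """Fraction of subjects that contain the word (case-insensitive)."""
    hits = sum(1 for s in subjects if word in s.lower().split())
    return hits / len(subjects)


def lift(word, foreground, background):
    """Toy significance score: foreground share divided by background share.

    Assumes the word occurs at least once in the background.
    """
    return doc_share(word, foreground) / doc_share(word, background)


# Made-up sample data: every dev subject mentions "release".
dev_all = ["release vote", "release candidate", "release notes",
           "build failure release"]
commits = ["commit build", "commit docs", "commit site", "commit tests"]
foreground = dev_all[:2]          # e.g. the result of one dev-list query
whole_corpus = dev_all + commits  # mixed background, as in the ASF index

print(lift("release", foreground, dev_all))       # 1.0
print(lift("release", foreground, whole_corpus))  # 2.0
```

Against a dev-only background "release" is unremarkable; against the mixed
corpus it looks significant merely because the other lists never use it.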

If it were available for free, I would say why not?
But it is not free, so it has to pay its way.
I don't see how it helps, certainly for the ASF case.

For other installations, it may or may not make sense.
But even if it does make sense, the choice of how to derive the terms
(significant or not, and the algorithm used - e.g. should it be
chi-squared or something else?) needs to be defined by the
installation.
That's not possible currently.
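
A rough sketch of what an installation-defined choice could look like,
assuming hypothetical settings for the aggregation mode and heuristic.
None of these names exist in Pony Mail today, and the real code is Lua
in stats.lua; Python is used here only to show the shape of the request:

```python
def build_cloud_agg(mode="terms", heuristic=None, exclude=None, size=10):
    """Build the 'cloud' aggregation body from installation settings.

    mode:      "terms" or "significant_terms" (hypothetical setting)
    heuristic: e.g. "chi_square"; only used with significant_terms
    exclude:   regular expression of terms to leave out
    """
    if mode not in ("terms", "significant_terms"):
        raise ValueError("unsupported mode: %s" % mode)
    body = {"field": "subject", "size": size}
    if exclude:
        body["exclude"] = exclude
    if mode == "significant_terms" and heuristic:
        # An empty object selects the heuristic with its default settings,
        # as in the chi_square = {} that the commit removed.
        body[heuristic] = {}
    return {"cloud": {mode: body}}


# What the commit does now, and what it replaced:
print(build_cloud_agg(exclude=".|..|..."))
print(build_cloud_agg(mode="significant_terms", heuristic="chi_square"))
```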

Also, why only analyse the subject?
Are e-mail subjects really a good indicator of the mail content?

I'm sorry, it looks nice, but I just don't see the point, given that
it is expensive.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.1/fielddata.html#_fielddata_is_disabled_on_literal_text_literal_fields_by_default

> John
>
>
>
> On Sun, Jan 8, 2017 at 6:19 AM sebb <[email protected]> wrote:
>
>> Although the feature looks neat, it is not free (can use a lot of
>> memory), and requires additional configuration to run under ES 5.x.
>>
>> It's not currently enabled on lists.a.o AFAICT, probably because of
>> the need to reconfigure one of the mappings.
>>
>> So the first thing to establish is whether the feature gives enough
>> benefit to be worth the extra cost in memory and processing?
>>
>> Does it really help to know what words are used in email subjects?
>> I'm not sure it does, beyond curiosity.
>>
>> If the benefit does justify the cost, then the next thing to establish
>> is the best measure to use.
>>
>>
>> On 8 January 2017 at 00:13, sebb <[email protected]> wrote:
>> > I see hot topics as being the topics which are most common in the set
>> > of messages under consideration.
>> >
>> > Significant terms are those which are unusual against the background,
>> > which in this case is the entire ASF mail corpus.
>> >
>> > There are so many different projects with different topics that I
>> > don't see how it helps to use that corpus as the background.
>> >
>> > In theory it is possible to change the background for each request,
>> > but that is likely to be very expensive.
>> >
>> > On 8 January 2017 at 00:02, Daniel Gruno <[email protected]> wrote:
>> >> Forwarding to dev list, which is where it belongs :)
>> >>
>> >>
>> >> -------- Forwarded Message --------
>> >> Subject: Re: incubator-ponymail git commit: 'hot topics' feature should
>> >> use terms, not significant_terms
>> >> Date: Sun, 8 Jan 2017 01:01:07 +0100
>> >> From: Daniel Gruno <[email protected]>
>> >> Reply-To: [email protected]
>> >> To: [email protected]
>> >>
>> >> On 01/08/2017 12:55 AM, [email protected] wrote:
>> >>> Repository: incubator-ponymail
>> >>> Updated Branches:
>> >>>   refs/heads/master e153c4abc -> 2ebf5e7a7
>> >>>
>> >>>
>> >>> 'hot topics' feature should use terms, not significant_terms
>> >>>
>> >>> This fixes #329
>> >>>
>> >>> Project:
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
>> >>> Commit:
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/2ebf5e7a
>> >>> Tree:
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/2ebf5e7a
>> >>> Diff:
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/2ebf5e7a
>> >>>
>> >>> Branch: refs/heads/master
>> >>> Commit: 2ebf5e7a735f54042c6c59d80a932bb4bc6a96cd
>> >>> Parents: e153c4a
>> >>> Author: Sebb <[email protected]>
>> >>> Authored: Sat Jan 7 23:55:21 2017 +0000
>> >>> Committer: Sebb <[email protected]>
>> >>> Committed: Sat Jan 7 23:55:21 2017 +0000
>> >>>
>> >>> ----------------------------------------------------------------------
>> >>>  CHANGELOG.md       | 1 +
>> >>>  site/api/stats.lua | 6 ++++--
>> >>>  tools/setup.py     | 1 +
>> >>>  3 files changed, 6 insertions(+), 2 deletions(-)
>> >>> ----------------------------------------------------------------------
>> >>>
>> >>>
>> >>>
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/2ebf5e7a/CHANGELOG.md
>> >>> ----------------------------------------------------------------------
>> >>> diff --git a/CHANGELOG.md b/CHANGELOG.md
>> >>> index b440207..14e9c30 100644
>> >>> --- a/CHANGELOG.md
>> >>> +++ b/CHANGELOG.md
>> >>> @@ -109,6 +109,7 @@
>> >>>  - absolute URLs must be prefixed with URLBase in JS files (#327)
>> >>>  - cannot use absolute URLs in HTML pages (#328)
>> >>>  - setup.py now prompts for shard and replica counts when creating the
>> index (#313)
>> >>> +- 'hot topics' feature should use terms, not significant_terms (#329)
>> >>>
>> >>>  ## CHANGES in 0.9b:
>> >>>
>> >>>
>> >>>
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/2ebf5e7a/site/api/stats.lua
>> >>> ----------------------------------------------------------------------
>> >>> diff --git a/site/api/stats.lua b/site/api/stats.lua
>> >>> index a8f11ec..62da9f1 100644
>> >>> --- a/site/api/stats.lua
>> >>> +++ b/site/api/stats.lua
>> >>> @@ -30,6 +30,8 @@ local days = {
>> >>>  }
>> >>>
>> >>>  local BODY_MAXLEN = config.stats_maxBody or 200
>> >>> +-- words to exclude from word cloud:
>> >>> +local EXCLUDE = config.stats_wordExclude or ".|..|..."
>> >>>
>> >>>  local function sortEmail(thread)
>> >>>      if thread.children and type(thread.children) == "table" then
>> >>> @@ -411,10 +413,10 @@ function handle(r)
>> >>>              terminate_after = 100,
>> >>>              aggs = {
>> >>>                  cloud = {
>> >>> -                    significant_terms =  {
>> >>> +                    terms =  {
>> >>>                          field =  "subject",
>> >>>                          size = 10,
>> >>> -                        chi_square = {}
>> >>> +                        exclude = EXCLUDE
>> >>>                      }
>> >>>                  }
>> >>>              },
>> >>
>> >> Exqueeze me? significant_terms is specifically used so that, for instance,
>> >> apache lists don't get "apache" as a hot topic. It has to be a topic
>> >> that is "trending in this query, but not in general", not "what do we
>> >> have most of around here". If we just use terms, the result becomes
>> >> utterly useless, as it does not take into account how common those terms
>> >> are in, let's say, the ASF in general.
>> >>
>> >> Consider this a -1 to that commit unless you can convince me otherwise.
>> >>
>> >> With regards,
>> >> Daniel.
>> >>
>> >>>
>> >>>
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/2ebf5e7a/tools/setup.py
>> >>> ----------------------------------------------------------------------
>> >>> diff --git a/tools/setup.py b/tools/setup.py
>> >>> index c02116a..19d4bd5 100755
>> >>> --- a/tools/setup.py
>> >>> +++ b/tools/setup.py
>> >>> @@ -501,6 +501,7 @@ local config = {
>> >>>      full_headers = false,
>> >>>      maxResults = 5000, -- max emails to return in one go. Might need
>> to be bumped for large lists
>> >>>  --  stats_maxBody = 200, -- max size of body snippet returned by
>> stats.lua
>> >>> +--  stats_wordExclude = ".|..|...", -- patterns to exclude from word
>> cloud generated by stats.lua
>> >>>      admin_oauth = {}, -- list of domains that may do administrative
>> oauth (private list access)
>> >>>                       -- add 'www.googleapis.com' to the list for
>> google oauth to decide, for instance.
>> >>>      oauth_fields = { -- used for specifying individual oauth handling
>> parameters.
>> >>>
>> >>
>>
