Re: Index not_analysed for fields used as ids?

sebb Mon, 07 Nov 2016 08:13:01 -0800

On 7 November 2016 at 15:07, John D. Ament <[email protected]> wrote:
> On Mon, Nov 7, 2016 at 9:54 AM sebb <[email protected]> wrote:
>
>> On 7 November 2016 at 14:36, John D. Ament <[email protected]> wrote:
>> > On Mon, Nov 7, 2016 at 9:23 AM sebb <[email protected]> wrote:
>> >
>> >> On 7 November 2016 at 01:36, John D. Ament <[email protected]>
>> wrote:
>> >> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
>> >> >
>> >> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]>
>> >> wrote:
>> >> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> >> > Fields such as message-id are stored as text strings, but they
>> are
>> >> >> >> > only really intended to be used as ids. They don't contain
>> >> independent
>> >> >> >> > text parts.
>> >> >> >> >
>> >> >> >> > From what I have understood so far from reading the ES docs,
>> such
>> >> >> >> > fields should be tagged as
>> >> >> >> >
>> >> >> >> > "index": "not_analyzed"
>> >> >> >> >
>> >> >> >> > AIUI this reduces the analysis overhead and storage
>> requirements,
>> >> and
>> >> >> >> > also makes it harder to find fields with
>> >> >> >> > This probably applies to other fields in "mbox":
>> >> >> >> >
>> >> >> >> > mid
>> >> >> >> > possibly in-reply-to
>> >> >> >> > also references
>> >> >> >> >
>> >> >> >> > And of course the auto-created fields such as attachments
>> >> >> >> >
>> >> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >> >
>> >> >> >> > notifications
>> >> >> >> > account
>> >> >> >> > mailinglists
>> >> >> >> >
>> >> >> >> > These are internal use only so are not intended for searching.
>> >> >> >> >
>> >> >> >> > Or have I got this completely wrong?
>> >> >> >> >
>> >> >> >>
>> >> >> >> message-id is set to not be analyzed, by the setup script (it's in
>> >> the
>> >> >> >> mappings it sends to ES when creating the index). mid and
>> in-reply-to
>> >> >> >> should probably also be not analyzed, although mid is really a
>> copy
>> >> of
>> >> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> >> list_raw), neither is the raw from address
>> >> >> >>
>> >> >> >
>> >> >> > So I notice the query process is an arbitrary full text query,
>> which
>> >> runs
>> >> >> > against _all.
>> >> >> >
>> >> >>
>> >>
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >> >>
>> >> >> Huh?
>> >> >>
>> >> >> The query starts:
>> >> >>
>> >> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >> >>
>> >> >> where
>> >> >>
>> >> >> es_url = "http://localhost:9200/ponymail/";
>> >> >>
>> >> >> and
>> >> >>
>> >> >> doc = "mbox" by default.
>> >> >>
>> >> >> Where does the _all come in?
>> >> >>
>> >> >
>> >> > When you do a query string query in elastic search (reference:
>> >> >
>> >>
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
>> >> )
>> >> > the default field unless specified is "_all".  I can't find anything
>> in
>> >> the
>> >> > pony code that changes this field.  As a result, it's going to search
>> _all
>> >> > by default.
>> >>
>> >> stats.lua changes the generic query into:
>> >>
>> >> "query_string": {
>> >>   "default_field": "subject",
>> >>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
>> >> }
>> >>
>> >> Which does not use the _all field AFAICT
>> >>
>> >
>> > Ok, this is what I was looking for ( but couldn't find ).  But to
>> reiterate
>> > my notes from above - this means that the only mappings that matter are
>> > these fields.  Other field mappings don't matter.
>> >
>>
>> Surely all the text fields 'matter' - i.e. need to have a mapping?
>> Otherwise the default is to analyse them.
>>
>>
> Not based on the query in use.  The only three fields being searched are
> "from", "subject" and "body" - so only their mappings matter when doing
> search.


AFAICT they aren't the only fields that are searched by the code.

They are the only ones used by the search function, but internally
the code also searches the mbox type on at least the following fields:

message-id
mid
in-reply-to

It also searches notifications on:
recipient, seen
and mailinglists on:
name

(search *.lua for 'elastic.find')
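
To make the point concrete, here is a minimal sketch of what explicit
mappings for these lookup fields could look like. It assumes the ES 2.x
mapping syntax ("type": "string" with "index": "not_analyzed"); the names
ID_FIELDS and not_analyzed_mappings are made up for illustration and are
not taken from setup.py. The "seen" field is left out because its data
type is not stated in this thread.

```python
import json

# Fields this thread lists as being looked up by exact value, per doc type.
# "seen" is omitted: its data type is not stated, so no mapping is guessed.
ID_FIELDS = {
    "mbox": ["message-id", "mid", "in-reply-to"],
    "notifications": ["recipient"],
    "mailinglists": ["name"],
}


def not_analyzed_mappings(id_fields):
    """Build ES 2.x-style mappings that mark each listed field not_analyzed."""
    return {
        doc_type: {
            "properties": {
                field: {"type": "string", "index": "not_analyzed"}
                for field in fields
            }
        }
        for doc_type, fields in id_fields.items()
    }


mappings = not_analyzed_mappings(ID_FIELDS)

# Body that could be sent when creating the index (hypothetical usage).
print(json.dumps({"mappings": mappings}, indent=2))
```

Fields not listed here would still get whatever default mapping ES
assigns, which is the concern raised above.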

> One of the concepts behind ES is that you model your index based on the
> queries you want to execute.  There's two points of view on that, only
> store the things that are relevant, or make everything relevant.

In this case, ES is also being used as a general-purpose database
(account, notifications, mailinglists).
These are in the same index, so there is no one set of queries that
applies to all doc types.

So I suspect neither point of view is completely appropriate here.

>
>> It's just a question of whether a field is used for searching, and if
>> so, what type(s) of searches are done.
>>
>> It looks like from/subject/body need to support word matching, so need
>> to be analysed.
>>
>
> We may want to consider things like partial match as well - fuzziness
> ranking, ngrams, etc.

Well yes, but does that affect the mapping choice?

>
>>
>> However message id and many other fields need only support keyword
>> matching.
>> So these only need to be indexed.
>>
>
> Yes and no.  ES 5 introduced the concept of an enum type which may be what
> message-id should be pointing to.

I cannot find a reference to an 'enum' type.
Do you mean 'keyword'? [1]

> Email message IDs include some of the
> stop characters in there "-" which need to be treated specially in queries.

Surely stop characters only apply to fields that are analyzed?
That is why such fields need to be set up as not_analyzed (or keyword in 5.0).
This allows searching by exact value; no need to use a special query.
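
As a rough sketch of the exact-value approach: the helpers below pick the
field mapping for either ES version and build a plain term query. The
function names and the sample message-id are invented for illustration;
only the mapping settings (not_analyzed, keyword) and the term query come
from the ES docs.

```python
import json


def id_field_mapping(es_major_version):
    """Return the mapping for an id-like field on the given ES major version."""
    if es_major_version >= 5:
        # ES 5.0 replaces not_analyzed strings with the keyword type.
        return {"type": "keyword"}
    return {"type": "string", "index": "not_analyzed"}


def exact_match_query(field, value):
    """Build a term query, which matches the stored value as-is."""
    return {"query": {"term": {field: value}}}


# Hypothetical message-id containing '-' and '@'.
query = exact_match_query("message-id", "<abc-123@example.org>")
print(json.dumps(id_field_mapping(2)))
print(json.dumps(id_field_mapping(5)))
print(json.dumps(query))
```

Because the field is not analyzed, the value is indexed as a single term,
so the '-' in the id needs no escaping in the term query.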

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.0/keyword.html

>
>>
>> >>
>> >> >
>> >> >>
>> >> >> > unless
>> >> >> > I need to dig into it a bit further to see if there's something
>> >> building
>> >> >> up
>> >> >> > query a bit different.
>> >> >> >
>> >> >> > So... that means most of these mappings are moot.
>> >> >>
>> >>
>>
