On 7 November 2016 at 02:21, John D. Ament <[email protected]> wrote:
> On Sun, Nov 6, 2016 at 9:03 PM sebb <[email protected]> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <[email protected]> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]>
>> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]>
>> >> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain
>> >> >> > independent text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements,
>> >> >> > and also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed by the setup script (it's in
>> >> >> the mappings it sends to ES when creating the index). mid and
>> >> >> in-reply-to should probably also be not analyzed, although mid is
>> >> >> really a copy of the doc ID, IIRC. The list ID is also not analyzed
>> >> >> by default (as list_raw), and neither is the raw from address.
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which
>> >> > runs against _all.
>> >> >
>> >>
>> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/";
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in Elasticsearch (reference:
>> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
>> > the default field, unless specified, is "_all".  I can't find anything in
>> > the pony code that changes this field.  As a result, it's going to
>> > search _all by default.
>> >
>>
>> Sorry, I thought you were referring to the _all doc type.
>>
>> But I'm not sure what this has to do with my original e-mail about
>> which fields should be indexed, and which should not.
>>
>
> Everything actually.

I assume you mean everything should *not* be indexed?
That will surely depend on whether there are any specific field searches,
e.g. Subject and From are shown as separate fields in the Advanced search.
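
To make the distinction concrete, here is a rough Python sketch of how a
field-specific query string differs from a bare one. The helper name
build_search_url is made up for illustration; the URL and the "mbox" doc
type are the values quoted earlier in this thread, and "subject" as a field
name is an assumption. This is not how Pony Mail itself builds its queries.

```python
from urllib.parse import quote

ES_URL = "http://localhost:9200/ponymail/"  # value quoted earlier in the thread


def build_search_url(query, doc="mbox", field=None):
    """Return a URI-search URL; restrict to one field if `field` is given.

    Hypothetical helper, for illustration only.
    """
    if field is not None:
        # "field:(terms)" limits the query-string search to that field
        query = "%s:(%s)" % (field, query)
    # Percent-encode everything so ':' and parentheses survive in the URL
    return ES_URL + doc + "/_search?q=" + quote(query, safe="")


# No field given: the search falls back to the default field
print(build_search_url("mappings"))
# Field given: only that field is searched
print(build_search_url("mappings", field="subject"))
```

Only the first form depends on the default field; the second names its
field explicitly, so it relies on that field's own mapping.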

> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html

In which case we should disable the _all field for all but the mbox mapping.

Most of those will not have many documents, apart from mbox_source,
and that does not have many text fields.
So maybe it won't make much difference.
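
Something along these lines is what I have in mind, as a sketch only. The
list of doc types is taken from this thread and may not match a given
install; the function name is invented; and the "_all": {"enabled": false}
setting is the one described on the mapping-all-field page linked above.
I have not tried this against setup.py.

```python
import json

# Doc types mentioned in this thread; treat the exact list as an assumption.
NON_SEARCH_TYPES = ["mbox_source", "notifications", "account", "mailinglists"]


def mappings_without_all(doc_types):
    """Build a mappings fragment that disables _all for each doc type."""
    return {name: {"_all": {"enabled": False}} for name in doc_types}


mappings = mappings_without_all(NON_SEARCH_TYPES)
# mbox is deliberately left out, so it keeps its _all field
print(json.dumps({"mappings": mappings}, indent=2, sort_keys=True))
```

The fragment would still need merging with the field definitions each doc
type already has.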

> Basically, the mappings we use are moot on the individual fields (except
> for the epoch field) since all searches are performed against the _all
> field's value, which is just a big blob of everything smushed together.

Since epoch is double (why is it not long?), not a string, it's not
analysed anyway.

> Although the interesting thing, I just tried searching by message ID, and
> that doesn't seem to work on the ASF version out there -
> https://lists.apache.org/[email protected]:lte=1M:%[email protected]%3E

message-id is flagged as not_analyzed; maybe that excludes it from _all?
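
Whatever the answer, an id lookup does not have to go through _all. A
minimal sketch of an exact-match request body that names the field
directly, assuming the field is called "message-id" and is not analyzed as
described above (the function name and the sample id are made up):

```python
import json


def message_id_query(message_id):
    """Exact-match lookup on the message-id field, bypassing _all."""
    # A term query compares against the stored value without analysis
    return {"query": {"term": {"message-id": message_id}}}


# Placeholder id, not a real message
body = message_id_query("<example-id@example.invalid>")
print(json.dumps(body))
```

A term query only makes sense here because the field is not analyzed; on
an analyzed field the stored tokens would not match the full id.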

> John
>
>
>>
>> >>
>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something
>> >> > building up the query a bit differently.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >>
>>
