Re: [Dovecot] New full text search indexer

2007-11-14 Thread Timo Sirainen

On 14.11.2007, at 17.20, Adam McDougall wrote:

Thanks for this list of steps, I've been intending to test it and  
was just about
getting ready to ask the same question.  Your email would be mice  
content to

throw on the dovecot wiki under fts (currently empty)


I wrote something there now. Also about Squat's background and how it  
works and Lucene.




PGP.sig
Description: This is a digitally signed message part


Re: [Dovecot] New full text search indexer

2007-11-14 Thread Asheesh Laroia

On Wed, 14 Nov 2007, Adam McDougall wrote:

Thanks for this list of steps, I've been intending to test it and was 
just about getting ready to ask the same question.  Your email would be 
mice content to throw on the dovecot wiki under fts (currently empty)


Good point, but you do it, it's bedtime here in Japan. (-:

-- Asheesh.

--
One good reason why computers can do more work than people is that they
never have to stop and answer the phone.


Re: [Dovecot] New full text search indexer

2007-11-14 Thread Adam McDougall
On Thu, Nov 15, 2007 at 12:07:15AM +0900, Asheesh Laroia wrote:

  On Wed, 14 Nov 2007, Daniel Watts wrote:
  
>> Timo - we were just having a conversation about how we might be able to 
>> provide full body indexed search for our clients and I realised it might 
>> be worth checking the Dovecot list to see if this has been done already...
>> 
>> And then I find this thread!
>> 
>> What is the latest? Searches in the headers work fantastically but any 
>> search in a sizeable folder (1000's messages) often just time out.
  
  FWIW in 1.1beta releases, you can enable the "squat" plugin and get 
  full-text indexed search already.
  
  Timo's working on improvements, as you can see, and a replacement, but 
  don't forget that Squat is real and pretty fast.
  
  To enable it, do:
  
  1. In the "protocol imap {" section of the config, add this line to the 
  end:
  
 mail_plugins = fts fts_squat
  
  Do that just before the close curly brace.
  
  2. If you have a "plugin {" section of the dovecot.conf, do 2(a). 
  Otherwise, do 2(b).
  
  2a. In the "plugin {" section, add this line to the end:
  
fts = squat
  
  2b. At the end of the file, write this:
  
plugin {
fts = squat
}
  
  That should be all!  SEARCH TEXT will be, shall we say, accelerated. (Note 
  that indexes have to be generated sometime, so the way things are now 
  they're generated at the first SEARCH TEXT.)
  
  -- Asheesh.
  
Thanks for this list of steps, I've been intending to test it and was just about
getting ready to ask the same question.  Your email would be mice content to
throw on the dovecot wiki under fts (currently empty)


Re: [Dovecot] New full text search indexer

2007-11-14 Thread Asheesh Laroia

On Wed, 14 Nov 2007, Daniel Watts wrote:

Timo - we were just having a conversation about how we might be able to 
provide full body indexed search for our clients and I realised it might 
be worth checking the Dovecot list to see if this has been done 
already...


And then I find this thread!

What is the latest? Searches in the headers work fantastically but any 
search in a sizeable folder (1000's messages) often just time out.


FWIW in 1.1beta releases, you can enable the "squat" plugin and get 
full-text indexed search already.


Timo's working on improvements, as you can see, and a replacement, but 
don't forget that Squat is real and pretty fast.


To enable it, do:

1. In the "protocol imap {" section of the config, add this line to the 
end:


 mail_plugins = fts fts_squat

Do that just before the close curly brace.

2. If you have a "plugin {" section of the dovecot.conf, do 2(a). 
Otherwise, do 2(b).


2a. In the "plugin {" section, add this line to the end:

  fts = squat

2b. At the end of the file, write this:

  plugin {
  fts = squat
  }

That should be all!  SEARCH TEXT will be, shall we say, accelerated. 
(Note that indexes have to be generated sometime, so the way things are 
now they're generated at the first SEARCH TEXT.)


-- Asheesh.

--
Is it clean in other dimensions?


Re: [Dovecot] New full text search indexer

2007-11-14 Thread Daniel Watts

Timo Sirainen wrote:

On Fri, 2007-04-06 at 00:34 +0300, Timo Sirainen wrote:

Squat-like 4 byte substrings (but can answer 1-3 char queries also):


Indexing a 1,4GB Linux kernel mailing list mbox with 367919 messages:

UID count: 367919
Index time: 129.86 CPU seconds (10.43MB/CPUs), 132.47 seconds
(10.23MB/s)
Memory: UID list 143162 kB, heap total 263996 kB
Node space: 9182910 bytes (8967 kB) in 47989 nodes
UID space: 357344706 bytes (348969 kB) in 7404211 lists
Total space: 366527616 / 1420611590 bytes = 25.80%

So the process takes 260MB of memory in the end. I think that's OK. If I
removed memory limits, it would use 1,4GB of memory and index file size
would drop to 17,96%. That could be achieved also by post-processing the
indexes.

gzip compression makes the uidlist still 25% smaller (total space
19,50%). It'd have to be used to compress the file in smaller blocks
because zlib doesn't support quickly seeking inside the file. That would
probably waste some space. I don't think it's worth the extra CPU time
and complexity.

Currently the uidlists are stored either as "packed UID ranges" or
bitmasks, whichever takes less space. I haven't yet tried how much space
could be saved by using a combination of these within uidlists.

I'd still have to figure out how to do incremental updates without
wasting a lot of space. And support the actual searching. :)

http://dovecot.org/tmp/new-indexer.c updated.



Timo - we were just  having a conversation about how we might be able to 
provide full body indexed search for our clients and I realised it might 
be worth checking the Dovecot list to see if this has been done already...


And then I find this thread!

What is the latest? Searches in the headers work fantastically but any 
search in a sizeable folder (1000's messages) often just time out.


Would love to offer a full text search so let me know how far you have 
gotten.


Will the search be able to offer phrase searching? Eg "Search for this 
string" as an exact match?


Fantastic - very happy to see you're working on adding this.

Daniel



Re: [Dovecot] New full text search indexer

2007-04-13 Thread Ben Winslow
On Fri, 13 Apr 2007 14:07:58 +0300
Timo Sirainen <[EMAIL PROTECTED]> wrote:

> gzip compression makes the uidlist still 25% smaller (total space
> 19,50%). It'd have to be used to compress the file in smaller blocks
> because zlib doesn't support quickly seeking inside the file. That would
> probably waste some space. I don't think it's worth the extra CPU time
> and complexity.

If you want to try to glue in another implementation as a quick test,
dictd (http://sourceforge.net/projects/dict) has a seekable gzip
implementation (that produces reverse-compatible files to boot.)  I
haven't looked at the actual code to see how hairy it is or how
tightly it's intertwined with the rest of dictd, though.

The manual page for dictzip has some details on the implementation.

--
Ben Winslow <[EMAIL PROTECTED]>


pgpbzoHlsVIaO.pgp
Description: PGP signature


Re: [Dovecot] New full text search indexer

2007-04-13 Thread Timo Sirainen
On Fri, 2007-04-06 at 00:34 +0300, Timo Sirainen wrote:
> Squat-like 4 byte substrings (but can answer 1-3 char queries also):

Indexing a 1,4GB Linux kernel mailing list mbox with 367919 messages:

UID count: 367919
Index time: 129.86 CPU seconds (10.43MB/CPUs), 132.47 seconds
(10.23MB/s)
Memory: UID list 143162 kB, heap total 263996 kB
Node space: 9182910 bytes (8967 kB) in 47989 nodes
UID space: 357344706 bytes (348969 kB) in 7404211 lists
Total space: 366527616 / 1420611590 bytes = 25.80%

So the process takes 260MB of memory in the end. I think that's OK. If I
removed memory limits, it would use 1,4GB of memory and index file size
would drop to 17,96%. That could be achieved also by post-processing the
indexes.

gzip compression makes the uidlist still 25% smaller (total space
19,50%). It'd have to be used to compress the file in smaller blocks
because zlib doesn't support quickly seeking inside the file. That would
probably waste some space. I don't think it's worth the extra CPU time
and complexity.

Currently the uidlists are stored either as "packed UID ranges" or
bitmasks, whichever takes less space. I haven't yet tried how much space
could be saved by using a combination of these within uidlists.

I'd still have to figure out how to do incremental updates without
wasting a lot of space. And support the actual searching. :)

http://dovecot.org/tmp/new-indexer.c updated.



signature.asc
Description: This is a digitally signed message part


Re: [Dovecot] New full text search indexer

2007-04-06 Thread Charles Marcus

Timo Sirainen wrote:

As described earlier
(http://dovecot.org/list/dovecot/2006-December/018055.html), Dovecot
nowadays has full text search indexing support in CVS HEAD.





So it takes somewhat more space, but definitely less than having both
Squat + Lucene. 


No substring indexing, words up to 10 chars (Lucene replacement):

Index time: 1.39 CPU seconds (22.28MB/CPUs)
Node memory: 3297663 bytes (3220 kB) in 548540 nodes
UID memory: 1803735 bytes (1761 kB) in 213554 lists
Total: 5101398 / 32474899 bytes = 15.71%

Is there a need for Lucene anymore with this kind of a speed and disk
space usage? :)




Wow - as usual impressive work, Timo! I wish I was skilled enough to 
help you stress test this, but alas, I'm not... I'll have to just wait 
and rely on those more capable. But it looks to make a huge difference 
in Clients that use the new 'Virtual Folder' technology, where the 
folders are based on multiple search criteria. I manage one Courier-imap 
server (am still trying to convince the boss to let me switch to 
dovecot, but he wants to wait until 1.0 has been released and out for a 
while, and on that server, people that have lots of messages in a single 
folder (they use Thunderbird) and create simple Virtual Folders - well, 
lets just say that they can't use them. Clicking on the Virtual Folder 
always takes forever and more often than not results in a time-out error.


Dovecot is already better than Courier in this respect, but it sounds 
like this will make it even better...


Thanks again!

--

Best regards,

Charles


Re: [Dovecot] New full text search indexer

2007-04-06 Thread Timo Sirainen
On Fri, 2007-04-06 at 10:42 +0200, DINH Viêt Hoà wrote:
> An other problem with squat is that we can't remove items from the
> index. (the version of Cyrus). Is that still the case ?

No. Dovecot's Squat is almost completely different from Cyrus. I just
kept the name because the basic ideas are the same. So Dovecot already
allows incremental updates and it can remove expunged messages.



signature.asc
Description: This is a digitally signed message part


Re: [Dovecot] New full text search indexer

2007-04-06 Thread DINH Viêt Hoà

On 4/5/07, Timo Sirainen <[EMAIL PROTECTED]> wrote:

As described earlier
(http://dovecot.org/list/dovecot/2006-December/018055.html), Dovecot
nowadays has full text search indexing support in CVS HEAD.

Currently there are two backends: Lucene and Squat. Lucene's problem is
that standard IMAP SEARCH command can't be used with it without breaking
IMAP RFC. So Lucene can be used only with non-standard X-BODY-FAST and
X-TEXT-FAST search commands.

Squat's code then is quite complex and it's pretty slow. It works with 4
character blocks, so it can answer:

 - Search string < 4 characters: Can't answer anything
 - Search string = 4 characters: Definite yes/no
 - Search string > 4 characters: Probable yes / definite no

Since search strings are usually more than 4 characters, Dovecot uses
the index only to filter out messages which definitely won't match and
then does regular searching for the non-filtered messages. It works ok,
but it could work better.

The Squat indexes take maybe 30-50% of the mailbox's size. Most of the
space goes to UID lists (list of messages matching each 4 character
block). For example one UID list could be "2,4-10,20-30". This is
already stored in a compressed format by storing the UID numbers
relative to previous UID, which allows storing them usually with a
single byte per UID, even if the UIDs are large.

So, this morning I had an idea how this could maybe be improved with two
changes:

1. Don't use 4 character blocks. Instead allow variable sized paths, and
keep track of the UIDs for each node in the path. So for example adding
"abc" to UID 1 will place the UID 1 to node "a", "ab" and "abc".

2. Use UIDs relative to parent's UIDs. For example if node "a" has UIDs
1,3,5 and "ab" is added for UID 5, the UID is changed to 2 (3th in 1,3,5
list - 1). This can help the compression a lot. The best case scenario
for the above example is if "ab" contains also 1,3,5 UIDs. Then it can
be written with a single 0-2 range.

With these changes, the indexer can be configured to perform in
different ways. Text can be indexed in two ways: full words, and
substrings within those words (as required by IMAP). I think a useful
combination of these would be to index all 4 character substrings and
all full words up to 10 characters or so. This kind of an index can
answer:

 - Search string <= 4 characters: Definite yes/no
 - For X-BODY-FAST / X-TEXT-FAST:
   - Search string 5-10 characters: Definite yes/no
   - Search string 11+ characters: Probable yes / definite no
 - For standard IMAP SEARCH:
   - Search string 5-10 characters: Definite yes if message contains the
key as non-substring, otherwise same as below. This helps only if user
does a search that matches a lot of messages.
   - Search string 11+ characters: Probable yes / definite no

UTF-8 text is indexed as bytes, but the path depth can be counted as
characters instead of as bytes.

I've a test program finished, and unless there are some bugs left I
think it shows that this can work pretty well:

Squat-like 4 byte substrings (but can answer 1-3 char queries also):

Message count: 7495
Index time: 4.19 CPU seconds (7.39MB/CPUs)
Node memory: 3014072 bytes (2943 kB) in 90550 nodes
UID memory: 8277212 bytes (8083 kB) in 352778 lists
Total: 11291284 / 32474899 bytes = 34.77%

Indexing the same mailbox file with squat-test gives:

 - Index time: 9.55 CPU seconds, 9.63 real seconds (3.24MB/CPUs)
 - 4145192 bytes in 102208 nodes (12.76%)
 - 7286058 bytes in 305501 uid_lists (22.44%)
 - 11431250 bytes total of 32474899 (35.20%)

So the new indexer is over twice as fast and uses slightly less space.

Indexing 4 char substrings and words up to 10 chars:

Index time: 5.13 CPU seconds (6.04MB/CPUs)
Node memory: 6002804 bytes (5862 kB) in 615501 nodes
UID memory: 9044610 bytes (8832 kB) in 519488 lists
Total: 15047414 / 32474899 bytes = 46.34%

So it takes somewhat more space, but definitely less than having both
Squat + Lucene.

No substring indexing, words up to 10 chars (Lucene replacement):

Index time: 1.39 CPU seconds (22.28MB/CPUs)
Node memory: 3297663 bytes (3220 kB) in 548540 nodes
UID memory: 1803735 bytes (1761 kB) in 213554 lists
Total: 5101398 / 32474899 bytes = 15.71%

Is there a need for Lucene anymore with this kind of a speed and disk
space usage? :)

The above tests are run with everything being indexed except space and
control chars (ie. practically space + tab). If only alphanumeric
characters are indexed, the total lines are (in same order as above):

Total: 7048664 / 32474899 bytes = 21.70% (-13.07%)
Total: 9610543 / 32474899 bytes = 29.59% (-16.75%)
Total: 3963335 / 32474899 bytes = 12.20% (-3.51%)

There are probably still a few optimizations that can be done to get the
disk space usage even lower. I did spent quite a lot of time making it
use less CPU though. Even tried a bit different coding style variations
in the critical functions to get gcc to create faster code..

One interesting thing I noticed was that binary searching a ch

[Dovecot] New full text search indexer

2007-04-05 Thread Timo Sirainen
As described earlier
(http://dovecot.org/list/dovecot/2006-December/018055.html), Dovecot
nowadays has full text search indexing support in CVS HEAD.

Currently there are two backends: Lucene and Squat. Lucene's problem is
that standard IMAP SEARCH command can't be used with it without breaking
IMAP RFC. So Lucene can be used only with non-standard X-BODY-FAST and
X-TEXT-FAST search commands.

Squat's code then is quite complex and it's pretty slow. It works with 4
character blocks, so it can answer:

 - Search string < 4 characters: Can't answer anything
 - Search string = 4 characters: Definite yes/no
 - Search string > 4 characters: Probable yes / definite no

Since search strings are usually more than 4 characters, Dovecot uses
the index only to filter out messages which definitely won't match and
then does regular searching for the non-filtered messages. It works ok,
but it could work better.

The Squat indexes take maybe 30-50% of the mailbox's size. Most of the
space goes to UID lists (list of messages matching each 4 character
block). For example one UID list could be "2,4-10,20-30". This is
already stored in a compressed format by storing the UID numbers
relative to previous UID, which allows storing them usually with a
single byte per UID, even if the UIDs are large.

So, this morning I had an idea how this could maybe be improved with two
changes:

1. Don't use 4 character blocks. Instead allow variable sized paths, and
keep track of the UIDs for each node in the path. So for example adding
"abc" to UID 1 will place the UID 1 to node "a", "ab" and "abc".

2. Use UIDs relative to parent's UIDs. For example if node "a" has UIDs
1,3,5 and "ab" is added for UID 5, the UID is changed to 2 (3th in 1,3,5
list - 1). This can help the compression a lot. The best case scenario
for the above example is if "ab" contains also 1,3,5 UIDs. Then it can
be written with a single 0-2 range.

With these changes, the indexer can be configured to perform in
different ways. Text can be indexed in two ways: full words, and
substrings within those words (as required by IMAP). I think a useful
combination of these would be to index all 4 character substrings and
all full words up to 10 characters or so. This kind of an index can
answer:

 - Search string <= 4 characters: Definite yes/no
 - For X-BODY-FAST / X-TEXT-FAST:
   - Search string 5-10 characters: Definite yes/no
   - Search string 11+ characters: Probable yes / definite no
 - For standard IMAP SEARCH:
   - Search string 5-10 characters: Definite yes if message contains the
key as non-substring, otherwise same as below. This helps only if user
does a search that matches a lot of messages.
   - Search string 11+ characters: Probable yes / definite no

UTF-8 text is indexed as bytes, but the path depth can be counted as
characters instead of as bytes.

I've a test program finished, and unless there are some bugs left I
think it shows that this can work pretty well:

Squat-like 4 byte substrings (but can answer 1-3 char queries also):

Message count: 7495
Index time: 4.19 CPU seconds (7.39MB/CPUs)
Node memory: 3014072 bytes (2943 kB) in 90550 nodes
UID memory: 8277212 bytes (8083 kB) in 352778 lists
Total: 11291284 / 32474899 bytes = 34.77%

Indexing the same mailbox file with squat-test gives:

 - Index time: 9.55 CPU seconds, 9.63 real seconds (3.24MB/CPUs)
 - 4145192 bytes in 102208 nodes (12.76%)
 - 7286058 bytes in 305501 uid_lists (22.44%)
 - 11431250 bytes total of 32474899 (35.20%)

So the new indexer is over twice as fast and uses slightly less space.

Indexing 4 char substrings and words up to 10 chars:

Index time: 5.13 CPU seconds (6.04MB/CPUs)
Node memory: 6002804 bytes (5862 kB) in 615501 nodes
UID memory: 9044610 bytes (8832 kB) in 519488 lists
Total: 15047414 / 32474899 bytes = 46.34%

So it takes somewhat more space, but definitely less than having both
Squat + Lucene. 

No substring indexing, words up to 10 chars (Lucene replacement):

Index time: 1.39 CPU seconds (22.28MB/CPUs)
Node memory: 3297663 bytes (3220 kB) in 548540 nodes
UID memory: 1803735 bytes (1761 kB) in 213554 lists
Total: 5101398 / 32474899 bytes = 15.71%

Is there a need for Lucene anymore with this kind of a speed and disk
space usage? :)

The above tests are run with everything being indexed except space and
control chars (ie. practically space + tab). If only alphanumeric
characters are indexed, the total lines are (in same order as above):

Total: 7048664 / 32474899 bytes = 21.70% (-13.07%)
Total: 9610543 / 32474899 bytes = 29.59% (-16.75%)
Total: 3963335 / 32474899 bytes = 12.20% (-3.51%)

There are probably still a few optimizations that can be done to get the
disk space usage even lower. I did spent quite a lot of time making it
use less CPU though. Even tried a bit different coding style variations
in the critical functions to get gcc to create faster code..

One interesting thing I noticed was that binary searching a character
from a char array was slower than just sequen