Re: [Dovecot] FTS Plugin design

2009-09-09 Thread Timo Sirainen
On Tue, 2009-09-08 at 15:47 +0100, rui.carne...@portugalmail.net wrote:
> Now I am trying to find a way to know the mime part id of the parts  
> used on fts_build_mail. Is that already possible or I need to do that  
> by my own?

If you already get the MIME structure, then I guess you have struct
message_parts. You could do it similar to the way IMAP does things:

   1          TEXT/PLAIN
   2          APPLICATION/OCTET-STREAM
   3          MESSAGE/RFC822
   3.HEADER   ([RFC-2822] header of the message)
   3.TEXT     ([RFC-2822] text body of the message) MULTIPART/MIXED
   3.1        TEXT/PLAIN
   3.2        APPLICATION/OCTET-STREAM
   4          MULTIPART/MIXED
   4.1        IMAGE/GIF
   4.1.MIME   ([MIME-IMB] header for the IMAGE/GIF)
   4.2        MESSAGE/RFC822
   4.2.HEADER ([RFC-2822] header of the message)
   4.2.TEXT   ([RFC-2822] text body of the message) MULTIPART/MIXED
   4.2.1      TEXT/PLAIN
   4.2.2      MULTIPART/ALTERNATIVE
   4.2.2.1    TEXT/PLAIN
   4.2.2.2    TEXT/RICHTEXT

So you get the root message_part. Its first child is "1", second child
"2", 3rd child "3", 3rd child's first child "3.1", etc.

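The numbering rule described above (child position at each level, joined with dots) can be sketched roughly as follows. This is only an illustration: `struct part` is a hypothetical, simplified stand-in for Dovecot's struct message_part that models just the tree links, and `part_index()` and `part_section()` are made-up helper names. It follows the simplified description in this message and does not handle the extra rules for embedded message/rfc822 parts.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for Dovecot's struct message_part:
   only the tree links are modelled here. */
struct part {
	struct part *parent;
	struct part *next;	/* next sibling */
	struct part *children;	/* first child */
};

/* Return the 1-based position of p among its parent's children.
   p must not be the root. */
static unsigned int part_index(const struct part *p)
{
	const struct part *s;
	unsigned int idx = 1;

	for (s = p->parent->children; s != p; s = s->next)
		idx++;
	return idx;
}

/* Write the dotted section number of p into buf ("" for the root).
   size must be at least 1. */
static void part_section(const struct part *p, char *buf, size_t size)
{
	size_t len;

	if (p->parent == NULL) {
		buf[0] = '\0';
		return;
	}
	/* First write the parent's number, then append our own index. */
	part_section(p->parent, buf, size);
	len = strlen(buf);
	snprintf(buf + len, size - len, "%s%u",
		 len > 0 ? "." : "", part_index(p));
}
```

Recursing towards the root first means the number is built left to right without reversing anything; the cost is one walk over the siblings per level.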

signature.asc
Description: This is a digitally signed message part


Re: [Dovecot] FTS Plugin design

2009-09-08 Thread rui . carneiro

Hi again!

After some time using my changes to this plugin I found one major 
problem. When a message has two attachments with the same name, or one 
with a content type matching "message/*", my Solr schema design does not 
work, because I used the attachment's name as its unique identifier, 
which is not correct.


Now I am trying to find a way to get the MIME part ID of the parts 
used in fts_build_mail(). Is that already possible, or do I need to do 
that on my own?


Thank you in advance,
Rui Carneiro


Re: [Dovecot] FTS Plugin design

2009-05-26 Thread Rui Carneiro
Quoting Timo Sirainen:
> 
> At least for now. Memory leaks don't cause crashes.

Ok.

> gdb -p `pidof imap`
> cont
> 
> bt full

I think it won't be necessary. It is not crashing anymore. Maybe it was a bug 
in my code.

Tomorrow (or the day after) I will send you the code.

Thank you for all the support!

Regards,
Rui Carneiro
-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-26 Thread Timo Sirainen

On May 26, 2009, at 5:46 AM, Rui Carneiro wrote:

> Quoting Timo Sirainen:
>> So valgrind didn't find anything wrong.
>
> We should ignore LEAK SUMMARY?

At least for now. Memory leaks don't cause crashes.

>> What does gdb show as the backtrace?
>
> My gdb is not writing where it should (or not writing at all).
> Shouldn't this be enough?
>
> mail_executable = /usr/local/libexec/dovecot/gdbhelper /usr/local/libexec/dovecot/imap

It's not writing /tmp/gdbhelper* files when crashing? Anyway there's
also one guaranteed way to get a backtrace. Remove the gdbhelper and
then run:

gdb -p `pidof imap`
cont

bt full



Re: [Dovecot] FTS Plugin design

2009-05-26 Thread Rui Carneiro
Quoting Timo Sirainen:
> So valgrind didn't find anything wrong. 

Should we ignore the LEAK SUMMARY?

> What does gdb show as the backtrace?

My gdb is not writing where it should (or not writing at all). Shouldn't this 
be enough?

mail_executable = /usr/local/libexec/dovecot/gdbhelper /usr/local/libexec/dovecot/imap

The crash occurs after everything has been indexed, while imap is returning the result.

Thank you,
Rui Carneiro
-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-25 Thread Timo Sirainen
On Mon, 2009-05-25 at 14:20 +0100, Rui Carneiro wrote:
> Citando Timo Sirainen :
> > I guess it works around some other bug then. If it's a memory-related
> > bug you could also see if valgrind complains something:
> > 
> > protocol imap {
> >   ..
> >   mail_executable = /usr/bin/valgrind /usr/local/libexec/dovecot/imap
> > }
> 
> Here is the output (I cloned the http://hg.dovecot.org/dovecot-1.2 and made 
> no changes to this test):
> 
> ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 123 from 2)

So valgrind didn't find anything wrong. What does gdb show as the
backtrace?





Re: [Dovecot] FTS Plugin design

2009-05-25 Thread Rui Carneiro
Quoting Timo Sirainen:
> I guess it works around some other bug then. If it's a memory-related
> bug you could also see if valgrind complains something:
> 
> protocol imap {
>   ..
>   mail_executable = /usr/bin/valgrind /usr/local/libexec/dovecot/imap
> }

Here is the output (I cloned http://hg.dovecot.org/dovecot-1.2 and made no 
changes for this test):

ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 123 from 2)
malloc/free: in use at exit: 94,040 bytes in 1,032 blocks.
malloc/free: 1,704 allocs, 672 frees, 1,042,476 bytes allocated.
For counts of detected errors, rerun with: -v
searching for pointers to 1,032 not-freed blocks.
checked 111,072 bytes.

88,161 (328 direct, 87,833 indirect) bytes in 1 blocks are definitely lost in loss record 30 of 45
   at 0x4C24384: calloc (vg_replace_malloc.c:397)
   by 0x4AF165: pool_system_malloc (mempool-system.c:77)
   by 0x63E0DA2: ???
   by 0x63DF91D: ???
   by 0x5DBAF27: ???
   by 0x5DBBE50: ???
   by 0x46BBFF: mailbox_transaction_begin (mail-storage.c:794)
   by 0x42976F: imap_search_start (imap-search.c:540)
   by 0x4206D7: cmd_search (cmd-search.c:50)
   by 0x4232CB: client_command_input (client.c:608)
   by 0x423389: client_command_input (client.c:657)
   by 0x4239F4: client_handle_input (client.c:698)

LEAK SUMMARY:
   definitely lost: 328 bytes in 1 blocks.
   indirectly lost: 87,833 bytes in 1,016 blocks.
     possibly lost: 0 bytes in 0 blocks.
   still reachable: 5,879 bytes in 15 blocks.
        suppressed: 0 bytes in 0 blocks.
Reachable blocks (those to which a pointer was found) are not shown.
To see them, rerun with: --leak-check=full --show-reachable=yes


Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Timo Sirainen
On Fri, 2009-05-22 at 18:57 +0100, Rui Carneiro wrote:
> Citando Timo Sirainen :
> > The problem is something else. The Solr code simply tries to keep the
> > send buffer smaller than that, nothing would break if you sent a larger
> > buffer. Show gdb backtrace of the crash?
> > 
> 
> I said it was from the buff size because when I increased it Dovecot didn't 
> crash. 

I guess it works around some other bug then. If it's a memory-related
bug you could also see if valgrind complains about something:

protocol imap {
  ..
  mail_executable = /usr/bin/valgrind /usr/local/libexec/dovecot/imap
}





Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Rui Carneiro
Quoting Timo Sirainen:
> The problem is something else. The Solr code simply tries to keep the
> send buffer smaller than that, nothing would break if you sent a larger
> buffer. Show gdb backtrace of the crash?
> 

I said it was caused by the buffer size because when I increased it Dovecot 
didn't crash. 

It's Friday and I will not be able to do the gdb backtrace over the weekend, but 
it will be the first thing I do on Monday morning.

Regards,
Rui Carneiro
-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Timo Sirainen
On Fri, 2009-05-22 at 18:24 +0100, Rui Carneiro wrote:
> Hi Timo,
> 
> I almost finish the changes on fts plugin. By now, it seems to work fine with 
> attachments (extracting and sending them to Solr). I only have a problem with 
> the max size of the command (cmd) that we can send to Solr:
> 
> #define SOLR_CMDBUF_SIZE (1024*64)
> 
> By now, if we send some message bigger than this value the fts-plugin crash.

The problem is something else. The Solr code simply tries to keep the
send buffer smaller than that, nothing would break if you sent a larger
buffer. Show gdb backtrace of the crash?
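To illustrate the point that the constant is only a soft flush threshold and not a hard limit, here is a minimal sketch of such a buffer. It is not the actual fts-solr code: `struct cmdbuf`, `cmdbuf_append()` and `cmdbuf_flush()` are hypothetical names, and the flush only counts calls instead of sending anything to Solr.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CMDBUF_SIZE (1024*64)	/* same value as SOLR_CMDBUF_SIZE above */

struct cmdbuf {
	char *data;
	size_t used, alloc;
	unsigned int flushes;	/* how many times the buffer was sent */
};

/* Hypothetical stand-in for posting the buffered data to Solr. */
static void cmdbuf_flush(struct cmdbuf *buf)
{
	if (buf->used == 0)
		return;
	buf->flushes++;
	buf->used = 0;
}

/* Append data of any size. The buffer grows as needed, so a block
   larger than CMDBUF_SIZE is fine; the threshold only decides when
   the accumulated data gets flushed. */
static void cmdbuf_append(struct cmdbuf *buf, const void *data, size_t size)
{
	if (buf->used + size > buf->alloc) {
		size_t new_alloc = buf->used + size;
		char *p = realloc(buf->data, new_alloc);

		if (p == NULL)
			abort();
		buf->data = p;
		buf->alloc = new_alloc;
	}
	memcpy(buf->data + buf->used, data, size);
	buf->used += size;
	if (buf->used >= CMDBUF_SIZE)
		cmdbuf_flush(buf);
}
```

With a design like this a crash on large messages would point at something other than the threshold itself, which is why a backtrace is the next thing to look at.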





Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Rui Carneiro
Hi Timo,

I have almost finished the changes to the fts plugin. So far, it seems to work 
fine with attachments (extracting and sending them to Solr). I only have a 
problem with the maximum size of the command (cmd) that we can send to Solr:

#define SOLR_CMDBUF_SIZE (1024*64)

Currently, if we send a message bigger than this value, the fts plugin crashes.

Is there anything in your TODO list that solves this problem?

Regards,
Rui Carneiro

PS: I will send you my code for your approval as soon as possible :)

-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-20 Thread Rui Carneiro
Now, with the attachment.

/* Copyright (c) 2006-2009 Dovecot authors, see the included COPYING file */

#include "lib.h"
#include "buffer.h"
#include "base64.h"
#include "str.h"
#include "unichar.h"
#include "charset-utf8.h"
#include "quoted-printable.h"
#include "rfc822-parser.h"
#include "rfc2231-parser.h"
#include "message-parser.h"
#include "message-header-decode.h"
#include "message-decoder.h"

enum content_type {
	CONTENT_TYPE_UNKNOWN = 0,
	CONTENT_TYPE_BINARY,
	CONTENT_TYPE_QP,
	CONTENT_TYPE_BASE64
};

/* base64 takes max 4 bytes per character, q-p takes max 3. */
#define MAX_ENCODING_BUF_SIZE 3

/* UTF-8 takes max 5 bytes per character. Not sure about others, but I'd think
   10 is more than enough for everyone.. */
#define MAX_TRANSLATION_BUF_SIZE 10

struct message_decoder_context {
	enum message_decoder_flags flags;
	struct message_part *prev_part;

	struct message_header_line hdr;
	buffer_t *buf, *buf2;

	char *charset_trans_charset;
	struct charset_translation *charset_trans;
	char translation_buf[MAX_TRANSLATION_BUF_SIZE];
	unsigned int translation_size;

	char encoding_buf[MAX_ENCODING_BUF_SIZE];
	unsigned int encoding_size;

	char *content_charset;
	enum content_type content_type;

	unsigned int charset_utf8:1;
	unsigned int binary_input:1;
};

struct message_decoder_context *
message_decoder_init(enum message_decoder_flags flags)
{
	struct message_decoder_context *ctx;

	ctx = i_new(struct message_decoder_context, 1);
	ctx->flags = flags;
	ctx->buf = buffer_create_dynamic(default_pool, 8192);
	ctx->buf2 = buffer_create_dynamic(default_pool, 8192);
	return ctx;
}

void message_decoder_deinit(struct message_decoder_context **_ctx)
{
	struct message_decoder_context *ctx = *_ctx;

	*_ctx = NULL;

	if (ctx->charset_trans != NULL)
		charset_to_utf8_end(&ctx->charset_trans);

	buffer_free(&ctx->buf);
	buffer_free(&ctx->buf2);
	i_free(ctx->charset_trans_charset);
	i_free(ctx->content_charset);
	i_free(ctx);
}

static void
parse_content_transfer_encoding(struct message_decoder_context *ctx,
				struct message_header_line *hdr)
{
	struct rfc822_parser_context parser;
	string_t *value;

	value = t_str_new(64);
	rfc822_parser_init(&parser, hdr->full_value, hdr->full_value_len, NULL);

	(void)rfc822_skip_lwsp(&parser);
	(void)rfc822_parse_mime_token(&parser, value);

	ctx->content_type = CONTENT_TYPE_UNKNOWN;
	switch (str_len(value)) {
	case 4:
		if (i_memcasecmp(str_data(value), "7bit", 4) == 0 ||
		    i_memcasecmp(str_data(value), "8bit", 4) == 0)
			ctx->content_type = CONTENT_TYPE_BINARY;
		break;
	case 6:
		if (i_memcasecmp(str_data(value), "base64", 6) == 0)
			ctx->content_type = CONTENT_TYPE_BASE64;
		else if (i_memcasecmp(str_data(value), "binary", 6) == 0)
			ctx->content_type = CONTENT_TYPE_BINARY;
		break;
	case 16:
		if (i_memcasecmp(str_data(value), "quoted-printable", 16) == 0)
			ctx->content_type = CONTENT_TYPE_QP;
		break;
	}
}

static void
parse_content_type(struct message_decoder_context *ctx,
		   struct message_header_line *hdr)
{
	struct rfc822_parser_context parser;
	const char *const *results;
	string_t *str;

	if (ctx->content_charset != NULL)
		return;

	rfc822_parser_init(&parser, hdr->full_value, hdr->full_value_len, NULL);
	(void)rfc822_skip_lwsp(&parser);
	str = t_str_new(64);
	if (rfc822_parse_content_type(&parser, str) <= 0)
		return;

	(void)rfc2231_parse(&parser, &results);
	for (; *results != NULL; results += 2) {
		if (strcasecmp(results[0], "charset") == 0) {
			ctx->content_charset = i_strdup(results[1]);
			ctx->charset_utf8 = charset_is_utf8(results[1]);
			break;
		}
	}
}

static bool message_decode_header(struct message_decoder_context *ctx,
				  struct message_header_line *hdr,
				  struct message_block *output)
{
	bool dtcase = (ctx->flags & MESSAGE_DECODER_FLAG_DTCASE) != 0;
	size_t value_len;

	if (hdr->continues) {
		hdr->use_full_value = TRUE;
		return FALSE;
	}

	T_BEGIN {
		if (hdr->name_len == 12 &&
		    strcasecmp(hdr->name, "Content-Type") == 0)
			parse_content_type(ctx, hdr);
		if (hdr->name_len == 25 &&
		    strcasecmp(hdr->name, "Content-Transfer-Encoding") == 0)
			parse_content_transfer_encoding(ctx, hdr);
	} T_END;

	buffer_set_used_size(ctx->buf, 0);
	message_header_decode_utf8(hdr->full_value, hdr->full_value_len,
				   ctx->buf, dtcase);
	value_len = ctx->buf->used;

	if (dtcase) {
		(void)uni_utf8_to_decomposed_titlecase(hdr->name, hdr->name_len,
						       ctx->buf);
		buffer_append_c(ctx->buf, '\0');
	}

	ctx->hdr = *hdr;
	ctx->hdr.full_value = ctx->buf->data;
	ctx->hdr.full_value_len = value_len;
	ctx->hdr.value_len = 0;
	if (dtcase) {
		ctx->hdr.name = CONST_PTR_OFFSET(ctx->buf->data,
						 ctx->hdr.full_value_len);
		ctx->hdr.name_len = ctx->buf->used - 1 - value_len;
	}

	output->hdr = &ctx->hdr;
	return TRUE;
}

static void translation_buf_decode(struct message_decoder_context *ctx,
				   const unsigned char **data, size_t *size)
{
	unsigned char trans_buf[MAX_TRANSLATION_BUF_SIZE+1];
	unsigned int data_wanted, skip;
	size_t 

Re: [Dovecot] FTS Plugin design

2009-05-19 Thread Rui Carneiro
On Tue, May 19, 2009 at 8:51 PM, Timo Sirainen  wrote:

> You forgot the attachment.
>

Oh, sorry. I am not at the office now (almost 10pm here). I will send it
tomorrow morning.

Rui Carneiro
---
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-19 Thread Timo Sirainen
On Tue, 2009-05-19 at 14:40 +0100, Rui Carneiro wrote:
> > http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d
> > 
> 
> It is working now but I needed to do some changes on your code.

OK.

> Please see the attachment to checked any problem that may exist.

You forgot the attachment.





Re: [Dovecot] FTS Plugin design

2009-05-19 Thread Rui Carneiro
Quoting Timo Sirainen:
> All the data comes from lib-mail/message-decoder.c. Hmm. Looks like it
> tries to force giving only valid UTF-8 output. I guess it should have
> some flag or something that makes it do that only for text/* parts, not
> for binary parts. OK, implemented, see if it works with this and using
> the flag:
> 
> http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d
> 

It is working now, but I needed to make some changes to your code.

When you check charset_utf8 and charset_trans there is a problem in the 
attachment case. Attachment parts do not have any charset defined in their 
headers so, by default, charset_utf8=1 and charset_trans is garbage (I have no 
idea where that garbage comes from).

To avoid this problem, move the lines of code that set ctx->binary_input to 
the beginning of the function.

Please see the attachment and check for any problems that may exist.

Thank you,
Rui Carneiro
---
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Timo Sirainen
On Mon, 2009-05-18 at 17:35 +0100, Rui Carneiro wrote:

> I think binary data is being corrupted anywhere before 
> fts_backend_build_more() and I don't have any idea where.

All the data comes from lib-mail/message-decoder.c. Hmm. Looks like it
tries to force giving only valid UTF-8 output. I guess it should have
some flag or something that makes it do that only for text/* parts, not
for binary parts. OK, implemented, see if it works with this and using
the flag:

http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d
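
For context, the bytes "ef bf bd" in the corrupted dump are the UTF-8 encoding of U+FFFD, the Unicode replacement character, which is what forced UTF-8 validation typically emits for invalid input. The sketch below is not Dovecot's implementation: `utf8_seq_len()` and `utf8_sanitize()` are hypothetical helpers, the validity check is simplified, and collapsing a run of invalid bytes into a single U+FFFD is an assumption chosen because it matches the dump in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Return the length (1-4) of the UTF-8 sequence starting at p, or 0 if
   it is invalid. Simplified check: lead byte range and continuation
   bytes only; overlong forms and surrogates are not rejected. */
static size_t utf8_seq_len(const unsigned char *p, size_t left)
{
	size_t i, len;

	if (p[0] < 0x80)
		return 1;
	else if (p[0] >= 0xc2 && p[0] <= 0xdf)
		len = 2;
	else if (p[0] >= 0xe0 && p[0] <= 0xef)
		len = 3;
	else if (p[0] >= 0xf0 && p[0] <= 0xf4)
		len = 4;
	else
		return 0;
	if (len > left)
		return 0;
	for (i = 1; i < len; i++) {
		if ((p[i] & 0xc0) != 0x80)
			return 0;
	}
	return len;
}

/* Copy in[] to out[], replacing each run of invalid bytes with a single
   U+FFFD (ef bf bd). out must have room for 3 * size bytes.
   Returns the number of bytes written. */
static size_t utf8_sanitize(const unsigned char *in, size_t size,
			    unsigned char *out)
{
	static const unsigned char repl[] = { 0xef, 0xbf, 0xbd };
	size_t i = 0, o = 0, len;
	int in_bad_run = 0;

	while (i < size) {
		len = utf8_seq_len(in + i, size - i);
		if (len == 0) {
			if (!in_bad_run) {
				memcpy(out + o, repl, sizeof(repl));
				o += sizeof(repl);
			}
			in_bad_run = 1;
			i++;
		} else {
			memcpy(out + o, in + i, len);
			o += len;
			i += len;
			in_bad_run = 0;
		}
	}
	return o;
}
```

Fed the first 16 bytes of the original PDF, this turns "e2 e3 cf d3" into "ef bf bd", which is why binary parts must bypass this kind of validation.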




Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Rui Carneiro
Quoting Timo Sirainen:
> Nope. If you still see corruption, try with some simple test mails and
> see if it's adding garbage, losing contents or adding more content.

I tried something more advanced than that. I hexdumped my PDF test file, and 
the first line is:

  25 50 44 46 2d 31 2e 33  0a 25 e2 e3 cf d3 0a 31

Here "e2 e3 cf d3" is binary data. When I do the same for my copied file I get:

  25 50 44 46 2d 31 2e 33  0a 25 ef bf bd 0a 31 20

It is weird, but the binary data changed.

Further, I logged the 11th byte of the first block.data just before 
fts_backend_build_more() and the value is EF (the correct one would be E2).

I think the binary data is being corrupted somewhere before 
fts_backend_build_more(), and I don't have any idea where.

Any help would be appreciated.

Thank you,
Rui Carneiro

-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Timo Sirainen

On May 18, 2009, at 6:42 AM, Rui Carneiro wrote:

> I am having some troubles sending all data to a file. When I finish
> to send all data to a file, I tried to open it and the file is
> corrupted.
>
> The first think I noticed is that all chars are capitalized what
> destroy all the file format.
>
> Where are the chars capitalized?

Hmm. I'll see about getting it fixed in a better way, but for now you
could just change:

decoder = message_decoder_init(TRUE);

to

decoder = message_decoder_init(FALSE);

I'm thinking about making message_decoder uppercase only text/* body
parts.

> Any other idea why files are getting corrupted?

Nope. If you still see corruption, try with some simple test mails and
see if it's adding garbage, losing contents or adding more content.


Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Rui Carneiro
Hi again,

I am having some trouble writing all the data to a file. When I finish writing 
all the data, I try to open the file and it is corrupted.

The first thing I noticed is that all chars are capitalized, which destroys the 
file format.

Where are the chars capitalized?
Any other idea why files are getting corrupted?

Thank you,
Rui Carneiro
-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-15 Thread Rui Carneiro
Quoting Timo Sirainen:
> 1. You notice a non-text/* content-type and initialize text extraction
> for the MIME part. Like:
> 
> struct attachment_extract_context *
> attachment_extract_init(const char *content_type);
> 
> 2. After this you feed all the input belonging to that MIME part to:
> 
> int attachment_extract_add(struct attachment_extract_context *ctx,
> const struct message_block *input);
> 
> Don't output anything to FTS backend at this point. The
> attachment_extract_add() would probably just basically write to a
> temporary file.
> 
> 3. Finally you'll notice that the MIME part ends (either you get headers
> for the next MIME part or the entire message ends). Then finish the
> extraction, which actually executes the whatever conversion binaries:
> 
> int attachment_extract_finish(struct attachment_extract_context *ctx);
> 
> 4. Get the resulting text to fts_backend_build_more() somehow. Either
> some attachment_extract_add_to_fts() which internally adds it or some
> kind of an iterator that returns the text in smaller blocks. Either
> would work..
> 
> That kind of an API would also make it possible to pretty easily modify
> in future to not write temporary files for specific content types if
> it's not required.
> 

I tried your approach and I think it is working pretty well. Now I only need to 
look carefully at the output of the external programs and build the XML correctly 
to send to Solr.

Thanks Timo

Regards,
Rui Carneiro

-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-05-13 Thread Timo Sirainen
On Tue, 2009-05-05 at 12:08 +0100, Rui Carneiro wrote:
> >  - fts_build_mail() indexes a single mail. It parses the messages and
> > returns the data in small blocks. For text/* and message/rfc822 parts
> > those blocks are currently sent to FTS backend. This is where I think
> > you should look into hooking your attachment parsing. Change
> > fts_build_want_index_part() to look for more content-types that you're
> > interested in and then before feeding the blocks to FTS backend put them
> > through your own converter function, something like:
> >
> > int attachment_extract_text(struct attachment_extract_context *ctx,
> > const struct message_block *input, struct message_block *output);
> 
> 
> Let's take the example of an application-pdf content-type. Before I
> converter all pdf data to text I need to gather all data before. The actual
> process is feeding FTS backend with small parts of data and appending them
> on "build_more" functions (e.g. fts_backend_solr_build_more()).

Right.

> So where should I call attachment_extract_text()? In
> fts_backend_solr_build_more() and not making append to cmd until data is
> extracted? Or gather all information before (e.g. fts_build_mail()) and send
> all in once to FTS backend?

Since others already mentioned that many formats pretty much require
having the entire file available, I guess it's better to just save all
the attachments to file at some point. So if I wrote the code it would
probably work something like:

1. You notice a non-text/* content-type and initialize text extraction
for the MIME part. Like:

struct attachment_extract_context *
attachment_extract_init(const char *content_type);

2. After this you feed all the input belonging to that MIME part to:

int attachment_extract_add(struct attachment_extract_context *ctx,
const struct message_block *input);

Don't output anything to FTS backend at this point. The
attachment_extract_add() would probably just basically write to a
temporary file.

3. Finally you'll notice that the MIME part ends (either you get headers
for the next MIME part or the entire message ends). Then finish the
extraction, which actually executes whatever conversion binaries are needed:

int attachment_extract_finish(struct attachment_extract_context *ctx);

4. Get the resulting text to fts_backend_build_more() somehow. Either
some attachment_extract_add_to_fts() which internally adds it or some
kind of an iterator that returns the text in smaller blocks. Either
would work..

That kind of an API would also make it possible to pretty easily modify
in future to not write temporary files for specific content types if
it's not required.
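
As a rough illustration of steps 1 to 3, the sketch below collects a MIME part into a temporary file. The three function names come from this message; everything else is an assumption: the function bodies, the `attachment_extract_deinit()` helper, and the simplified `struct message_block`, which is only a stand-in for the real Dovecot structure. Running the converter and feeding the text to fts_backend_build_more() (step 4) are left out.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-in for Dovecot's struct message_block. */
struct message_block {
	const unsigned char *data;
	size_t size;
};

struct attachment_extract_context {
	char *content_type;
	FILE *tmp;	/* raw bytes of the MIME part collected so far */
};

/* Step 1: start collecting a non-text MIME part. */
struct attachment_extract_context *
attachment_extract_init(const char *content_type)
{
	struct attachment_extract_context *ctx;

	ctx = calloc(1, sizeof(*ctx));
	if (ctx == NULL)
		return NULL;
	ctx->content_type = malloc(strlen(content_type) + 1);
	ctx->tmp = tmpfile();
	if (ctx->content_type == NULL || ctx->tmp == NULL) {
		if (ctx->tmp != NULL)
			fclose(ctx->tmp);
		free(ctx->content_type);
		free(ctx);
		return NULL;
	}
	strcpy(ctx->content_type, content_type);
	return ctx;
}

/* Step 2: append one decoded block; nothing goes to the FTS backend yet. */
int attachment_extract_add(struct attachment_extract_context *ctx,
			   const struct message_block *input)
{
	if (fwrite(input->data, 1, input->size, ctx->tmp) != input->size)
		return -1;
	return 0;
}

/* Step 3: the MIME part is complete. A real version would run the
   converter for ctx->content_type on the file here; this sketch only
   flushes and rewinds so that the data can be read back. */
int attachment_extract_finish(struct attachment_extract_context *ctx)
{
	if (fflush(ctx->tmp) != 0)
		return -1;
	rewind(ctx->tmp);
	return 0;
}

/* Hypothetical cleanup helper, not named in the thread. */
void attachment_extract_deinit(struct attachment_extract_context **_ctx)
{
	struct attachment_extract_context *ctx = *_ctx;

	*_ctx = NULL;
	fclose(ctx->tmp);
	free(ctx->content_type);
	free(ctx);
}
```

Because callers only see the context pointer, a later version could keep small or streamable content types in memory without changing the interface.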




Re: [Dovecot] FTS Plugin design

2009-05-05 Thread Rui Carneiro
Hi again,

On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen  wrote:

>  - fts_build_mail() indexes a single mail. It parses the messages and
> returns the data in small blocks. For text/* and message/rfc822 parts
> those blocks are currently sent to FTS backend. This is where I think
> you should look into hooking your attachment parsing. Change
> fts_build_want_index_part() to look for more content-types that you're
> interested in and then before feeding the blocks to FTS backend put them
> through your own converter function, something like:
>
> int attachment_extract_text(struct attachment_extract_context *ctx,
> const struct message_block *input, struct message_block *output);


Let's take the example of an application/pdf content type. Before I can
convert the PDF data to text I need to gather all of it. The current
process feeds the FTS backend small parts of data and appends them in the
"build_more" functions (e.g. fts_backend_solr_build_more()).

So where should I call attachment_extract_text()? In
fts_backend_solr_build_more(), without appending to cmd until the data is
extracted? Or should I gather all the data earlier (e.g. in fts_build_mail())
and send it all at once to the FTS backend?

I hope I've made myself clear.

Regards,
Rui Carneiro
-- 
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-04-23 Thread tomas

On Thu, Apr 23, 2009 at 12:27:47PM +0100, rui.carne...@portugalmail.net wrote:
> On Thu, Apr 23, 2009 at 5:47 AM,  wrote:
> 
> Note that some formats might require to seek to some point in the file [1]

[...]

> I hadn't thought on that before but I think you are right. The only question 
> here is writing data to memory or hd.

Taking into account what Steffen Kaiser noted in this thread, a good old
temp file (and boxing the subprocess within robust ulimits!) seems to be
adequate. I can well imagine some crappy converter running amok :-/
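
A minimal sketch of what "boxing the subprocess within ulimits" could look like on a POSIX system, using fork(), setrlimit() and execvp(). The helper name `run_limited()` and the choice of limits (CPU time and address space) are assumptions for illustration; a real integration would also redirect stdout and pass the temporary file name to the converter.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (looked up via PATH) with a CPU time limit in seconds and
   an address space limit in bytes. Returns the child's exit status,
   127 if the program could not be executed, or -1 if fork/wait failed
   or the child was killed by a signal (e.g. after exceeding a limit). */
static int run_limited(char *const argv[], rlim_t cpu_secs, rlim_t mem_bytes)
{
	struct rlimit rl;
	pid_t pid;
	int status;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: apply the limits, then become the converter. */
		rl.rlim_cur = rl.rlim_max = cpu_secs;
		if (setrlimit(RLIMIT_CPU, &rl) < 0)
			_exit(126);
		rl.rlim_cur = rl.rlim_max = mem_bytes;
		if (setrlimit(RLIMIT_AS, &rl) < 0)
			_exit(126);
		execvp(argv[0], argv);
		_exit(127);
	}
	/* Parent: wait for the child, retrying if interrupted. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A converter that runs amok is then killed by the kernel and only costs one failed extraction, while the imap process keeps running.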

Regards
-- tomás


Re: [Dovecot] FTS Plugin design

2009-04-23 Thread rui . carneiro
On Thu, Apr 23, 2009 at 5:47 AM,  wrote:

> Note that some formats might require to seek to some point in the file [1]
> (typically the end), so reading from stdin is awkward (it would require
> stdin to be seekable, so either the app or the caller would have to put
> the whole file somewhere anyway).
>
> [1] Notably PDF has some index tables at EOF - 1k if I remember
> correctly.

I hadn't thought of that before, but I think you are right. The only question 
here is whether to write the data to memory or to disk.

Thank you all,
Rui Carneiro

--
Portugalmail, Comunicações S.A.
www.portugalmail.net


Re: [Dovecot] FTS Plugin design

2009-04-23 Thread Steffen Kaiser


On Wed, 22 Apr 2009, Rui Carneiro wrote:

> I will talk with the developers of those applications about the possibility
> of supporting stdin input (if not supported yet).
>
>> I think the API that fts plugin uses to do the conversion should be
>> generic enough that both approaches would work. Then it would be easier
>> to implement one or another or both eventually.
>
> I think I will try the external applications approach. My developing time
> available is not to much.

Actually, if I consider what the xls-to-HTML converter did lately to our 
webmail frontend, I suggest indexing "alien" formats asynchronously, 
maybe in a low-priority process, not only to avoid potentially long 
conversion times and resource requirements, but also to prevent MUAs from 
re-initiating the search and forcing the IMAP server to index the same file 
simultaneously.


Bye,

-- 
Steffen Kaiser



Re: [Dovecot] FTS Plugin design

2009-04-22 Thread tomas

On Wed, Apr 22, 2009 at 03:51:45PM +0100, Rui Carneiro wrote:

[...]

> Cons:
> - Some programs to parse special formats (p.e. catppt and pdftotext) do not
> accept input from stdin (we need to create temporary files).

[from the peanut gallery here]

Note that some formats might require to seek to some point in the file [1]
(typically the end), so reading from stdin is awkward (it would require
stdin to be seekable, so either the app or the caller would have to put
the whole file somewhere anyway).

[1] Notably PDF has some index tables at EOF - 1k if I remember
correctly.
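
The seekability problem is easy to check from C: on POSIX systems lseek() on a pipe, FIFO or socket fails with ESPIPE. The helper below is a small sketch with a made-up name (`fd_is_seekable()`); a converter or its caller could use a test like this to decide whether the input has to be spooled to a temporary file first.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if fd supports seeking, 0 if it is a pipe, FIFO or socket
   (lseek fails with ESPIPE), and -1 on any other error. */
static int fd_is_seekable(int fd)
{
	if (lseek(fd, 0, SEEK_CUR) != (off_t)-1)
		return 1;
	return errno == ESPIPE ? 0 : -1;
}
```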

Regards
-- tomás


Re: [Dovecot] FTS Plugin design

2009-04-22 Thread Rui Carneiro
On Wed, Apr 22, 2009 at 5:38 PM, Timo Sirainen  wrote:

> Maybe those programs could be changed and just require the newer
> versions?..


I will talk with the developers of those applications about the possibility
of supporting stdin input (if not supported yet).

> I think the API that fts plugin uses to do the conversion should be
> generic enough that both approaches would work. Then it would be easier
> to implement one or another or both eventually.


I think I will try the external applications approach. I do not have much
development time available.
I will make the API as generic as I can, to allow for possible improvements
in the future.

Regards,
Rui Carneiro


Re: [Dovecot] FTS Plugin design

2009-04-22 Thread Timo Sirainen
On Wed, 2009-04-22 at 15:51 +0100, Rui Carneiro wrote:
> Hi,
> 
> Almost full text search engines (C/C++) I looked (Swish-E, Wumpus,
> Lemur and Xapian) do not use any kind of library or parser. Instead,
> they use other applications like pdftotext, catdoc, catppt (etc) and
> call them with execvp (or equivalent). Using this approach on my
> project have some pros and cons:
> 
> Pros:
> - The existing libraries to extract the content of pdf, doc (etc) are
> not very stable.
> - Easier to handle errors (even if those applications crash dovecot
> will be still running)

Hmm. I hadn't thought of this before. Yeah, if they're not stable it's
probably not a good idea to run in the same process as the rest of
Dovecot. But I guess there could be some kind of a separate text
extracting process that fts plugin would talk to. If that process dies
it could get restarted automatically and fts could maybe retry and if it
dies again log it and just skip over it.

> - Some programs to parse special formats (p.e. catppt and pdftotext)
> do not accept input from stdin (we need to create temporary files).

Maybe those programs could be changed and just require the newer
versions?..

> What approach would be better? Using applications like pdftotext and
> catdoc or, on the other hand, use their libraries and do it almost
> from scratch?

I think the API that fts plugin uses to do the conversion should be
generic enough that both approaches would work. Then it would be easier
to implement one or another or both eventually.




Re: [Dovecot] FTS Plugin design

2009-04-22 Thread Rui Carneiro
Hi,

Almost all the full text search engines (C/C++) I looked at (Swish-E, Wumpus,
Lemur and Xapian) do not use any kind of library or parser. Instead, they use
other applications like pdftotext, catdoc, catppt (etc.) and call them with
execvp (or equivalent). Using this approach in my project has some pros and
cons:

Pros:
- The existing libraries for extracting the content of pdf, doc (etc.) are not
very stable.
- Easier to handle errors (even if those applications crash, Dovecot will
still be running)
- Less development time

Cons:
- Some programs that parse special formats (e.g. catppt and pdftotext) do not
accept input from stdin (we need to create temporary files).

What approach would be better? Using applications like pdftotext and catdoc
or, on the other hand, using their libraries and doing it almost from scratch?

Regards
Rui Carneiro

On Tue, Apr 21, 2009 at 5:52 PM, Rui Carneiro  wrote:

> Great idea!
>
> I will give news soon.
>
>
> On Tue, Apr 21, 2009 at 5:32 PM, Timo Sirainen  wrote:
>
>> I've no idea, but you could at least look at some of the other full text
>> search engines. I remember them advertising indexing support for all kinds
>> of formats. Maybe they're using some specific library or maybe it would be
>> easy to extract their parsing code.
>>
>


-- 
mobile: +351 963446125
mail: rui@gmail.com
mail: ei04...@fe.up.pt
website: http://paginas.fe.up.pt/~ei04073


Re: [Dovecot] FTS Plugin design

2009-04-21 Thread Rui Carneiro
Great idea!

I will give news soon.



Re: [Dovecot] FTS Plugin design

2009-04-21 Thread Timo Sirainen

On Apr 21, 2009, at 6:25 AM, Rui Carneiro wrote:

> Does anyone know of good libraries for handling the content of files
> like pdf, ppt, doc, etc.? I am already indexing attachments; all I need
> now is to extract their text.

I've no idea, but you could at least look at some of the other full text
search engines. I remember them advertising indexing support for all kinds
of formats. Maybe they're using some specific library or maybe it would be
easy to extract their parsing code.


Re: [Dovecot] FTS Plugin design

2009-04-21 Thread Rui Carneiro
Hi again,

Does anyone know of good libraries for handling the content of files like
pdf, ppt, doc, etc.? I am already indexing attachments; all I need now is
to extract their text.

Regards,
Rui Carneiro



Re: [Dovecot] FTS Plugin design

2009-04-20 Thread Rui Carneiro
Hi,

The problem was with the flag: my hex-to-binary conversion was wrong.
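
To illustrate the mistake: 0x16 is 10110 in binary, so it sets three bits
instead of one and overlaps any flags that already use those bits. A small
sketch follows, assuming the existing flags occupy the low four bits. The
enum is a stand-in with made-up names and values, not Dovecot's actual
message_part_flags definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag set for illustration only; names and values are
   assumptions, not copied from Dovecot. */
enum example_part_flags {
    EXAMPLE_FLAG_MULTIPART        = 0x01, /* 00001 */
    EXAMPLE_FLAG_MULTIPART_DIGEST = 0x02, /* 00010 */
    EXAMPLE_FLAG_MESSAGE_RFC822   = 0x04, /* 00100 */
    EXAMPLE_FLAG_TEXT             = 0x08, /* 01000 */
    EXAMPLE_FLAG_ATTACHMENT       = 0x10  /* 10000: next free bit */
};

/* A usable flag has exactly one bit set, i.e. it is a power of two. */
static bool is_single_bit(unsigned int value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

/* Returns the bits of candidate that are already taken in used_mask. */
static unsigned int colliding_bits(unsigned int candidate,
                                   unsigned int used_mask)
{
    return candidate & used_mask;
}
```

With these definitions is_single_bit(0x16) is false and
colliding_bits(0x16, 0x0F) is 0x06, while 0x10 is a single bit that
collides with nothing in the low four bits.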

Regards,
Rui Carneiro




Re: [Dovecot] FTS Plugin design

2009-04-17 Thread Rui Carneiro
Thank you for all the tips. The design looks clearer to me now.

I have one more question. I looked into fts_build_want_index_part() and
saw that I need to add some flags to message_part_flags. What values should
I choose? My first approach was to follow your scheme and set
MESSAGE_PART_FLAG_ATTACHMENT = 0x16. Is there any problem with this?

I have already changed parse_content_type() to set ctx->part->flags
correctly, but if I choose my custom flag Dovecot assumes that all
attachment lines are headers. I also tried setting ctx->part->flags to TEXT,
and then the fts_backend was fed all attachment lines correctly.

I don't know if this is related to the value of
MESSAGE_PART_FLAG_ATTACHMENT or if I am missing something (like setting
block.hdr = NULL or some more code to handle new flags).

Thank you,
Rui Carneiro



Re: [Dovecot] FTS Plugin design

2009-04-15 Thread Timo Sirainen
On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote:
> I haven't yet understood the plugin's design and how the plugins are
> called from the core system, and I was wondering if anyone could help me
> with that.

fts-storage.c hooks into all the functions in mail-storage API that it
needs to. Currently indexing isn't done while messages are being saved,
but instead just before searching. The searching functions are:

 - fts_mailbox_search_init() tries to figure out if FTS can optimize the
search. If it can, it checks whether the FTS index is up-to-date and, if
not, starts the indexing.

 - fts_mailbox_search_next_nonblock() continues the indexing (or
searching after indexing) for a while. The idea is that IMAP connection
is able to process other commands while doing a long-running search. So
fts plugin indexes FTS_SEARCH_NONBLOCK_COUNT (50) messages at a time. It
would be nice if that value was dynamically calculated and also based on
bytes instead of messages, but that's maybe too much trouble.

 - fts_mailbox_search_next_update_seq() uses the fts search results and
updates mail-storage's search stuff so that it doesn't go through
messages that don't match.

 - fts_build_mail() indexes a single mail. It parses the messages and
returns the data in small blocks. For text/* and message/rfc822 parts
those blocks are currently sent to FTS backend. This is where I think
you should look into hooking your attachment parsing. Change
fts_build_want_index_part() to look for more content-types that you're
interested in and then before feeding the blocks to FTS backend put them
through your own converter function, something like:

int attachment_extract_text(struct attachment_extract_context *ctx,
const struct message_block *input, struct message_block *output);
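
As a rough sketch of what such a converter could look like (this is not
actual Dovecot code): the two structs below are simplified stand-ins with
only the fields needed here, and treating an input block with size 0 as
"end of this part" is an assumption of the sketch. Since most attachment
formats can't be converted a block at a time, it buffers the raw data and
only produces output once the part is complete. keep_printable() is a
placeholder for the real format-specific conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the real struct message_block. */
struct message_block {
    const unsigned char *data;
    size_t size;
};

/* Hypothetical context: holds the raw attachment until it is complete. */
struct attachment_extract_context {
    unsigned char *buf;
    size_t used, alloc;
};

/* Placeholder conversion: keep printable ASCII, tabs and newlines,
   compacting the buffer in place. Returns the new length. */
static size_t keep_printable(unsigned char *buf, size_t size)
{
    size_t i, out = 0;

    for (i = 0; i < size; i++) {
        if ((buf[i] >= 0x20 && buf[i] < 0x7f) ||
            buf[i] == '\n' || buf[i] == '\t')
            buf[out++] = buf[i];
    }
    return out;
}

/* Returns 1 if output contains text, 0 if there is nothing to index yet,
   -1 on error. output->data stays valid until the next call. */
int attachment_extract_text(struct attachment_extract_context *ctx,
                            const struct message_block *input,
                            struct message_block *output)
{
    assert(ctx != NULL && input != NULL && output != NULL);

    output->data = NULL;
    output->size = 0;

    if (input->size > 0) {
        /* More raw data: append it and emit nothing yet. */
        if (ctx->used + input->size > ctx->alloc) {
            size_t new_alloc = (ctx->used + input->size) * 2;
            unsigned char *new_buf = realloc(ctx->buf, new_alloc);

            if (new_buf == NULL)
                return -1;
            ctx->buf = new_buf;
            ctx->alloc = new_alloc;
        }
        memcpy(ctx->buf + ctx->used, input->data, input->size);
        ctx->used += input->size;
        return 0;
    }

    /* End of part: convert everything buffered so far. */
    if (ctx->used > 0) {
        output->size = keep_printable(ctx->buf, ctx->used);
        output->data = ctx->buf;
        ctx->used = 0;
    }
    return output->size > 0 ? 1 : 0;
}
```

The caller would start with a zeroed context, feed each block through the
function, pass any non-empty output on to the FTS backend and free
ctx->buf when the mail is done.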



