Re: [Dovecot] FTS Plugin design

2009-09-09 Thread Timo Sirainen
On Tue, 2009-09-08 at 15:47 +0100, rui.carne...@portugalmail.net wrote: Now I am trying to find a way to know the MIME part ID of the parts used in fts_build_mail. Is that already possible, or do I need to do that on my own? If you already get the MIME structure, then I guess you have struct

Re: [Dovecot] FTS Plugin design

2009-09-08 Thread rui . carneiro
Hi again! After using my changes to this plugin for some time, I found one major problem. When a message has two attachments with the same name, or one with a content-type of message/*, my Solr schema design does not work, because I used the attachment's name as the attachment's unique identifier, which is
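One way to avoid name collisions is to build the identifier from values that cannot repeat inside a mailbox. The sketch below is only an illustration: the function name attachment_doc_id() and the key layout "uidvalidity/uid/part" are assumptions made here, not the plugin's actual schema.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds "<uidvalidity>/<uid>/<part index>".
   Two attachments of the same message always differ in the part index,
   so equal file names no longer collide.
   Returns 0 on success, -1 if dest is too small. */
int attachment_doc_id(char *dest, size_t destsize, unsigned int uidvalidity,
		      unsigned int uid, unsigned int part_idx)
{
	int n = snprintf(dest, destsize, "%u/%u/%u",
			 uidvalidity, uid, part_idx);

	return (n < 0 || (size_t)n >= destsize) ? -1 : 0;
}
```

The part index here is just a counter assigned while walking the MIME tree; any stable per-part number would serve the same purpose.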

Re: [Dovecot] FTS Plugin design

2009-05-26 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: So valgrind didn't find anything wrong. Should we ignore LEAK SUMMARY? What does gdb show as the backtrace? My gdb is not writing where it should (or not writing at all). Shouldn't this be enough? mail_executable = /usr/local/libexec/dovecot/gdbhelper

Re: [Dovecot] FTS Plugin design

2009-05-26 Thread Timo Sirainen
On May 26, 2009, at 5:46 AM, Rui Carneiro wrote: Quoting Timo Sirainen t...@iki.fi: So valgrind didn't find anything wrong. Should we ignore LEAK SUMMARY? At least for now. Memory leaks don't cause crashes. What does gdb show as the backtrace? My gdb is not writing where it should (or

Re: [Dovecot] FTS Plugin design

2009-05-26 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: At least for now. Memory leaks don't cause crashes.
Ok.
gdb -p `pidof imap`
cont
make it crash
bt full
I think it won't be necessary. It is not crashing anymore. Maybe it was a bug in my code. Tomorrow (or the day after) I will send you the code.

Re: [Dovecot] FTS Plugin design

2009-05-25 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: I guess it works around some other bug then. If it's a memory-related bug you could also see if valgrind complains about something: protocol imap { .. mail_executable = /usr/bin/valgrind /usr/local/libexec/dovecot/imap } Here is the output (I cloned the

Re: [Dovecot] FTS Plugin design

2009-05-25 Thread Timo Sirainen
On Mon, 2009-05-25 at 14:20 +0100, Rui Carneiro wrote: Quoting Timo Sirainen t...@iki.fi: I guess it works around some other bug then. If it's a memory-related bug you could also see if valgrind complains about something: protocol imap { .. mail_executable = /usr/bin/valgrind

Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Rui Carneiro
Hi Timo, I have almost finished the changes to the fts plugin. For now, it seems to work fine with attachments (extracting and sending them to Solr). I only have a problem with the max size of the command (cmd) that we can send to Solr: #define SOLR_CMDBUF_SIZE (1024*64) For now, if we send some message

Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Timo Sirainen
On Fri, 2009-05-22 at 18:24 +0100, Rui Carneiro wrote: Hi Timo, I have almost finished the changes to the fts plugin. For now, it seems to work fine with attachments (extracting and sending them to Solr). I only have a problem with the max size of the command (cmd) that we can send to Solr: #define

Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: The problem is something else. The Solr code simply tries to keep the send buffer smaller than that, nothing would break if you sent a larger buffer. Show gdb backtrace of the crash? I said it was caused by the buffer size because when I increased it Dovecot
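The point about the buffer size being a soft limit rather than a hard cap can be sketched as follows. This is a stand-alone illustration, not the Solr backend's code: struct cmdbuf, cmdbuf_append() and cmdbuf_flush() are names invented here, and only the 64 kB value is taken from the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CMDBUF_SOFT_LIMIT (1024*64) /* same value as SOLR_CMDBUF_SIZE */

/* Hypothetical send buffer; initialize it to all zeros before use. */
struct cmdbuf {
	char *data;
	size_t used, alloc;
	size_t flushes;    /* how many times the buffer was "sent" */
	size_t bytes_sent; /* total bytes handed to the "network" */
};

/* Pretend to send the buffered data and empty the buffer. */
void cmdbuf_flush(struct cmdbuf *buf)
{
	if (buf->used == 0)
		return;
	/* a real backend would post buf->data to the server here */
	buf->bytes_sent += buf->used;
	buf->flushes++;
	buf->used = 0;
}

/* Append data of any size. The limit only decides when to flush;
   a block larger than the limit is still accepted whole. */
int cmdbuf_append(struct cmdbuf *buf, const char *data, size_t size)
{
	if (size == 0)
		return 0;
	if (buf->used + size > buf->alloc) {
		char *grown = realloc(buf->data, buf->used + size);

		if (grown == NULL)
			return -1;
		buf->data = grown;
		buf->alloc = buf->used + size;
	}
	memcpy(buf->data + buf->used, data, size);
	buf->used += size;
	if (buf->used >= CMDBUF_SOFT_LIMIT)
		cmdbuf_flush(buf);
	return 0;
}
```

With this shape, raising the limit changes how often data is flushed but not whether a large block is accepted, which matches the statement that the crash has to come from somewhere else.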

Re: [Dovecot] FTS Plugin design

2009-05-22 Thread Timo Sirainen
On Fri, 2009-05-22 at 18:57 +0100, Rui Carneiro wrote: Quoting Timo Sirainen t...@iki.fi: The problem is something else. The Solr code simply tries to keep the send buffer smaller than that, nothing would break if you sent a larger buffer. Show gdb backtrace of the crash? I said it

Re: [Dovecot] FTS Plugin design

2009-05-20 Thread Rui Carneiro
Now, with attachment.
/* Copyright (c) 2006-2009 Dovecot authors, see the included COPYING file */
#include "lib.h"
#include "buffer.h"
#include "base64.h"
#include "str.h"
#include "unichar.h"
#include "charset-utf8.h"
#include "quoted-printable.h"
#include "rfc822-parser.h"
#include "rfc2231-parser.h"
#include

Re: [Dovecot] FTS Plugin design

2009-05-19 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: All the data comes from lib-mail/message-decoder.c. Hmm. Looks like it tries to force giving only valid UTF-8 output. I guess it should have some flag or something that makes it do that only for text/* parts, not for binary parts. OK, implemented, see if it
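A small sketch of why forcing valid text output breaks binary attachments, and of the "only for text/* parts" idea. Everything here is illustrative: the function names are invented, and force_ascii() is a deliberately crude stand-in for real UTF-8 validation (it replaces every byte above 0x7F with '?'), not what message-decoder.c does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* True for text/plain, text/html, ... (case-insensitive prefix match). */
bool is_text_part(const char *content_type)
{
	return strncasecmp(content_type, "text/", 5) == 0;
}

/* Crude stand-in for "force valid text": overwrite non-ASCII bytes. */
void force_ascii(unsigned char *data, size_t size)
{
	for (size_t i = 0; i < size; i++) {
		if (data[i] >= 0x80)
			data[i] = '?';
	}
}

/* Sanitize only text parts; binary parts pass through untouched. */
void decode_block(const char *content_type, unsigned char *data, size_t size)
{
	if (is_text_part(content_type))
		force_ascii(data, size);
}
```

Applied unconditionally, the sanitizing step would rewrite the high bytes that follow a PDF header such as 25 50 44 46, which is the kind of corruption described in the thread.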

Re: [Dovecot] FTS Plugin design

2009-05-19 Thread Timo Sirainen
On Tue, 2009-05-19 at 14:40 +0100, Rui Carneiro wrote: http://hg.dovecot.org/dovecot-1.2/rev/44548a7fb10d It is working now, but I needed to make some changes to your code. OK. Please see the attachment to check for any problems that may exist. You forgot the attachment.

Re: [Dovecot] FTS Plugin design

2009-05-19 Thread Rui Carneiro
On Tue, May 19, 2009 at 8:51 PM, Timo Sirainen t...@iki.fi wrote: You forgot the attachment. Oh, sorry. I am not at the office now (almost 10pm here); I will send it tomorrow morning. Rui Carneiro --- Portugalmail, Comunicações S.A. www.portugalmail.net

Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Rui Carneiro
Hi again, I am having some trouble sending all the data to a file. When I finished sending all the data to a file, I tried to open it and the file was corrupted. The first thing I noticed is that all chars are capitalized, which destroys the whole file format. Where are the chars capitalized? Any other idea

Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Timo Sirainen
On May 18, 2009, at 6:42 AM, Rui Carneiro wrote: I am having some trouble sending all the data to a file. When I finished sending all the data to a file, I tried to open it and the file was corrupted. The first thing I noticed is that all chars are capitalized, which destroys the whole file format.

Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: Nope. If you still see corruption, try with some simple test mails and see if it's adding garbage, losing contents or adding more content. I tried something more advanced than that. I hexdumped my PDF test file and on the first line I get: 25 50 44

Re: [Dovecot] FTS Plugin design

2009-05-18 Thread Timo Sirainen
On Mon, 2009-05-18 at 17:35 +0100, Rui Carneiro wrote: I think binary data is being corrupted somewhere before fts_backend_build_more() and I don't have any idea where. All the data comes from lib-mail/message-decoder.c. Hmm. Looks like it tries to force giving only valid UTF-8 output. I guess

Re: [Dovecot] FTS Plugin design

2009-05-15 Thread Rui Carneiro
Quoting Timo Sirainen t...@iki.fi: 1. You notice a non-text/* content-type and initialize text extraction for the MIME part. Like: struct attachment_extract_context * attachment_extract_init(const char *content_type); 2. After this you feed all the input belonging to that MIME part to:
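The init/feed flow described above could look roughly like this. Only attachment_extract_init() and its signature come from the message; the preview is cut off before the remaining functions are named, so attachment_extract_more() and attachment_extract_deinit() are hypothetical names chosen for this sketch, and the body simply collects bytes in memory instead of converting anything.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct attachment_extract_context {
	char *content_type;  /* copy of the MIME part's content-type */
	unsigned char *data; /* bytes collected so far */
	size_t size;
};

/* Step 1: start extraction for one non-text MIME part. */
struct attachment_extract_context *
attachment_extract_init(const char *content_type)
{
	struct attachment_extract_context *ctx = calloc(1, sizeof(*ctx));

	if (ctx == NULL)
		return NULL;
	ctx->content_type = malloc(strlen(content_type) + 1);
	if (ctx->content_type == NULL) {
		free(ctx);
		return NULL;
	}
	strcpy(ctx->content_type, content_type);
	return ctx;
}

/* Step 2 (hypothetical name): feed one block of the part's data. */
int attachment_extract_more(struct attachment_extract_context *ctx,
			    const unsigned char *data, size_t size)
{
	unsigned char *grown;

	assert(ctx != NULL);
	if (size == 0)
		return 0;
	grown = realloc(ctx->data, ctx->size + size);
	if (grown == NULL)
		return -1;
	memcpy(grown + ctx->size, data, size);
	ctx->data = grown;
	ctx->size += size;
	return 0;
}

/* Final step (hypothetical name): a real version would convert the
   collected bytes to text here. This one only reports the size and
   frees everything. */
size_t attachment_extract_deinit(struct attachment_extract_context **_ctx)
{
	struct attachment_extract_context *ctx = *_ctx;
	size_t size = ctx->size;

	*_ctx = NULL;
	free(ctx->data);
	free(ctx->content_type);
	free(ctx);
	return size;
}
```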

Re: [Dovecot] FTS Plugin design

2009-05-13 Thread Timo Sirainen
On Tue, 2009-05-05 at 12:08 +0100, Rui Carneiro wrote: - fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you should look into

Re: [Dovecot] FTS Plugin design

2009-05-05 Thread Rui Carneiro
Hi again, On Wed, Apr 15, 2009 at 11:23 PM, Timo Sirainen t...@iki.fi wrote: - fts_build_mail() indexes a single mail. It parses the messages and returns the data in small blocks. For text/* and message/rfc822 parts those blocks are currently sent to FTS backend. This is where I think you

Re: [Dovecot] FTS Plugin design

2009-04-23 Thread Steffen Kaiser
On Wed, 22 Apr 2009, Rui Carneiro wrote: I will talk with the developers of those applications about the possibility of supporting stdin input (if not supported yet). I think the API that fts plugin uses to do the conversion should be generic

Re: [Dovecot] FTS Plugin design

2009-04-23 Thread rui . carneiro
On Thu, Apr 23, 2009 at 5:47 AM, to...@tuxteam.de wrote: Note that some formats might require to seek to some point in the file [1] (typically the end), so reading from stdin is awkward (it would require stdin to be seekable, so either the app or the caller would have to put the
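If a converter needs a seekable input, one option on the caller's side is to spool the attachment into a temporary file first. A minimal sketch, assuming a POSIX system; spool_to_tempfile() is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write size bytes to a fresh temporary file. path_template must end in
   "XXXXXX" and is overwritten with the real file name. Returns a file
   descriptor positioned at offset 0, or -1 on failure. The caller
   closes the descriptor and unlinks the file when done. */
int spool_to_tempfile(char *path_template, const void *data, size_t size)
{
	const char *p = data;
	int fd = mkstemp(path_template);

	if (fd < 0)
		return -1;
	while (size > 0) {
		ssize_t n = write(fd, p, size);

		if (n <= 0) {
			close(fd);
			unlink(path_template);
			return -1;
		}
		p += n;
		size -= (size_t)n;
	}
	if (lseek(fd, 0, SEEK_SET) < 0) {
		close(fd);
		unlink(path_template);
		return -1;
	}
	return fd;
}
```

The resulting path can then be passed to a converter as a file argument, and the descriptor itself supports lseek(), which a pipe on stdin does not.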

Re: [Dovecot] FTS Plugin design

2009-04-23 Thread tomas
On Thu, Apr 23, 2009 at 12:27:47PM +0100, rui.carne...@portugalmail.net wrote: On Thu, Apr 23, 2009 at 5:47 AM, to...@tuxteam.de wrote: Note that some formats might require to seek to some point in the file [1] [...] I hadn't thought on

Re: [Dovecot] FTS Plugin design

2009-04-22 Thread Rui Carneiro
Hi, Almost all of the full text search engines (C/C++) I looked at (Swish-E, Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead, they use other applications like pdftotext, catdoc, catppt (etc) and call them with execvp (or equivalent). Using this approach in my project has some pros
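The "call an external converter with execvp" approach can be sketched like this for POSIX systems. run_converter() is a hypothetical helper written for this example; it is not taken from any of the search engines mentioned, and it skips details such as EINTR handling and timeouts.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] via execvp() and capture up to outsize-1 bytes of its
   stdout into out (always NUL-terminated). Output beyond that is read
   and discarded so the child can finish. Returns the number of bytes
   captured, or -1 if the command could not be run or exited non-zero. */
ssize_t run_converter(char *const argv[], char *out, size_t outsize)
{
	char scratch[512];
	size_t used = 0;
	ssize_t n;
	int fds[2], status;
	pid_t pid;

	assert(outsize > 0);
	if (pipe(fds) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* child: send stdout into the pipe, then become the converter */
		close(fds[0]);
		if (dup2(fds[1], STDOUT_FILENO) < 0)
			_exit(127);
		close(fds[1]);
		execvp(argv[0], argv);
		_exit(127); /* only reached if execvp() failed */
	}
	/* parent: read the converter's output until EOF */
	close(fds[1]);
	for (;;) {
		if (used < outsize - 1) {
			n = read(fds[0], out + used, outsize - 1 - used);
			if (n > 0)
				used += (size_t)n;
		} else {
			n = read(fds[0], scratch, sizeof(scratch));
		}
		if (n <= 0)
			break;
	}
	out[used] = '\0';
	close(fds[0]);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return (ssize_t)used;
}
```

For a converter that only accepts a file name, the attachment would first have to be written to a temporary file and that path placed in argv.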

Re: [Dovecot] FTS Plugin design

2009-04-22 Thread Timo Sirainen
On Wed, 2009-04-22 at 15:51 +0100, Rui Carneiro wrote: Hi, Almost all of the full text search engines (C/C++) I looked at (Swish-E, Wumpus, Lemur and Xapian) do not use any kind of library or parser. Instead, they use other applications like pdftotext, catdoc, catppt (etc) and call them with execvp (or

Re: [Dovecot] FTS Plugin design

2009-04-22 Thread Rui Carneiro
On Wed, Apr 22, 2009 at 5:38 PM, Timo Sirainen t...@iki.fi wrote: Maybe those programs could be changed and just require the newer versions?.. I will talk with the developers of those applications about the possibility of supporting stdin input (if not supported yet). I think the API that

Re: [Dovecot] FTS Plugin design

2009-04-22 Thread tomas
On Wed, Apr 22, 2009 at 03:51:45PM +0100, Rui Carneiro wrote: [...] Cons: - Some programs to parse special formats (e.g. catppt and pdftotext) do not accept input from stdin (we need to create temporary files). [from the peanut gallery here]

Re: [Dovecot] FTS Plugin design

2009-04-21 Thread Rui Carneiro
Hi again, Does anyone know some good libraries to handle the content of files like pdf, ppt, doc, etc? I am already indexing attachments; all I need now is to extract the text from them. Regards, Rui Carneiro On Mon, Apr 20, 2009 at 3:29 PM, Rui Carneiro rui@gmail.com wrote: Hi, The problem was on

Re: [Dovecot] FTS Plugin design

2009-04-21 Thread Timo Sirainen
On Apr 21, 2009, at 6:25 AM, Rui Carneiro wrote: Does anyone know some good libraries to handle the content of files like pdf, ppt, doc, etc? I am already indexing attachments; all I need now is to extract the text from them. I've no idea, but you could at least look at some of the other full text

Re: [Dovecot] FTS Plugin design

2009-04-21 Thread Rui Carneiro
Great idea! I will report back soon. On Tue, Apr 21, 2009 at 5:32 PM, Timo Sirainen t...@iki.fi wrote: I've no idea, but you could at least look at some of the other full text search engines. I remember them advertising indexing support for all kinds of formats. Maybe they're using some

Re: [Dovecot] FTS Plugin design

2009-04-20 Thread Rui Carneiro
Hi, The problem was with the flag. My hex-to-binary conversion was wrong. Regards, Rui Carneiro On Fri, Apr 17, 2009 at 10:03 AM, Rui Carneiro rui@gmail.com wrote: Thank you for all the tips. The design looks clearer to me now. I have one more question. I looked into

Re: [Dovecot] FTS Plugin design

2009-04-17 Thread Rui Carneiro
Thank you for all the tips. The design looks clearer to me now. I have one more question. I looked into fts_build_want_index_part() and I saw that I need to add some flags to message_part_flags, what values should I choose? My first approach was to follow your schema and set
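On choosing a value for a new flag: bit flags that are combined with | each need their own bit, so a new value should be a power of two that no existing flag uses. The enum below uses made-up names and values for illustration only; it is not Dovecot's message_part_flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags; only the bit layout matters here. */
enum part_flags {
	PART_FLAG_MULTIPART  = 0x01, /* binary 0001 */
	PART_FLAG_RFC822     = 0x02, /* binary 0010 */
	PART_FLAG_TEXT       = 0x04, /* binary 0100 */
	PART_FLAG_ATTACHMENT = 0x08  /* binary 1000: next unused bit */
};

/* A usable flag value has exactly one bit set. */
bool flag_is_single_bit(unsigned int flag)
{
	return flag != 0 && (flag & (flag - 1)) == 0;
}

/* A new flag must not share a bit with the flags already in use. */
bool flag_is_free(unsigned int used_mask, unsigned int flag)
{
	return flag_is_single_bit(flag) && (used_mask & flag) == 0;
}
```

A value such as 0x0A (binary 1010) would fail the first check because it sets two bits at once, which is the kind of mistake a wrong hex-to-binary conversion produces.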

Re: [Dovecot] FTS Plugin design

2009-04-15 Thread Timo Sirainen
On Mon, 2009-04-13 at 11:18 +0100, Rui Carneiro wrote: I haven't yet understood what the plugin's design is and how the plugins are called from the core system, and I was wondering if anyone could help me with that. fts-storage.c hooks into all the functions in mail-storage API that it needs to.
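The general idea of a plugin hooking into an API through function pointers can be sketched as below. This is a generic illustration only: struct mailbox, struct mailbox_vfuncs, fts_hook_mailbox() and the other names are simplified stand-ins invented here, not the real mail-storage API or the code in fts-storage.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table of overridable functions on a storage object. */
struct mailbox_vfuncs {
	int (*search_init)(const char *query);
};

struct mailbox {
	struct mailbox_vfuncs v;
};

/* Saved copy of the original functions, so the hook can call through. */
static struct mailbox_vfuncs saved_v;
int fts_lookups;

/* Stand-in for the core's own implementation. */
int core_search_init(const char *query)
{
	(void)query;
	return 0;
}

/* The plugin's replacement: do the plugin's work, then call the original. */
int fts_search_init(const char *query)
{
	fts_lookups++;
	return saved_v.search_init(query);
}

/* Called once per mailbox: remember the originals, install the hook. */
void fts_hook_mailbox(struct mailbox *box)
{
	assert(box != NULL);
	saved_v = box->v;
	box->v.search_init = fts_search_init;
}
```

The core keeps calling box->v.search_init() as before and never needs to know a plugin is in the path.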

[Dovecot] FTS Plugin design

2009-04-13 Thread Rui Carneiro
Hi all, Currently I am developing some changes to the Solr plugin. I want this plugin to also index the attachments' content. I have already started to look at the plugin's source, but I am having some problems understanding how it works. I haven't yet understood what the plugin's design is and how