Re: [Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

2011-08-31 Thread Timo Sirainen
On Mon, 2011-05-23 at 13:11 +0200, Antonio Perez-Aranda wrote:
> Indexing mail attachments with Dovecot + Solr.

I've been looking at this and wondering about a few things:

The example solrconfig.xml contains:

   <requestHandler name="/update/extract"
 class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
 startup="lazy">
 ..
   <!-- capture link hrefs but ignore div attributes -->
   <str name="captureAttr">true</str>
   <str name="fmap.a">links</str>
   <str name="fmap.div">ignored_</str>
 </lst>

To me it looks like this requires that there exists a links field that
is used for.. I guess content between <a>..</a> tags? Or also for the
href URLs? In any case there's no links field in the schema.xml, so I
don't think this works?

Similarly it looks like stuff between <div>..</div> is ignored here,
which doesn't seem like a good idea.

> There is a new setting for the plugin section to filter the MIME types
> that you want to index:
>  * fts_solr_mimetype
>    Files with these MIME types will be sent to Solr.

In v2.1 I've added a generic fts decoder script that can handle
attachment decoding. The script contains stuff like:

formats='application/pdf pdf
application/x-pdf pdf
application/msword doc
..

So there already exists a place which can list supported MIME types and
also what filename extensions they have, so if there's
application/octet-stream with filename=foo.pdf, Dovecot's fts code can
change the MIME type to application/pdf. This sounds like it could be
useful for the Solr attachments too. Maybe instead of the
fts_solr_mimetype setting, the script could be modified a bit so that it
would even allow mixed Solr/script attachment extraction. For example:

formats='+application/pdf pdf
+application/x-pdf pdf
application/msword doc'

The + prefix could tell that the FTS backend (Solr) handles the MIME
type instead of the script. So with the above config Solr would
decode .pdfs, but the script would decode .docs.
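
A minimal Python sketch of how such a table could be read, to make the
idea concrete. Everything here is an assumption for illustration: the
names parse_formats and resolve are hypothetical, the table is copied
from the example above, and the real script is a shell script that may
work quite differently.

```python
import os

# Hypothetical copy of the table from the example above; a leading "+"
# marks MIME types that the FTS backend (Solr) would decode itself.
FORMATS = """\
+application/pdf pdf
+application/x-pdf pdf
application/msword doc
"""


def parse_formats(text):
    """Return {mime_type: (extension, handled_by_backend)}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        mime_type, extension = line.split()
        handled_by_backend = mime_type.startswith("+")
        table[mime_type.lstrip("+")] = (extension, handled_by_backend)
    return table


def resolve(table, mime_type, filename=None):
    """Return (effective MIME type, decoder); decoder is None if unsupported.

    application/octet-stream is remapped by filename extension, using the
    first table entry that lists that extension.
    """
    if mime_type == "application/octet-stream" and filename:
        ext = os.path.splitext(filename)[1].lstrip(".").lower()
        for candidate, (known_ext, _) in table.items():
            if known_ext == ext:
                mime_type = candidate
                break
    if mime_type not in table:
        return mime_type, None
    return mime_type, "solr" if table[mime_type][1] else "script"
```

With this table, an application/octet-stream part named foo.pdf would be
treated as application/pdf and routed to Solr, while application/msword
would still go to the script.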

I was also thinking that the attachment documents could contain some
description fields as well, which could be useful if you're searching
the Solr index directly instead of via Dovecot. Maybe fields like
attachment_filename (parsed from Content-Disposition: header) and
attachment_description (parsed from Content-Description: header). They
could of course be empty if those fields don't exist (and probably
should be optional anyway).

Also there should be an attachment_part field that would contain the
IMAP MIME part number of the attachment (e.g. 2.1.3), so it would be
easy to find and fetch the attachment. This could also be used as part
of the ID string instead of the attachment_count.
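
A rough Python sketch of how these fields could be derived from a
message, using only the standard email package. The function name
attachment_fields is hypothetical, and the sketch makes two simplifying
assumptions: a part counts as an attachment if it has a filename, and
nested message/rfc822 parts are skipped, since IMAP numbers them
differently from plain multiparts.

```python
from email import message_from_string


def attachment_fields(msg, prefix=""):
    """Yield one dict per attachment, keyed by the proposed field names."""
    if not msg.is_multipart():
        return
    for index, part in enumerate(msg.get_payload(), start=1):
        number = f"{prefix}{index}"
        if part.get_content_type() == "message/rfc822":
            # Embedded messages follow different IMAP numbering rules;
            # they are left out of this sketch.
            continue
        if part.is_multipart():
            # Children of a nested multipart get a dotted number, e.g. 2.1
            yield from attachment_fields(part, prefix=number + ".")
        elif part.get_filename() is not None:
            yield {
                "attachment_part": number,
                # get_filename() reads Content-Disposition first.
                "attachment_filename": part.get_filename(),
                # Empty string when the header does not exist.
                "attachment_description": part.get("Content-Description", ""),
            }
```

Calling list(attachment_fields(message_from_string(raw))) on a raw
message string gives one dict per attachment, which could then be added
to the Solr document for that attachment.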



Re: [Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

2011-05-27 Thread Antonio Perez-Aranda
I can confirm that this patch is running against Dovecot 2.0.13.

2011/5/23 Antonio Perez-Aranda aperezara...@yaco.es:
> Yes and I have it in my TODO, but we are using this version on a
> production system. And it is our base system for development.
>
> 2011/5/23 Charles Marcus cmar...@media-brokers.com:
>> On 2011-05-23 7:11 AM, Antonio Perez-Aranda wrote:
>>> Indexing mail attachments with Dovecot + Solr.
>>>
>>> This patch has been tested with these versions:
>>>  * dovecot 2.0.9
>>>  * apache-solr 1.4.1
>>
>> Isn't it customary - and logical - to always test/patch against the
>> current stable RELEASE version (ie, 2.0.13)?

-- 
Antonio Pérez-Aranda Alcaide
aperezara...@yaco.es

Yaco Sistemas S.L.
http://www.yaco.es/
C/ Rioja 5, 41001 Sevilla
Teléfono +34 954 50 00 57
Fax      +34 954 50 09 29


[Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

2011-05-23 Thread Antonio Perez-Aranda
Indexing mail attachments with Dovecot + Solr.

This patch has been tested with these versions:
 * dovecot 2.0.9
 * apache-solr 1.4.1

This is a patch for the fts-solr plugin (which indexes mail messages
for Dovecot with Solr). In mainline, the plugin does not index
attachments; with this patch, you can index mails and their
attachments (PDF, DOC, OpenOffice documents...). You also get other
goodies with this patch and the provided Solr config, such as synonyms
and stemming (Spanish by default).

Attachment indexing is provided by Solr Cell and Tika (ExtractingRequestHandler)
 * http://wiki.apache.org/solr/ExtractingRequestHandler
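
To illustrate what a Solr Cell request looks like, here is a minimal
Python sketch that builds (but does not send) a POST to the
/update/extract handler. It is not how the patch itself talks to Solr:
the function name build_extract_request and the document ID are
hypothetical, and only the literal.id and commit request parameters are
taken from the ExtractingRequestHandler documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, doc_id, data, mime_type):
    """Build a POST that asks Solr Cell to extract and index raw bytes."""
    params = urlencode({
        "literal.id": doc_id,  # sets the document's id field
        "commit": "true",      # make the document searchable right away
    })
    url = solr_url.rstrip("/") + "/update/extract?" + params
    return Request(
        url,
        data=data,
        headers={"Content-Type": mime_type},
        method="POST",
    )
```

The request could then be sent with urllib.request.urlopen(request)
against a running Solr instance.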

Synonyms and Stemming are provided by SnowballPorterFilterFactory from
Solr Language Analysis:
 * http://wiki.apache.org/solr/LanguageAnalysis

We have tested Solr with Tomcat and Jetty. Tomcat handles UTF-8 and
larger POSTs better.

Supported attachment file formats:
 * http://tika.apache.org/0.9/formats.html

At present, attachments inside attachments (for example, attachments
in forwarded .eml attachments) are not indexed. Also, keep in mind
that there are many types of files, and many variants of the same file
type. For example, some PDF files cannot be read by Solr's PDF
reader.

Config:

There are two new options added to the fts_solr setting:
 * index-attachments
   Enable attachment indexing.
 * manual-update
   Do not index on user search. You can trigger indexing using the
   doveadm search or doveadm index commands.

There is a new setting for the plugin section to filter the MIME types
that you want to index:
 * fts_solr_mimetype
   Files with these MIME types will be sent to Solr.
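
The filter can be pictured with a short Python sketch. It assumes the
setting is a whitespace-separated list of MIME types that are matched
exactly and case-insensitively; the patch's C code may behave
differently, and both function names are hypothetical.

```python
def parse_mimetypes(setting_value):
    """Split a whitespace-separated fts_solr_mimetype value into a set."""
    return {mime_type.lower() for mime_type in setting_value.split()}


def should_send_to_solr(content_type, allowed):
    """Check a Content-Type header value against the allowed set."""
    # Drop parameters such as '; name="foo.pdf"' before comparing.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in allowed
```

For example, with allowed = parse_mimetypes("application/x-pdf"), a part
with Content-Type 'application/x-pdf; name="foo.pdf"' would be sent to
Solr and a text/plain part would not.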

After integrating the solr directory into your Solr config, and building
Dovecot with fts-solr support and with fts-solr-attachments-r885.patch
applied, you can update your Dovecot config by adding this to your
dovecot.conf:

...
mail_plugins = $mail_plugins fts fts_solr

plugin {
   fts = solr
   fts_solr = url=http://solrhost:8983/solr/ break-imap-search index-attachments
   fts_solr_mimetype = application/x-pdf application/vnd.openxmlformats-officedocument.wordprocessingml.document
}
...





Re: [Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

2011-05-23 Thread Antonio Perez-Aranda
Sorry, I forgot to include the attachment.


fts-solr-attachments-r885.tar.gz
Description: GNU Zip compressed data


Re: [Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

2011-05-23 Thread Charles Marcus
On 2011-05-23 7:11 AM, Antonio Perez-Aranda wrote:
> Indexing mail attachments with Dovecot + Solr.
>
> This patch has been tested with these versions:
>  * dovecot 2.0.9
>  * apache-solr 1.4.1

Isn't it customary - and logical - to always test/patch against the
current stable RELEASE version (ie, 2.0.13)?

-- 

Best regards,

Charles


Re: [Dovecot] [PATCH] Indexing mail attachments with Dovecot + Solr

2011-05-23 Thread Antonio Perez-Aranda
Yes and I have it in my TODO, but we are using this version on a
production system. And it is our base system for development.

2011/5/23 Charles Marcus cmar...@media-brokers.com:
> On 2011-05-23 7:11 AM, Antonio Perez-Aranda wrote:
>> Indexing mail attachments with Dovecot + Solr.
>>
>> This patch has been tested with these versions:
>>  * dovecot 2.0.9
>>  * apache-solr 1.4.1
>
> Isn't it customary - and logical - to always test/patch against the
> current stable RELEASE version (ie, 2.0.13)?
