On 06.05.2018 13:13, André Rodier wrote:
> Hello again,
>
> I have created a parser script, a little bit more advanced than the
> one provided with Dovecot. The main feature is probably to index
> documents inside zip/rar/tgz archives...
>
> I am using Ansible, swaks and doveadm to run automatic tests for each
> supported content. For specific reasons, I am not yet able to add
> Apache Tika to the distribution. However, I already made some tests
> with it. For now, I want to talk about the indexing script.
>
> I have also noticed a few weird behaviours. I will mention them at the
> end, albeit I am not 100% sure where they are coming from. I realised
> last week that using QEMU snapshots was not working as expected, so I
> am now more careful with this feature.
>
> For the developers or users who would be interested, and for the Dovecot
> team members to understand my questions, here is how the tests work:
>
> To run my tests, I have a set of files in various formats, with a UUID
> inside. They are office files, text files, or even archives with a
> text file inside...
>
> The first test I am running is the script alone. I check that the
> script can convert the file to text, and then I use grep to check the
> UUID is present. This works *perfectly* for all the content, except
> ppt, but it's minor.
>
> The second test is full:
> - I use swaks to send the email with an attachment and the appropriate
> mime type.
> - I then refresh the index using doveadm rescan.
> - I check that fts search returns a line, with doveadm fts search.
> - I then expunge the mailbox to be sure that the next test is valid.
>
> For the second test, it works almost all the time, except in the
> following situations:
> - When the attachment is an email (mime type message/rfc822)
> - RTF (could be a bug in my script)
> - Text file in UTF16 (Even if this file is converted to UTF8)
>
> *Questions:*
> 1 - Is there any limitation or special case for the mime message/rfc822
Not that I can see in the decoder.

> 2 - Is the mime type received coming from the email headers?

The mime type received comes from the mail header, unless it is
"application/octet-stream", in which case autodetection is attempted
based on the file suffix.

> 3 - When the script is called without arguments, what is the purpose
> of the extension at the end of each supported mime types?

The idea is to provide mappings for the decoder, so that if the content
type is "application/octet-stream", autodetection can be performed.

> 4 - Can I return a wildcard in the supported mime types, for instance
> "text/* *" ?

Content type matching is done with strcmp, so a wildcard such as
"text/*" would only match that literal string. This is probably a bit
suboptimal; I have to make a note of this.

> 5 - I would like to handle attachments of types
> application/octet-stream. I have added "application/octet-stream *",
> but I am not sure if dovecot will pass the attachments with these mime
> type or not.
>

application/octet-stream is already handled in code.

> *Notes:*
> 1 - I used netcat to monitor the solr server. I realise that
> sometimes, the data sent to the solr server only contains the headers
> of the email, not the text returned by the parser. Especially with
> rfc822 messages. I will do more tests.
> 2 - I just finished writing the script, it's not yet refactored, but
> at least it is well documented. I will do a full security audit later.
> I am actually testing an associated AppArmor profile.
> 3 - I will do more intensive tests of the script on bigger mailboxes
> with more attachments.
> 4 - I may rewrite the script in Python
> 5 - Suggestions welcome.
>
> I initially attached the current version of the script, but the email
> is probably pending for review... In this case, the last development
> version is on Github:
> https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dovecot/files/fts/decode2text
> The configuration of supported mime types is a simple file, accessible
> on github as well:
> https://github.com/progmaticltd/homebox/blob/dev/install/playbooks/roles/dovecot/templates/fts/mime-supported.conf
>
> Thanks for your advice or suggestions.

Aki Tuomi
Dovecot oy
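
For illustration only, the content type handling described in the answers to questions 2 to 5 (header type used as given, suffix-based autodetection only for "application/octet-stream", exact string comparison) could be sketched roughly like this in Python. This is not Dovecot's code: the function name resolve_content_type, the SUPPORTED table and its entries are hypothetical and only mimic the "mime-type extension" pairs the script prints when called without arguments.

```python
# Hypothetical "mime type -> extension" table, in the spirit of the
# pairs a decoder script reports; the entries are examples only.
SUPPORTED = {
    "application/pdf": "pdf",
    "application/rtf": "rtf",
    "text/plain": "txt",
}


def resolve_content_type(header_type, filename, supported=SUPPORTED):
    """Return the supported mime type to decode with, or None.

    Assumed behaviour, following the description above:
    - the type from the mail header is used as-is,
    - only "application/octet-stream" triggers suffix autodetection,
    - matching is an exact string comparison (no wildcards).
    """
    if header_type == "application/octet-stream":
        # Autodetect from the file suffix using the extension mappings.
        if "." not in filename:
            return None
        suffix = filename.rsplit(".", 1)[1]
        for mime, ext in supported.items():
            if ext == suffix:
                return mime
        return None
    # Exact comparison, like strcmp: "text/*" never equals "text/plain".
    return header_type if header_type in supported else None


if __name__ == "__main__":
    print(resolve_content_type("application/pdf", "whatever.bin"))
    print(resolve_content_type("application/octet-stream", "report.pdf"))
    print(resolve_content_type("text/html", "page.html", {"text/*": "*"}))
```

With exact matching, a "text/* *" entry is just another literal key, which is why the last call finds nothing for "text/html".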