On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson
<ben.john...@jandpconsulting.co.uk> wrote:
> Thanks Paul and Preetam. A couple of further things:
>
> - How do you envisage this functionality being used? I can see indexing all
> emails for all users as part of a one-off system setup/migration process,
> but also as a core feature to ensure all emails received by a
> company/organisation are indexed (and stored). This could be done either by
> the end-user, who controls what should be indexed (i.e. certain work-related
> emails only) or directly from the mail server, where all emails would be
> indexed (including personal emails, which could later be deleted from the
> index if desired) to ensure no important emails get missed. Is this the
> sort of thing you had in mind? There is also the issue of not
> indexing/storing the same email from multiple users' mailboxes (haven't
> worked that one out yet, possibly via a hash).
>
> - Is the mailbox 'configuration' (<entity> tag) stored in data-config.xml on
> the Solr server? If so, this would seem to have quite a lot of

Do you want all users' mails to be indexed into a single index? That is
possible by passing the username and password as request parameters.

> administrative overhead - how do you manage a system with 5000+ users? How
> are the accounts/passwords maintained? Are the passwords stored in plain
> text?
>
> - Minor typo: *conectTimeout* should be *connectTimeout*
>
> - A few real-world scenarios I've encountered are:
>   - be able to handle an email sent to over 5000 recipients (in the 'To:'
>     field)
>   - be able to handle an email with a 'long' subject line (240+ characters)
>   - be able to handle an email with 100 attachments
>   - be able to handle an email with attachments with 'long' names (240+
>     characters)
>
> These caused several problems in the software I was using at the time (a
> proprietary system, not Solr-based), either memory-related issues or file
> system errors when running on Windows where the file system or its API
> limited file names to 255 characters, including the path.
>
> Thanks very much!
> Ben
>
> --------------------------------------------------
> From: "Noble Paul നോബിള് नोब्ळ्" <noble.p...@gmail.com>
> Sent: Friday, January 02, 2009 5:02 AM
> To: <solr-dev@lucene.apache.org>
> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a
> solr index through DIH.
>
>> Hi Ben,
>> You can take a look at the wiki page for DIH
>> http://wiki.apache.org/solr/DataImportHandler
>>
>> It helps you index mostly structured data into Solr from db, xml etc.
>> It can be considered as an ETL tool
>> (http://en.wikipedia.org/wiki/Extract,_transform,_load ) for Solr.
>>
>> Adding mail support means you can index your emails into Solr with a
>> few lines of configuration
>> --Noble
>>
>> On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
>> <ben.john...@jandpconsulting.co.uk> wrote:
>>>
>>> I'm watching this issue with interest, but I'm having trouble
>>> understanding the bigger picture. I am prototyping a system that uses
>>> Restlet to store and index objects (mainly MS Office and OpenOffice
>>> documents and emails), so I am planning to use Solr with Tika to index
>>> the objects.
>>>
>>> I know nothing about DIH (Distributed Index Handler?), so I'm not sure
>>> what role it plays with Solr. Is it a vendor-specific technology (from
>>> Autonomy)? What does it do? Do you give it objects to index and it
>>> handles them by passing it to one or more Solr/Tika indexing servers?
>>> And are you thinking that this would therefore be a good place to not
>>> only index the objects, but also pass the information about the digital
>>> content to DROID?
>>>
>>> Reading a bit about DROID (from TNA, The National Archives), it seems
>>> like it is used to capture information about the digital content of
>>> objects stored in a content repository. How does this fit with Solr? I
>>> thought Solr with Tika just did the indexing of text-based objects, but
>>> the actual storage of the objects would be elsewhere (probably in the
>>> file system). From what I can tell, DROID would operate on the file
>>> system objects, not the indexing information. Have I got this right?
>>>
>>> Ideally, I would also like to convert any suitable content into PDF/A
>>> format for long-term archival - probably not relevant to this issue, but
>>> I thought I'd mention it in case you see an application of this as part
>>> of email and attachment storage.
>>>
>>> Sorry for all the questions, but hopefully someone could clarify this
>>> for me!
>>>
>>> Thanks very much
>>> Ben Johnson
>>>
>>> --------------------------------------------------
>>> From: "Grant Ingersoll (JIRA)" <j...@apache.org>
>>> Sent: Thursday, January 01, 2009 7:07 PM
>>> To: <solr-dev@lucene.apache.org>
>>> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a
>>> solr index through DIH.
>>>
>>>>
>>>> [
>>>> https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
>>>> ]
>>>>
>>>> Grant Ingersoll commented on SOLR-934:
>>>> --------------------------------------
>>>>
>>>> Would it make more sense for DIH to farm out its content acquisition
>>>> to a library like Droids? Then, we could have real crawling, etc. all
>>>> through a pluggable connector framework.
>>>>
>>>>> Enable importing of mails into a solr index through DIH.
>>>>> --------------------------------------------------------
>>>>>
>>>>> Key: SOLR-934
>>>>> URL: https://issues.apache.org/jira/browse/SOLR-934
>>>>> Project: Solr
>>>>> Issue Type: New Feature
>>>>> Components: contrib - DataImportHandler
>>>>> Affects Versions: 1.4
>>>>> Reporter: Preetam Rao
>>>>> Assignee: Shalin Shekhar Mangar
>>>>> Fix For: 1.4
>>>>>
>>>>> Attachments: SOLR-934.patch, SOLR-934.patch
>>>>>
>>>>> Original Estimate: 24h
>>>>> Remaining Estimate: 24h
>>>>>
>>>>> Enable importing of mails into solr through DIH. Take one or more
>>>>> mailbox credentials, download and index their content along with the
>>>>> content from attachments. The folders to fetch can be made
>>>>> configurable based on various criteria. Apache Tika is used for
>>>>> extracting content from different kinds of attachments. JavaMail is
>>>>> used for mail box related operations like fetching mails, filtering
>>>>> them etc.
>>>>> The basic configuration for one mail box is as below:
>>>>> {code:xml}
>>>>> <document>
>>>>>   <entity processor="MailEntityProcessor" user="someb...@gmail.com"
>>>>>           password="something" host="imap.gmail.com"
>>>>>           protocol="imaps"/>
>>>>> </document>
>>>>> {code}
>>>>> Below is the list of all available configuration options:
>>>>> {color:green}Required{color}
>>>>> ---------
>>>>> *user*
>>>>> *pwd*
>>>>> *protocol* (only "imaps" supported now)
>>>>> *host*
>>>>> {color:green}Optional{color}
>>>>> ---------
>>>>> *folders* - comma separated list of folders.
>>>>> If not specified, default folder is used. Nested folders can be
>>>>> specified like a/b/c
>>>>> *recurse* - index subfolders. Defaults to true.
>>>>> *exclude* - comma separated list of patterns.
>>>>> *include* - comma separated list of patterns.
>>>>> *batchSize* - mails to fetch at once in a given folder.
>>>>> Only headers can be prefetched in JavaMail IMAP.
>>>>> *readTimeout* - defaults to 60000ms
>>>>> *conectTimeout* - defaults to 30000ms
>>>>> *fetchSize* - IMAP config. 32KB default
>>>>> *fetchMailsSince* -
>>>>> date/time in milliseconds; mails received after it will be fetched.
>>>>> Useful for delta import.
>>>>> *customFilter* - class name.
>>>>> {code}
>>>>> import javax.mail.Folder;
>>>>> import javax.mail.search.SearchTerm;
>>>>> // "clz" stands for the configured class name
>>>>> clz implements MailEntityProcessor.CustomFilter {
>>>>>   public SearchTerm getCustomSearch(Folder folder);
>>>>> }
>>>>> {code}
>>>>> *processAttachement* - defaults to true
>>>>> Below are the indexed fields.
>>>>> {code}
>>>>> // Fields To Index
>>>>> // single valued
>>>>> private static final String SUBJECT = "subject";
>>>>> private static final String FROM = "from";
>>>>> private static final String SENT_DATE = "sentDate";
>>>>> private static final String XMAILER = "xMailer";
>>>>> // multi valued
>>>>> private static final String TO_CC_BCC = "allTo";
>>>>> private static final String FLAGS = "flags";
>>>>> private static final String CONTENT = "content";
>>>>> private static final String ATTACHMENT = "attachement";
>>>>> private static final String ATTACHMENT_NAMES = "attachementNames";
>>>>> // flag values
>>>>> private static final String FLAG_ANSWERED = "answered";
>>>>> private static final String FLAG_DELETED = "deleted";
>>>>> private static final String FLAG_DRAFT = "draft";
>>>>> private static final String FLAG_FLAGGED = "flagged";
>>>>> private static final String FLAG_RECENT = "recent";
>>>>> private static final String FLAG_SEEN = "seen";
>>>>> {code}
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> -
>>>> You can reply to this email to add a comment to the issue online.
>>>>
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>
--
--Noble Paul
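
PS: On the point about not indexing the same email from several users'
mailboxes ("possibly via a hash"), here is a minimal sketch of one way it
might be done. Nothing below is part of the SOLR-934 patch. The class
MailDedup and its methods are hypothetical, and it assumes the Message-ID
header (when present) or the from/sentDate/subject values are a good enough
identity for a message. The resulting key could, for instance, serve as the
document's unique key so that a second copy overwrites the first instead of
being added again.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the SOLR-934 patch.
final class MailDedup {
    // Keys of the messages already seen in this run.
    private final Set<String> seen = new HashSet<>();

    // Builds a SHA-256 hex key from the Message-ID when one is present,
    // otherwise from the normalized from/sentDate/subject values.
    static String key(String messageId, String from, String sentDate, String subject) {
        String basis = (messageId != null && !messageId.trim().isEmpty())
                ? messageId.trim()
                : normalize(from) + "\n" + normalize(sentDate) + "\n" + normalize(subject);
        return sha256Hex(basis);
    }

    // Trims and lower-cases a header value; null becomes the empty string.
    private static String normalize(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
    }

    private static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns true the first time a key is offered, false for any repeat.
    boolean firstSighting(String key) {
        return seen.add(key);
    }
}
```

An importer could call MailDedup.key(...) for each fetched mail and skip the
mail when firstSighting(key) returns false. The fallback on
from/sentDate/subject is only a guess at a reasonable identity and can
collide for distinct mails that share all three values.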