Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Noble Paul നോബിള്‍ नोब्ळ् Fri, 02 Jan 2009 03:21:53 -0800

On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson
<ben.john...@jandpconsulting.co.uk> wrote:
> Thanks Paul and Preetam.  A couple of further things:
>
> - How do you envisage this functionality being used?  I can see indexing all
> emails for all users as part of a one-off system setup/migration process,
> but also as a core feature to ensure all emails received by a
> company/organisation are indexed (and stored).  This could be done either by
> the end-user, who controls what should be indexed (i.e. certain work-related
> emails only) or directly from the mail server, where all emails would be
> indexed (including personal emails, which could later be deleted from the
> index if desired) to ensure no important emails get missed.  Is this the
> sort of thing you had in mind?  There is also the issue of not
> indexing/storing the same email from multiple users' mailboxes (haven't
> worked that one out yet, possibly via a hash).
>
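
On the duplicate-mail point, one way the hash idea could work is sketched
below. It is only an illustration under assumptions: the class name
MailDedupKey and the choice of fields are hypothetical and not part of the
SOLR-934 patch. The key is a SHA-256 digest of the Message-ID header, with
a fallback to from/date/subject when that header is missing. If the key is
used as the Solr uniqueKey, a second copy of the same mail from another
mailbox would overwrite the first document instead of adding a duplicate.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not in the patch): derives a stable document key
// so the same mail seen in several mailboxes maps to one Solr document.
class MailDedupKey {
    static String of(String messageId, String from, String sentDate, String subject) {
        // Prefer the Message-ID header; fall back to a few stable fields.
        String basis = (messageId != null && !messageId.trim().isEmpty())
                ? messageId.trim()
                : from + "\n" + sentDate + "\n" + subject;
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(basis.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex chars per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

Message-IDs are not guaranteed to be present or unique, so the fallback
fields (or a hash over the full body) may be needed in practice.
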
> - Is the mailbox 'configuration' (<entity> tag) stored in data-config.xml on
> the Solr server?  If so, this would seem to have quite a lot of
Do you want all users' mail to be indexed into a single index? That is
possible by passing the username and password as request parameters.
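
A rough sketch of what such a request could look like, under some
assumptions: DIH is registered at /dataimport on a local Solr, and
data-config.xml picks the values up through DIH's request-variable syntax,
e.g. user="${dataimporter.request.mailUser}". The parameter names mailUser
and mailPassword, the host, and the class name MailImportUrl are made up
for illustration; whether the mail entity resolves variables in these
attributes would need to be checked against the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper (not part of SOLR-934): builds a DIH full-import URL
// that carries one mailbox's credentials as request parameters.
class MailImportUrl {
    static String build(String solrBase, String user, String password) {
        try {
            return solrBase + "/dataimport?command=full-import"
                    + "&mailUser=" + URLEncoder.encode(user, "UTF-8")
                    + "&mailPassword=" + URLEncoder.encode(password, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                build("http://localhost:8983/solr", "ben@example.com", "s3cret&more"));
    }
}
```

Credentials sent this way can end up in server access logs, so HTTPS and a
POST request would be the safer choice outside a test setup.
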



> administrative overhead - how do you manage a system with 5000+ users?  How
> are the accounts/passwords maintained?  Are the passwords stored in plain
> text?
>
> - Minor typo: *conectTimeout* should be *connectTimeout*
>
> - A few real-world scenarios I've encountered are:
>   - be able to handle an email sent to over 5000 recipients (in the 'To:'
> field)
>   - be able to handle an email with a 'long' subject line (240+ characters)
>   - be able to handle an email with 100 attachments
>   - be able to handle an email with attachments with 'long' names (240+
> characters)
>
> This caused several problems in the software I was using at the time (a
> proprietary system, not Solr-based), either memory-related issues or file
> system errors when running on Windows where the file system or its API
> limited file names to 255 characters, including the path.
>
> Thanks very much!
> Ben
>
> --------------------------------------------------
> From: "Noble Paul നോബിള്‍ नोब्ळ्" <noble.p...@gmail.com>
> Sent: Friday, January 02, 2009 5:02 AM
> To: <solr-dev@lucene.apache.org>
> Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a
> solr index through DIH.
>
>> Hi Ben,
>> You can take a look at the wiki page for DIH
>> http://wiki.apache.org/solr/DataImportHandler
>>
>> It helps you index mostly structured data into Solr from sources such as
>> databases and XML. It can be considered an ETL tool
>> (http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.
>>
>> Adding mail support means you can index your emails into Solr with a
>> few lines of configuration.
>> --Noble
>>
>> On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
>> <ben.john...@jandpconsulting.co.uk> wrote:
>>>
>>> I'm watching this issue with interest, but I'm having trouble
>>> understanding the bigger picture.  I am prototyping a system that uses
>>> Restlet to store and index objects (mainly MS Office and OpenOffice
>>> documents and emails), so I am planning to use Solr with Tika to index
>>> the objects.
>>>
>>> I know nothing about DIH (Distributed Index Handler?), so I'm not sure
>>> what role it plays with Solr.  Is it a vendor-specific technology (from
>>> Autonomy)?  What does it do?  Do you give it objects to index and it
>>> handles them by passing it to one or more Solr/Tika indexing servers?
>>> And are you thinking that this would therefore be a good place to not
>>> only index the objects, but also pass the information about the digital
>>> content to DROID?
>>>
>>> Reading a bit about DROID (from TNA, The National Archives), it seems
>>> like it is used to capture information about the digital content of
>>> objects stored in a content repository.  How does this fit with Solr?
>>> I thought Solr with Tika just did the indexing of text-based objects,
>>> but the actual storage of the objects would be elsewhere (probably in
>>> the file system).  From what I can tell, DROID would operate on the file
>>> system objects, not the indexing information.  Have I got this right?
>>>
>>> Ideally, I would also like to convert any suitable content into PDF/A
>>> format for long-term archival - probably not relevant to this issue,
>>> but I thought I'd mention it in case you see an application of this as
>>> part of email and attachment storage.
>>>
>>> Sorry for all the questions, but hopefully someone could clarify this
>>> for me!
>>>
>>> Thanks very much
>>> Ben Johnson
>>>
>>> --------------------------------------------------
>>> From: "Grant Ingersoll (JIRA)" <j...@apache.org>
>>> Sent: Thursday, January 01, 2009 7:07 PM
>>> To: <solr-dev@lucene.apache.org>
>>> Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a
>>> solr
>>> index through DIH.
>>>
>>>>
>>>>  [
>>>>
>>>> https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210
>>>> ]
>>>>
>>>> Grant Ingersoll commented on SOLR-934:
>>>> --------------------------------------
>>>>
>>>> Would it make more sense for DIH to farm out its content acquisition
>>>> to a library like Droids?  Then, we could have real crawling, etc. all
>>>> through a pluggable connector framework.
>>>>
>>>>> Enable importing of mails into a solr index through DIH.
>>>>> --------------------------------------------------------
>>>>>
>>>>>               Key: SOLR-934
>>>>>               URL: https://issues.apache.org/jira/browse/SOLR-934
>>>>>           Project: Solr
>>>>>        Issue Type: New Feature
>>>>>        Components: contrib - DataImportHandler
>>>>>  Affects Versions: 1.4
>>>>>          Reporter: Preetam Rao
>>>>>          Assignee: Shalin Shekhar Mangar
>>>>>           Fix For: 1.4
>>>>>
>>>>>       Attachments: SOLR-934.patch, SOLR-934.patch
>>>>>
>>>>>  Original Estimate: 24h
>>>>>  Remaining Estimate: 24h
>>>>>
>>>>> Enable importing of mails into solr through DIH. Take one or more
>>>>> mailbox credentials, download and index their content along with the
>>>>> content from attachments. The folders to fetch can be made
>>>>> configurable based on various criteria. Apache Tika is used for
>>>>> extracting content from different kinds of attachments. JavaMail is
>>>>> used for mail box related operations like fetching mails, filtering
>>>>> them etc.
>>>>> The basic configuration for one mail box is as below:
>>>>> {code:xml}
>>>>> <document>
>>>>>  <entity processor="MailEntityProcessor" user="someb...@gmail.com"
>>>>>               password="something" host="imap.gmail.com" protocol="imaps"/>
>>>>> </document>
>>>>> {code}
>>>>> Below is the list of all available configuration options:
>>>>> {color:green}Required{color}
>>>>> ---------
>>>>> *user*
>>>>> *pwd*
>>>>> *protocol*  (only "imaps" supported now)
>>>>> *host*
>>>>> {color:green}Optional{color}
>>>>> ---------
>>>>> *folders* - comma separated list of folders.
>>>>> If not specified, default folder is used. Nested folders can be
>>>>> specified like a/b/c
>>>>> *recurse* - index subfolders. Defaults to true.
>>>>> *exclude* - comma separated list of patterns.
>>>>> *include* - comma separated list of patterns.
>>>>> *batchSize* - mails to fetch at once in a given folder.
>>>>> Only headers can be prefetched in Javamail IMAP.
>>>>> *readTimeout* - defaults to 60000ms
>>>>> *conectTimeout* - defaults to 30000ms
>>>>> *fetchSize* - IMAP config. 32KB default
>>>>> *fetchMailsSince* -
>>>>> date/time in milliseconds; mails received after it will be fetched.
>>>>> Useful for delta import.
>>>>> *customFilter* - class name.
>>>>> {code}
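>>>>> import javax.mail.Folder;
>>>>> import javax.mail.search.SearchTerm;
>>>>> // MyCustomFilter is a placeholder name for the configured class
>>>>> public class MyCustomFilter implements MailEntityProcessor.CustomFilter {
>>>>>   public SearchTerm getCustomSearch(Folder folder) {
>>>>>     return null; // placeholder: build and return the SearchTerm to apply
>>>>>   }
>>>>> }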
>>>>> {code}
>>>>> *processAttachement* - defaults to true
>>>>> Below are the indexed fields.
>>>>> {code}
>>>>>  // Fields To Index
>>>>>  // single valued
>>>>>  private static final String SUBJECT = "subject";
>>>>>  private static final String FROM = "from";
>>>>>  private static final String SENT_DATE = "sentDate";
>>>>>  private static final String XMAILER = "xMailer";
>>>>>  // multi valued
>>>>>  private static final String TO_CC_BCC = "allTo";
>>>>>  private static final String FLAGS = "flags";
>>>>>  private static final String CONTENT = "content";
>>>>>  private static final String ATTACHMENT = "attachement";
>>>>>  private static final String ATTACHMENT_NAMES = "attachementNames";
>>>>>  // flag values
>>>>>  private static final String FLAG_ANSWERED = "answered";
>>>>>  private static final String FLAG_DELETED = "deleted";
>>>>>  private static final String FLAG_DRAFT = "draft";
>>>>>  private static final String FLAG_FLAGGED = "flagged";
>>>>>  private static final String FLAG_RECENT = "recent";
>>>>>  private static final String FLAG_SEEN = "seen";
>>>>> {code}
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> -
>>>> You can reply to this email to add a comment to the issue online.
>>>>
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul
