[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-04-14 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12698676#action_12698676
 ] 

Shalin Shekhar Mangar commented on SOLR-934:


Committed revision 764691.

 Enable importing of mails into a solr index through DIH.
 

 Key: SOLR-934
 URL: https://issues.apache.org/jira/browse/SOLR-934
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Affects Versions: 1.4
Reporter: Preetam Rao
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: SOLR-934.patch, SOLR-934.patch, SOLR-934.patch, 
 SOLR-934.patch, SOLR-934.patch, SOLR-934.patch, SOLR-934.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Enable importing of mails into solr through DIH. Take one or more mailbox 
 credentials, download and index their content along with the content from 
 attachments. The folders to fetch can be made configurable based on various 
 criteria. Apache Tika is used for extracting content from different kinds of 
 attachments. JavaMail is used for mail box related operations like fetching 
 mails, filtering them etc.
 The basic configuration for one mail box is as below:
 {code:xml}
 <document>
   <entity processor="MailEntityProcessor" user="someb...@gmail.com"
           password="something" host="imap.gmail.com" protocol="imaps"/>
 </document>
 {code}
 The full list of available configuration options is below:
 {color:green}Required{color}
 *user*
 *pwd*
 *protocol* (only imaps is supported for now)
 *host*
 {color:green}Optional{color}
 *folders* - comma separated list of folders.
 If not specified, the default folder is used. Nested folders can be specified
 like a/b/c
 *recurse* - index subfolders. Defaults to true.
 *exclude* - comma separated list of patterns.
 *include* - comma separated list of patterns.
 *batchSize* - mails to fetch at once in a given folder.
 Only headers can be prefetched in JavaMail IMAP.
 *readTimeout* - defaults to 60000ms
 *connectTimeout* - defaults to 30000ms
 *fetchSize* - IMAP config. 32KB default
 *fetchMailsSince* -
 date/time in yyyy-MM-dd HH:mm:ss format; mails received after it will be
 fetched. Useful for delta import.
 *customFilter* - class name.
 {code}
 import javax.mail.Folder;
 import javax.mail.search.SearchTerm;

 // The class named by customFilter must implement this interface,
 // which is nested in MailEntityProcessor.
 public interface CustomFilter {
   SearchTerm getCustomSearch(Folder folder);
 }
 {code}
 *processAttachement* - defaults to true
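As an illustration of the fetchMailsSince option in the list above, here is a minimal sketch of how such a value could be parsed and compared during a delta import. It assumes the pattern is yyyy-MM-dd HH:mm:ss. The class FetchMailsSince and its methods are hypothetical names made up for this example and are not taken from the patch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the SOLR-934 patch.
final class FetchMailsSince {
    // Assumed date pattern for the fetchMailsSince option.
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Parses a fetchMailsSince value. Strict (non-lenient) parsing rejects
    // impossible dates such as month 13 instead of silently rolling them over.
    static Date parse(String value) {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        format.setLenient(false);
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Expected " + PATTERN + " but got: " + value, e);
        }
    }

    // True when a mail received at 'received' is newer than the cut-off.
    static boolean isNewer(Date received, Date since) {
        return received.after(since);
    }

    public static void main(String[] args) {
        Date since = parse("2009-04-07 00:00:00");
        Date received = parse("2009-04-08 02:44:00");
        System.out.println(isNewer(received, since)); // prints true
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe; a real processor might parse the option once during initialization instead.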
 The below are the indexed fields.
 {code}
   // Fields To Index
   // single valued
   private static final String SUBJECT = "subject";
   private static final String FROM = "from";
   private static final String SENT_DATE = "sentDate";
   private static final String XMAILER = "xMailer";
   // multi valued
   private static final String TO_CC_BCC = "allTo";
   private static final String FLAGS = "flags";
   private static final String CONTENT = "content";
   private static final String ATTACHMENT = "attachement";
   private static final String ATTACHMENT_NAMES = "attachementNames";
   // flag values
   private static final String FLAG_ANSWERED = "answered";
   private static final String FLAG_DELETED = "deleted";
   private static final String FLAG_DRAFT = "draft";
   private static final String FLAG_FLAGGED = "flagged";
   private static final String FLAG_RECENT = "recent";
   private static final String FLAG_SEEN = "seen";
 {code}
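To make the single valued / multi valued split in the field list above concrete, here is a small sketch of what one row for a mail might look like. DIH entity processors hand rows to the importer as Map<String, Object>; everything else here (the class MailRowSketch, the method name, the sample values) is an assumption for illustration and does not come from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real MailEntityProcessor builds its rows internally.
final class MailRowSketch {
    // Builds one row using the field names listed in the issue description.
    static Map<String, Object> row(String subject, String from,
                                   List<String> recipients, List<String> flags) {
        Map<String, Object> row = new LinkedHashMap<String, Object>();
        // single valued fields hold one value
        row.put("subject", subject);
        row.put("from", from);
        // multi valued fields hold a list of values
        row.put("allTo", new ArrayList<String>(recipients));
        row.put("flags", new ArrayList<String>(flags));
        return row;
    }

    public static void main(String[] args) {
        List<String> to = new ArrayList<String>();
        to.add("first@example.org");
        to.add("second@example.org");
        List<String> flags = new ArrayList<String>();
        flags.add("seen");
        System.out.println(row("Hello", "sender@example.org", to, flags).keySet());
    }
}
```

The matching Solr schema fields for allTo and flags would need to be declared multiValued for such a row to index cleanly.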

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-04-08 Thread jm
I work on an email archiving product and I have yet to see a server
that either sets no Message-ID or generates non-unique ones (the only
instance I saw was exim sending the same Message-ID as the original
for the non-delivered notification, but that was like 6 years ago), so
in practice I would consider them unique, just my opinion.

Anyways Hoss is right, not mandatory by the RFC, good to know lol

On Wed, Apr 8, 2009 at 2:44 AM, Ryan McKinley (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696849#action_12696849
  ]

 Ryan McKinley commented on SOLR-934:
 

 bq. FWIW: Message-ID while common is not mandatory (see sec3.6 and sec3.6.4 
 of RFCs #2822 and #5322)

 In practice you cannot rely on the Message-ID to be unique.  Most 
 modern mail servers do a good job making sure each value is unique, but some 
 old MS mail servers sent the same message ID for *every* message!





[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-04-07 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696842#action_12696842
 ] 

Hoss Man commented on SOLR-934:

bq. One question, what is the uniqueKey that we should use when indexing 
emails? I couldn't figure out so I removed the uniqueKey from my schema to try 
this out.

FWIW: Message-ID while common is not mandatory (see sec3.6 and sec3.6.4 of 
RFCs #2822 and #5322)





[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-04-07 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696849#action_12696849
 ] 

Ryan McKinley commented on SOLR-934:


bq. FWIW: Message-ID while common is not mandatory (see sec3.6 and sec3.6.4 
of RFCs #2822 and #5322)

In practice you cannot rely on the Message-ID to be unique.  Most modern 
mail servers do a good job making sure each value is unique, but some old MS 
mail servers sent the same message ID for *every* message!  




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-04-07 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696877#action_12696877
 ] 

Noble Paul commented on SOLR-934:

bq.One question, what is the uniqueKey that we should use when indexing emails?

The Message-ID can be emitted by the EntityProcessor; it can be left to the 
discretion of the user whether to use it as the uniqueKey or not.
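
Since the Message-ID may be missing or repeated, one option a user could consider is to derive the uniqueKey from several headers instead of the Message-ID alone. The sketch below is only an illustration of that idea and is not what the patch does; the class MailKey, its method, and the choice of headers are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; shows one way to build a uniqueKey outside the patch.
final class MailKey {
    // Hashes Message-ID together with other headers, so a missing or
    // repeated Message-ID does not by itself make two mails collide.
    static String uniqueKey(String messageId, String from, String sentDate, String subject) {
        String joined = nullToEmpty(messageId) + '\n' + nullToEmpty(from) + '\n'
                + nullToEmpty(sentDate) + '\n' + nullToEmpty(subject);
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    static String nullToEmpty(String s) {
        return s == null ? "" : s;
    }

    public static void main(String[] args) {
        System.out.println(uniqueKey(null, "sender@example.org", "2009-04-07 10:00:00", "Hello"));
    }
}
```

Two mails that share every hashed header would still collide, so this only narrows the problem; whether that trade-off is acceptable depends on the mailbox.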




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-28 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668153#action_12668153
 ] 

Shalin Shekhar Mangar commented on SOLR-934:


MailEntityProcessor and its dependencies must be kept in one place -- either in 
WEB-INF/lib or $solr_home/lib. We can't keep just the MailEntityProcessor in 
the war because it won't be able to load the dependencies from $solr_home/lib 
(due to the classloader being different) and asking the user to drop the 
dependencies into WEB-INF/lib does not sound good. It is impractical to keep all 
these dependencies in the solr war itself because most users will not need this 
functionality.

I guess this needs to go into a separate contrib area. Thoughts?

PS: a contrib for a contrib, cool! :)




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-28 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668336#action_12668336
 ] 

Noble Paul commented on SOLR-934:

How about a new contrib called 'dih-ext'? All future DIH enhancements that 
require external dependencies (like a TikaEntityProcessor) could go there.




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Preetam Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660272#action_12660272
 ] 

Preetam Rao commented on SOLR-934:

Regarding the comma separated list of patterns:

Folder names usually won't contain commas.
The only regex construct that contains a comma is the one limiting the number 
of occurrences, like {M,N}, which does not seem very useful for restricting
folder names either.

Can we leave it as it is till the need arises? If not, what would be a good 
escape character or replacement for comma?
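
The trade-off discussed above can be sketched in a few lines: splitting the option on commas works for ordinary folder patterns, but breaks a pattern that uses the {M,N} quantifier. This is an illustration under assumptions, not the patch's code; the class FolderPatterns and the use of a full match (matches() rather than find()) are choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper showing a naive comma split of include/exclude values.
final class FolderPatterns {
    // Splits the option on commas and compiles each piece as a regex.
    static List<Pattern> compile(String option) {
        List<Pattern> patterns = new ArrayList<Pattern>();
        for (String part : option.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // True when the whole folder name matches at least one pattern.
    static boolean matchesAny(List<Pattern> patterns, String folderName) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(folderName).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> exclude = compile("Sent.*, Drafts");
        System.out.println(matchesAny(exclude, "Sent Items")); // prints true
        try {
            // The split cuts the quantifier in two: "archive-\d{2" and "4}".
            compile("archive-\\d{2,4}");
        } catch (PatternSyntaxException e) {
            System.out.println("quantifier with comma is not supported by a plain split");
        }
    }
}
```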




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660274#action_12660274
 ] 

Noble Paul commented on SOLR-934:

This is a trivial thing. The other suggestions are the really important ones.

 Enable importing of mails into a solr index through DIH.
 

 Key: SOLR-934
 URL: https://issues.apache.org/jira/browse/SOLR-934
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Affects Versions: 1.4
Reporter: Preetam Rao
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: SOLR-934.patch, SOLR-934.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Enable importing of mails into solr through DIH. Take one or more mailbox 
 credentials, download and index their content along with the content from 
 attachments. The folders to fetch can be made configurable based on various 
 criteria. Apache Tika is used for extracting content from different kinds of 
 attachments. JavaMail is used for mail box related operations like fetching 
 mails, filtering them etc.
 The basic configuration for one mailbox is as below:
 {code:xml}
 <document>
   <entity processor="MailEntityProcessor" user="someb...@gmail.com"
           password="something" host="imap.gmail.com" protocol="imaps"/>
 </document>
 {code}
 Below is the list of all available configuration options:
 {color:green}Required{color}
 -
 *user* 
 *pwd* 
 *protocol*  (only imaps supported now)
 *host* 
 {color:green}Optional{color}
 -
 *folders* - comma separated list of folders. 
 If not specified, the default folder is used. Nested folders can be specified 
 like a/b/c
 *recurse* - index subfolders. Defaults to true.
 *exclude* - comma separated list of patterns. 
 *include* - comma separated list of patterns.
 *batchSize* - mails to fetch at once in a given folder. 
 Only headers can be prefetched in Javamail IMAP.
 *readTimeout* - defaults to 6ms
 *conectTimeout* - defaults to 3ms
 *fetchSize* - IMAP config. 32KB default
 *fetchMailsSince* - date/time in milliseconds; mails received after this
 time will be fetched. Useful for delta import.
 *customFilter* - class name.  
 {code}
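 import javax.mail.Folder;
 import javax.mail.search.SearchTerm;
 // "clz" stands for the class named in customFilter
 public class clz implements MailEntityProcessor.CustomFilter {
   public SearchTerm getCustomSearch(Folder folder) {
     return null; // placeholder: build the SearchTerm for this folder here
   }
 }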
 {code}
 *processAttachement* - defaults to true
 The indexed fields are listed below.
 {code}
   // Fields To Index
   // single valued
   private static final String SUBJECT = "subject";
   private static final String FROM = "from";
   private static final String SENT_DATE = "sentDate";
   private static final String XMAILER = "xMailer";
   // multi valued
   private static final String TO_CC_BCC = "allTo";
   private static final String FLAGS = "flags";
   private static final String CONTENT = "content";
   private static final String ATTACHMENT = "attachement";
   private static final String ATTACHMENT_NAMES = "attachementNames";
   // flag values
   private static final String FLAG_ANSWERED = "answered";
   private static final String FLAG_DELETED = "deleted";
   private static final String FLAG_DRAFT = "draft";
   private static final String FLAG_FLAGGED = "flagged";
   private static final String FLAG_RECENT = "recent";
   private static final String FLAG_SEEN = "seen";
 {code}
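
The *include* and *exclude* options above are described only as comma separated lists of patterns. As a rough illustration of how such lists could be applied to folder names, here is a minimal sketch. The class name FolderFilter, the use of java.util.regex, and the rule that exclude wins over include are all assumptions made for this example; none of them is taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the SOLR-934 patch.
final class FolderFilter {
    private final List<Pattern> include;
    private final List<Pattern> exclude;

    FolderFilter(String includeCsv, String excludeCsv) {
        this.include = compile(includeCsv);
        this.exclude = compile(excludeCsv);
    }

    // Split a comma separated attribute value into compiled regex patterns.
    private static List<Pattern> compile(String csv) {
        List<Pattern> patterns = new ArrayList<>();
        if (csv == null || csv.trim().isEmpty()) {
            return patterns;
        }
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // Assumption: exclude wins over include, and an empty include list accepts everything.
    boolean accepts(String folderName) {
        for (Pattern p : exclude) {
            if (p.matcher(folderName).matches()) {
                return false;
            }
        }
        if (include.isEmpty()) {
            return true;
        }
        for (Pattern p : include) {
            if (p.matcher(folderName).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Note that a plain split on "," cannot express a regex that itself contains a comma, so a real implementation would need an escaping rule or a single-pattern restriction.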




Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Ben Johnson

Thanks Paul and Preetam.  A couple of further things:

- How do you envisage this functionality being used?  I can see indexing all 
emails for all users as part of a one-off system setup/migration process, 
but also as a core feature to ensure all emails received by a 
company/organisation are indexed (and stored).  This could be done either by 
the end-user, who controls what should be indexed (i.e. certain work-related 
emails only) or directly from the mail server, where all emails would be 
indexed (including personal emails, which could later be deleted from the 
index if desired) to ensure no important emails get missed.  Is this the 
sort of thing you had in mind?  There is also the issue of not 
indexing/storing the same email from multiple users' mailboxes (haven't 
worked that one out yet, possibly via a hash).


- Is the mailbox 'configuration' (entity tag) stored in data-config.xml on 
the Solr server?  If so, this would seem to have quite a lot of 
administrative overhead - how do you manage a system with 5000+ users?  How 
are the accounts/passwords maintained?  Are the passwords stored in plain 
text?


- Minor typo: *conectTimeout* should be *connectTimeout*

- A few real-world scenarios I've encountered are:
   - be able to handle an email sent to over 5000 recipients (in the 'To:'
     field)
   - be able to handle an email with a 'long' subject line (240+ characters)
   - be able to handle an email with 100 attachments
   - be able to handle an email with attachments with 'long' names (240+
     characters)

This caused several problems in the software I was using at the time (a 
proprietary system, not Solr-based), either memory-related issues or file 
system errors when running on Windows where the file system or its API 
limited file names to 255 characters, including the path.


Thanks very much!
Ben
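
The idea above of avoiding duplicate copies of one email across several mailboxes "possibly via a hash" could look roughly like the sketch below. It is only an illustration: the class name MailDedupKey, the choice of headers (Message-ID, from, sent date, subject) and the use of SHA-256 are assumptions, not something the patch or this thread specifies.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: derives a stable key that could serve as the document's unique id.
final class MailDedupKey {

    // Assumption: these four header values are enough to identify "the same email".
    static String of(String messageId, String from, String sentDate, String subject) {
        String canonical = String.join("\n",
                normalize(messageId), normalize(from), normalize(sentDate), normalize(subject));
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(canonical.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Trim and lower-case so that cosmetic differences do not change the key.
    private static String normalize(String value) {
        return value == null ? "" : value.trim().toLowerCase(Locale.ROOT);
    }
}
```

If such a key were used as the unique key field of the index, a second copy of the same mail fetched from another user's mailbox would overwrite the first instead of adding a new document.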

--
From: Noble Paul നോബിള്‍ नोब्ळ् noble.p...@gmail.com
Sent: Friday, January 02, 2009 5:02 AM
To: solr-dev@lucene.apache.org
Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a 
solr index through DIH.



Hi Ben,
You can take a look at the wiki page for DIH
http://wiki.apache.org/solr/DataImportHandler

It helps you index mostly structured data into Solr from db, xml, etc.
It can be considered an ETL tool
(http://en.wikipedia.org/wiki/Extract,_transform,_load ) for Solr.

Adding mail support means you can index your emails into Solr with a
few lines of configuration
--Noble

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson
ben.john...@jandpconsulting.co.uk wrote:
I'm watching this issue with interest, but I'm having trouble understanding
the bigger picture.  I am prototyping a system that uses Restlet to store
and index objects (mainly MS Office and OpenOffice documents and emails), so
I am planning to use Solr with Tika to index the objects.

I know nothing about DIH (Distributed Index Handler?), so I'm not sure what
role it plays with Solr.  Is it a vendor-specific technology (from
Autonomy)?  What does it do?  Do you give it objects to index and it handles
them by passing it to one or more Solr/Tika indexing servers?  And are you
thinking that this would therefore be a good place to not only index the
objects, but also pass the information about the digital content to DROID?

Reading a bit about DROID (from TNA, The National Archives), it seems like
it is used to capture information about the digital content of objects
stored in a content repository.  How does this fit with Solr?  I thought
Solr with Tika just did the indexing of text-based objects, but the actual
storage of the objects would be elsewhere (probably in the file system).
From what I can tell, DROID would operate on the file system objects, not
the indexing information.  Have I got this right?

Ideally, I would also like to convert any suitable content into PDF/A format
for long-term archival - probably not relevant to this issue, but I thought
I'd mention it in case you see an application of this as part of email and
attachment storage.

Sorry for all the questions, but hopefully someone could clarify this for
me!

Thanks very much
Ben Johnson

--
From: Grant Ingersoll (JIRA) j...@apache.org
Sent: Thursday, January 01, 2009 7:07 PM
To: solr-dev@lucene.apache.org
Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr
index through DIH.



  [
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660210#action_12660210
]

Grant Ingersoll commented on SOLR-934:
--

Would it make more sense for DIH to farm out its content acquisition to a
library like Droids?  Then, we could have real crawling, etc. all through a
pluggable connector framework.


Enable importing of mails into a solr index through DIH

Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson
ben.john...@jandpconsulting.co.uk wrote:
 Thanks Paul and Preetam.  A couple of further things:

 - How do you envisage this functionality being used?  I can see indexing all
 emails for all users as part of a one-off system setup/migration process,
 but also as a core feature to ensure all emails received by a
 company/organisation are indexed (and stored).  This could be done either by
 the end-user, who controls what should be indexed (i.e. certain work-related
 emails only) or directly from the mail server, where all emails would be
 indexed (including personal emails, which could later be deleted from the
 index if desired) to ensure no important emails get missed.  Is this the
 sort of thing you had in mind?  There is also the issue of not
 indexing/storing the same email from multiple users' mailboxes (haven't
 worked that one out yet, possibly via a hash).

 - Is the mailbox 'configuration' (entity tag) stored in data-config.xml on
 the Solr server?  If so, this would seem to have quite a lot of
Do you want all users' mails to be indexed into a single index? That is
possible by passing the username and password as request parameters.


 administrative overhead - how do you manage a system with 5000+ users?  How
 are the accounts/passwords maintained?  Are the passwords stored in plain
 text?

 - Minor typo: *conectTimeout* should be *connectTimeout*

 - A few real-world scenarios I've encountered are:
   - be able to handle an email sent to over 5000 recipients (in the 'To:'
 field)
   - be able to handle an email with a 'long' subject line (240+ characters)
   - be able to handle an email with 100 attachments
   - be able to handle an email with attachments with 'long' names (240+
 characters)

 This caused several problems in the software I was using at the time (a
 proprietary system, not Solr-based), either memory-related issues or file
 system errors when running on Windows where the file system or its API
 limited file names to 255 characters, including the path.

 Thanks very much!
 Ben
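
The reply above suggests passing the username and password as request parameters. A minimal sketch of building such an import request follows. The base URL and the parameter names "user" and "password" are assumptions for illustration; only command=full-import is a standard DIH command. DIH documents `${dataimporter.request.<name>}` variables for reading request parameters in data-config.xml, but whether the mail entity resolves them for its attributes is not confirmed in this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a DIH import URL carrying mailbox credentials.
final class DihImportUrl {

    // Assumption: the handler is reachable at baseUrl and reads "user" and "password".
    static String fullImport(String baseUrl, String user, String password) {
        return baseUrl + "?command=full-import"
                + "&user=" + encode(user)
                + "&password=" + encode(password);
    }

    // URL-encode a parameter value so characters like '@' and '&' survive transport.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Credentials sent this way can end up in request logs, which bears on the plain-text password question raised in the quoted mail.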


Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Ben Johnson

Hi Paul

Yes, I was thinking that emails for all users would be indexed into a single 
index, at least conceptually.  I'm thinking of a corporate/organisational 
repository that any user could search for relevant information, be that 
email or some other kind of document (e.g. MS Office, OpenOffice, PDF, 
etc...).  An example usage would be for government organisations in the 
United Kingdom that need to respond to Freedom of Information (FOI) requests 
and are therefore required by law to produce all information regarding a 
particular subject if requested (sensitive information excluded).


I haven't looked into architectural options for the indexes - I don't know 
if it is possible/desirable to split indexes up and use some sort of 
federated search to produce results, but at least conceptually I was 
thinking of a single source for the indexing information.


Regards
Ben


Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Otis Gospodnetic
Quick clarifications:

- Droids: http://incubator.apache.org/droids/index.html
- DIH: http://wiki.apache.org/solr/DataImportHandler
- Solr + Tika: http://wiki.apache.org/solr/ExtractingRequestHandler


Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Jan 2, 2009 at 6:24 PM, Ben Johnson
ben.john...@jandpconsulting.co.uk wrote:
 Hi Paul

 Yes, I was thinking that emails for all users would be indexed into a single
 index, at least conceptually.  I'm thinking of a corporate/organisational
 repository that any user could search for relevant information, be that
 email or some other kind of document (e.g. MS Office, OpenOffice, PDF,
 etc...).  An example usage would be for government organisations in the
 United Kingdom that need to respond to Freedom of Information (FOI) requests
 and are therefore required by law to produce all information regarding a
 particular subject if requested (sensitive information excluded).

 I haven't looked into architectural options for the indexes - I don't know
 if it is possible/desirable to split indexes up and use some sort of
 federated search to produce results, but at least conceptually I was
 thinking of a single source for the indexing information.

A very simple solution is to keep all users in a single index and use
'fq' to limit the search to that user.
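
As a rough sketch of the 'fq' suggestion above: the search request carries the user's query in q and a filter query restricting results to one mailbox owner. The field name "user" is hypothetical (the indexed fields listed in the issue do not include an owner field), so this assumes such a field is added at index time.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds the query string for a per-user search.
final class UserScopedQuery {

    // Assumption: each mail document stores its mailbox owner in a field named "user".
    static String build(String userQuery, String mailboxOwner) {
        String escaped = mailboxOwner.replace("\\", "\\\\").replace("\"", "\\\"");
        String fq = "user:\"" + escaped + "\"";
        return "q=" + encode(userQuery) + "&fq=" + encode(fq);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Because the restriction lives in fq rather than in q, it does not affect relevance scoring and Solr can cache the filter per user.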


 Regards
 Ben


[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-01 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660208#action_12660208
 ] 

Noble Paul commented on SOLR-934:
-

Looks good. A few observations:
* init() must call super.init()
* Right before returning from nextRow(), call super.applyTransformer(row)
* Returning null signals end of rows. Close any connections or do cleanup
* 'exclude' and 'include' should either allow escaping the comma (between 
multiple regexes) or just take a single regex for the time being
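
The first three observations describe a lifecycle contract. The sketch below illustrates it with a stand-in base class so that it compiles on its own. SketchProcessorBase is NOT DIH's real base class and its method signatures are simplified assumptions; only the method names init, nextRow and applyTransformer come from the comment above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the DIH base class; signatures are simplified assumptions.
abstract class SketchProcessorBase {
    protected boolean initialized;

    public void init() {
        initialized = true;
    }

    // Placeholder for the transformer chain: here it just marks the row.
    protected Map<String, Object> applyTransformer(Map<String, Object> row) {
        row.put("transformed", Boolean.TRUE);
        return row;
    }

    public abstract Map<String, Object> nextRow();
}

// Hypothetical processor that emits one row per subject line.
final class SketchMailProcessor extends SketchProcessorBase {
    private final Deque<String> subjects = new ArrayDeque<>();
    boolean closed;

    SketchMailProcessor(List<String> subjectsToEmit) {
        subjects.addAll(subjectsToEmit);
    }

    @Override
    public void init() {
        super.init(); // observation 1: always chain to the base class
        closed = false;
    }

    @Override
    public Map<String, Object> nextRow() {
        String subject = subjects.poll();
        if (subject == null) {
            closed = true; // observation 3: clean up, then signal end of rows
            return null;
        }
        Map<String, Object> row = new HashMap<>();
        row.put("subject", subject);
        return super.applyTransformer(row); // observation 2
    }
}
```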




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660210#action_12660210
 ] 

Grant Ingersoll commented on SOLR-934:
--

Would it make more sense for DIH to farm out its content acquisition to a 
library like Droids?  Then, we could have real crawling, etc. all through a 
pluggable connector framework.

 The below are the indexed fields.
 {code}
   // Fields To Index
   // single valued
   private static final String SUBJECT = subject;
   private static final String FROM = from;
   private static final String SENT_DATE = sentDate;
   private static final String XMAILER = xMailer;
   // multi valued
   private static final String TO_CC_BCC = allTo;
   private static final String FLAGS = flags;
   private static final String CONTENT = content;
   private static final String ATTACHMENT = attachement;
   private static final String ATTACHMENT_NAMES = attachementNames;
   // flag values
   private static final String FLAG_ANSWERED = answered;
   private static final String FLAG_DELETED = deleted;
   private static final String FLAG_DRAFT = draft;
   private static final String FLAG_FLAGGED = flagged;
   private static final String FLAG_RECENT = recent;
   private static final String FLAG_SEEN = seen;
 {code}




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-01 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660214#action_12660214
 ] 

Noble Paul commented on SOLR-934:
-

bq. Would it make more sense for DIH to farm out its content acquisition to a 
library like Droids

It would be great. It should be possible to have a DroidEntityProcessor one 
day.  






Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-01 Thread Preetam Rao
Hi Ben,
DIH stands for Data Import Handler. Its main aim is to provide a way of
indexing data into Solr from different kinds of sources, mainly databases,
REST APIs, files, etc. This issue deals with adding one more data source
(which is handled by something called an EntityProcessor in DIH lingo): an
IMAP mailbox. Tika is used in this case for indexing attachments, which can
be of any MIME type.

I don't know much about Droid, but a quick read suggests that it's for
getting the metadata from digital content, similar to what Tika does for
various MIME types. One integration of Solr and Droid would be searching a
digital library: we use Droid to get the metadata on the content and store
that data in Solr for searching. The real digital content would still live
somewhere else, and the Solr documents would hold a pointer to it.

Regarding storage, Lucene can store anything that a user supplies as
key-value pairs, so one can store the extracted content itself. But since
that is not going to help much, usually one would just index the content,
keep the index in Solr, and have the indexed documents contain a pointer to
the real document.

Hope this helps.

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson 
ben.john...@jandpconsulting.co.uk wrote:

 I'm watching this issue with interest, but I'm having trouble understanding
 the bigger picture.  I am prototyping a system that uses Restlet to store
 and index objects (mainly MS Office and OpenOffice documents and emails), so
 I am planning to use Solr with Tika to index the objects.

 I know nothing about DIH (Distributed Index Handler?), so I'm not sure what
 role it plays with Solr.  Is it a vendor-specific technology (from
 Autonomy)?  What does it do?  Do you give it objects to index and it handles
 them by passing it to one or more Solr/Tika indexing servers?  And are you
 thinking that this would therefore be a good place to not only index the
 objects, but also pass the information about the digital content to DROID?

 Reading a bit about DROID (from TNA, The National Archives), it seems like
 it is used to capture information about the digital content of objects
 stored in a content repository.  How does this fit with Solr?  I thought
 Solr with Tika just did the indexing of text-based objects, but the actual
 storage of the objects would be elsewhere (probably in the file system).
 From what I can tell, DROID would operate on the file system objects, not
 the indexing information.  Have I got this right?

 Ideally, I would also like to convert any suitable content into PDF/A
 format for long-term archival - probably not relevant to this issue, but I
 thought I'd mention it in case you see an application of this as part of
 email and attachment storage.

 Sorry for all the questions, but hopefully someone could clarify this for
 me!

 Thanks very much
 Ben Johnson

 --
 From: Grant Ingersoll (JIRA) j...@apache.org
 Sent: Thursday, January 01, 2009 7:07 PM
 To: solr-dev@lucene.apache.org
 Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr
 index through DIH.



   [
 https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660210#action_12660210]

 Grant Ingersoll commented on SOLR-934:
 --

 Would it make more sense for DIH to farm out its content acquisition to a
 library like Droids?  Then, we could have real crawling, etc. all through a
 pluggable connector framework.


Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2009-01-01 Thread Noble Paul നോബിള്‍ नोब्ळ्
Hi Ben,
You can take a look at the wiki page for DIH:
http://wiki.apache.org/solr/DataImportHandler

It helps you index mostly structured data into Solr from a DB, XML, etc.
It can be considered an ETL tool
(http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.

Adding mail support means you can index your emails into Solr with a
few lines of configuration.
--Noble



[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2008-12-23 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659046#action_12659046
 ] 

Shalin Shekhar Mangar commented on SOLR-934:


Thanks for this Preetam, looks great!

A few suggestions:
# Use the Lucene code style -- you can get a code style for Eclipse/IDEA from
http://wiki.apache.org/solr/HowToContribute
# Let us use the Java variable naming convention for the fields, e.g. sent_date
becomes sentDate
# I don't think we need sent_date_display; people can always format the
date and display it as they want
# All the attributes for the entity processor should be templatized, e.g.
user=${dataimporter.request.user} and so on. You'd need to use
context.getVariableResolver().replaceTokens(attr)
# The Profile class looks unnecessary. The values can be stored directly as
private variables
# Attachment names can be another multi-valued field
# Exceptions while connecting must be propagated so that users know why the
connection is failing.
# For delta imports, we can just provide an olderThan and newerThan syntax. That
should be enough
# Streaming is recommended instead of calling folder.getMessages(). We can use
getMessages(int start, int end), and the batchSize can be a configurable
parameter with some sane default.
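
To illustrate what "templatized" attributes mean in the list above, here is a
rough sketch of ${...} token replacement. It is a plain-Java stand-in written
for this comment, not the DIH variable resolver itself: the class name
TokenReplacer is hypothetical, and resolving unknown names to an empty string
is an assumption of the sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a variable resolver; not the real Solr class. */
final class TokenReplacer {
    // Matches ${some.dotted.name} and captures the name.
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    private TokenReplacer() {}

    /** Replaces each ${name} with its value from vars; unknown names become "". */
    static String replaceTokens(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in values from being misread.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a map that binds dataimporter.request.user to a mailbox name, an
attribute written as user=${dataimporter.request.user} would resolve to that
name at import time, so the same configuration can serve several mailboxes.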

Support for recursive folders will be awesome.
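
The streaming suggestion above could be sketched roughly as below. JavaMail's
Folder.getMessages(int start, int end) takes 1-based, inclusive message
numbers, so the helper only computes those ranges; the class name
MessageBatches is hypothetical, and the actual fetch is left as a comment so
the sketch has no JavaMail dependency.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; computes batch ranges for a folder of messages. */
final class MessageBatches {
    private MessageBatches() {}

    /** Returns {start, end} pairs, 1-based and inclusive, covering 1..messageCount. */
    static List<int[]> ranges(int messageCount, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (long start = 1; start <= messageCount; start += batchSize) {
            // long arithmetic avoids overflow for very large batch sizes.
            long end = Math.min(start + batchSize - 1, (long) messageCount);
            // Each pair would be passed to folder.getMessages(start, end).
            out.add(new int[] {(int) start, (int) end});
        }
        return out;
    }
}
```

For a folder of 10 messages and a batchSize of 4 this gives the ranges 1-4,
5-8 and 9-10, so only one batch of messages has to be held at a time.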

 Enable importing of mails into a solr index through DIH.
 

 Key: SOLR-934
 URL: https://issues.apache.org/jira/browse/SOLR-934
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Affects Versions: 1.4
Reporter: Preetam Rao
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-934.patch

   Original Estimate: 120h
  Remaining Estimate: 120h

 Enable importing of mails into solr through DIH. Take one or more mailbox 
 credentials, download and index their content along with the content from 
 attachments.
 The folders to fetch can be made configurable based on various criteria.
 Apache Tika can be used for extracting content from different kinds of 
 attachments.
 JavaMail can be used for mail box related operations like fetching mails, 
 filtering them etc.
 The basic configuration for one mail box can look something like this:
 <document>
   <entity processor="org.apache.solr.handler.dataimport.MailEntityProcessor"
           user="someb...@gmail.com"
           password="something"
           host="imap.gmail.com"
           protocol="imaps"
           folder="test1"/>
 </document>
 - This can be enhanced with timeouts, list to be read from a file, folder 
 filters, delta import etc.




[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

2008-12-23 Thread Preetam Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12659050#action_12659050
 ] 

Preetam Rao commented on SOLR-934:
--

I agree with all the comments... Will incorporate them soon...

 Enable importing of mails into a solr index through DIH.
 

 Key: SOLR-934
 URL: https://issues.apache.org/jira/browse/SOLR-934
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Affects Versions: 1.4
Reporter: Preetam Rao
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: SOLR-934.patch

   Original Estimate: 120h
  Remaining Estimate: 120h

 Enable importing of mails into solr through DIH. Take one or more mailbox 
 credentials, download and index their content along with the content from 
 attachments.
 The folders to fetch can be made configurable based on various criteria.
 Apache Tika can be used for extracting content from different kinds of 
 attachments.
 JavaMail can be used for mail box related operations like fetching mails, 
 filtering them etc.
 The basic configuration for one mail box can look something like this:
 {code:xml}
 <document>
   <entity processor="MailEntityProcessor" user="someb...@gmail.com"
           password="something" host="imap.gmail.com" protocol="imaps"
           folder="test1"/>
 </document>
 {code}
 - This can be enhanced with timeouts, list to be read from a file, folder 
 filters, delta import etc.
