[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
[ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660272#action_12660272 ]

Preetam Rao commented on SOLR-934:

Regarding the comma separated list of patterns: folder names usually won't contain commas. The regex construct that does contain a comma is the quantifier for limiting the number of occurrences, {M,N}, which does not seem very useful for restricting folder names. Can we leave it as it is until the need arises? If not, what would be a good escape character or replacement for the comma?

Enable importing of mails into a solr index through DIH.

Key: SOLR-934
URL: https://issues.apache.org/jira/browse/SOLR-934
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Affects Versions: 1.4
Reporter: Preetam Rao
Assignee: Shalin Shekhar Mangar
Fix For: 1.4
Attachments: SOLR-934.patch, SOLR-934.patch
Original Estimate: 24h
Remaining Estimate: 24h

Enable importing of mails into solr through DIH. Take one or more mailbox credentials, download and index their content along with the content from attachments. The folders to fetch can be made configurable based on various criteria. Apache Tika is used for extracting content from different kinds of attachments. JavaMail is used for mailbox related operations like fetching mails, filtering them, etc.

The basic configuration for one mailbox is as below:
{code:xml}
<document>
  <entity processor="MailEntityProcessor" user="someb...@gmail.com" password="something" host="imap.gmail.com" protocol="imaps"/>
</document>
{code}

The below is the list of all configuration available:

{color:green}Required{color}
*user*
*pwd*
*protocol* (only imaps supported now)
*host*

{color:green}Optional{color}
*folders* - comma separated list of folders. If not specified, the default folder is used. Nested folders can be specified like a/b/c
*recurse* - index subfolders. Defaults to true.
*exclude* - comma separated list of patterns.
*include* - comma separated list of patterns.
*batchSize* - mails to fetch at once in a given folder. Only headers can be prefetched in JavaMail IMAP.
*readTimeout* - defaults to 6ms
*conectTimeout* - defaults to 3ms
*fetchSize* - IMAP config. 32KB default
*fetchMailsSince* - date/time in milliseconds; mails received after it will be fetched. Useful for delta import.
*customFilter* - class name.
{code}
import javax.mail.Folder;
import javax.mail.search.SearchTerm;

// the configured class implements this interface
clz implements MailEntityProcessor.CustomFilter {
  public SearchTerm getCustomSearch(Folder folder);
}
{code}
*processAttachement* - defaults to true

The below are the indexed fields.
{code}
// Fields To Index
// single valued
private static final String SUBJECT = "subject";
private static final String FROM = "from";
private static final String SENT_DATE = "sentDate";
private static final String XMAILER = "xMailer";
// multi valued
private static final String TO_CC_BCC = "allTo";
private static final String FLAGS = "flags";
private static final String CONTENT = "content";
private static final String ATTACHMENT = "attachement";
private static final String ATTACHMENT_NAMES = "attachementNames";
// flag values
private static final String FLAG_ANSWERED = "answered";
private static final String FLAG_DELETED = "deleted";
private static final String FLAG_DRAFT = "draft";
private static final String FLAG_FLAGGED = "flagged";
private static final String FLAG_RECENT = "recent";
private static final String FLAG_SEEN = "seen";
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
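The comma question raised in the comment above can be made concrete with a small sketch. It is not taken from the SOLR-934 patch: the class `FolderPatterns`, its method names, and the choice of `\,` as an escape for a literal comma are all assumptions for illustration. It shows how a plain split on commas cuts a {M,N} quantifier in half, and how one possible escape convention would avoid that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the SOLR-934 patch.
class FolderPatterns {

    // Naive split: every comma is a separator, so "\d{2,4}" is cut into two pieces.
    static List<String> naiveSplit(String spec) {
        List<String> out = new ArrayList<>();
        for (String s : spec.split(",")) {
            if (!s.trim().isEmpty()) {
                out.add(s.trim());
            }
        }
        return out;
    }

    // Assumed convention: "\," inside the list stands for a literal comma in a pattern.
    static List<String> escapedSplit(String spec) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < spec.length(); i++) {
            char c = spec.charAt(i);
            if (c == '\\' && i + 1 < spec.length() && spec.charAt(i + 1) == ',') {
                current.append(',');   // escaped comma stays inside the pattern
                i++;
            } else if (c == ',') {
                flush(out, current);   // unescaped comma ends the current pattern
            } else {
                current.append(c);
            }
        }
        flush(out, current);
        return out;
    }

    private static void flush(List<String> out, StringBuilder current) {
        String s = current.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
        current.setLength(0);
    }

    // True if the folder name fully matches at least one pattern.
    static boolean matchesAny(List<String> patterns, String folder) {
        for (String p : patterns) {
            if (Pattern.matches(p, folder)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(naiveSplit("Sent.*,Arch\\d{2,4}"));
        System.out.println(escapedSplit("Sent.*,Arch\\d{2\\,4}"));
    }
}
```

With the naive split the second pattern arrives as two fragments that are not valid regexes on their own, which is the case the comment argues is rare enough to ignore for folder names.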
[jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
[ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660274#action_12660274 ]

Noble Paul commented on SOLR-934:

This is a trivial thing. Other suggestions are really important.
[jira] Updated: (SOLR-934) Enable importing of mails into a solr index through DIH.
[ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Preetam Rao updated SOLR-934:

Attachment: SOLR-934.patch

Thanks for the comments and feedback, Noble and Shalin.

Attached is the latest patch, which calls init() as well as applyTransformer(). It receives fetchTimeSince in yyyy-MM-dd HH:mm:ss format. The exclude/include pattern is still comma separated. Cleanup is already handled in FolderIterator when it learns that all folders have been exhausted.

I could not attach the dependency jars (13MB). Single part or multi part with smaller size both fail...
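As a rough illustration of the date handling mentioned in the update above, the sketch below parses a timestamp, assuming the format is yyyy-MM-dd HH:mm:ss, and converts it to epoch milliseconds, the unit the original description used for fetchMailsSince. The class name `FetchSince` and the choice to interpret the value as UTC are assumptions; the patch itself may parse the value differently.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the SOLR-934 patch.
class FetchSince {

    // Assumed input format for the "fetch mails since" setting.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parses the configured value and returns epoch milliseconds.
    // Assumption: the value is interpreted as UTC.
    static long toEpochMillis(String value) {
        LocalDateTime time = LocalDateTime.parse(value.trim(), FORMAT);
        return time.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2009-01-02 00:00:00"));
    }
}
```

A malformed value makes `LocalDateTime.parse` throw a `DateTimeParseException`, so a bad setting would fail fast instead of silently fetching every mail.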
Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
Thanks Paul and Preetam. A couple of further things:

- How do you envisage this functionality being used? I can see indexing all emails for all users as part of a one-off system setup/migration process, but also as a core feature to ensure all emails received by a company/organisation are indexed (and stored). This could be done either by the end-user, who controls what should be indexed (i.e. certain work-related emails only), or directly from the mail server, where all emails would be indexed (including personal emails, which could later be deleted from the index if desired) to ensure no important emails get missed. Is this the sort of thing you had in mind? There is also the issue of not indexing/storing the same email from multiple users' mailboxes (haven't worked that one out yet, possibly via a hash).

- Is the mailbox 'configuration' (entity tag) stored in data-config.xml on the Solr server? If so, this would seem to have quite a lot of administrative overhead - how do you manage a system with 5000+ users? How are the accounts/passwords maintained? Are the passwords stored in plain text?

- Minor typo: *conectTimeout* should be *connectTimeout*

- A few real-world scenarios I've encountered are:
  - be able to handle an email sent to over 5000 recipients (in the 'To:' field)
  - be able to handle an email with a 'long' subject line (240+ characters)
  - be able to handle an email with 100 attachments
  - be able to handle an email with attachments with 'long' names (240+ characters)

  These caused several problems in the software I was using at the time (a proprietary system, not Solr-based): either memory-related issues, or file system errors when running on Windows, where the file system or its API limited file names to 255 characters, including the path.

Thanks very much!
Ben

--
From: Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com
Sent: Friday, January 02, 2009 5:02 AM
To: solr-dev@lucene.apache.org
Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Hi Ben,
You can take a look at the wiki page for DIH: http://wiki.apache.org/solr/DataImportHandler
It helps you index mostly structured data into Solr from db, xml, etc. It can be considered an ETL tool (http://en.wikipedia.org/wiki/Extract,_transform,_load ) for Solr. Adding mail support means you can index your emails into Solr with a few lines of configuration.
--Noble

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson ben.john...@jandpconsulting.co.uk wrote:

I'm watching this issue with interest, but I'm having trouble understanding the bigger picture. I am prototyping a system that uses Restlet to store and index objects (mainly MS Office and OpenOffice documents and emails), so I am planning to use Solr with Tika to index the objects.

I know nothing about DIH (Distributed Index Handler?), so I'm not sure what role it plays with Solr. Is it a vendor-specific technology (from Autonomy)? What does it do? Do you give it objects to index and it handles them by passing them to one or more Solr/Tika indexing servers? And are you thinking that this would therefore be a good place to not only index the objects, but also pass the information about the digital content to DROID?

Reading a bit about DROID (from TNA, The National Archives), it seems like it is used to capture information about the digital content of objects stored in a content repository. How does this fit with Solr? I thought Solr with Tika just did the indexing of text-based objects, but the actual storage of the objects would be elsewhere (probably in the file system). From what I can tell, DROID would operate on the file system objects, not the indexing information. Have I got this right?

Ideally, I would also like to convert any suitable content into PDF/A format for long-term archival - probably not relevant to this issue, but I thought I'd mention it in case you see an application of this as part of email and attachment storage.

Sorry for all the questions, but hopefully someone could clarify this for me!

Thanks very much
Ben Johnson

--
From: Grant Ingersoll (JIRA) j...@apache.org
Sent: Thursday, January 01, 2009 7:07 PM
To: solr-dev@lucene.apache.org
Subject: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

[ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660210#action_12660210 ]

Grant Ingersoll commented on SOLR-934:

Would it make more sense for DIH to farm out its content acquisition to a library like Droids? Then, we could have real crawling, etc. all through a pluggable connector framework.
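Ben's idea of using a hash to avoid indexing the same email from several mailboxes could look roughly like the sketch below. Nothing here comes from the patch: the class `MailFingerprint`, the preference for the Message-ID header, and the fallback to from/sent date/subject are assumptions chosen for illustration. The resulting hex string could serve as a document key so that a second copy overwrites the first instead of adding a duplicate.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the SOLR-934 patch.
class MailFingerprint {

    // Prefers the Message-ID header; falls back to a few headers when it is missing.
    // Which fields go into the fallback is an assumption, not a recommendation.
    static String fingerprint(String messageId, String from, String sentDate, String subject) {
        String basis;
        if (messageId != null && !messageId.trim().isEmpty()) {
            basis = "id:" + messageId.trim();
        } else {
            basis = "hdr:" + from + "\n" + sentDate + "\n" + subject;
        }
        return sha256Hex(basis);
    }

    // Returns the SHA-256 digest of the UTF-8 bytes as lowercase hex.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fingerprint("<abc@example.org>", "a@example.org", "2009-01-02", "Hello"));
    }
}
```

The fallback is weaker than the Message-ID path: two distinct mails with identical sender, date and subject would collide, so it is only a starting point.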
Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson ben.john...@jandpconsulting.co.uk wrote:
> - Is the mailbox 'configuration' (entity tag) stored in data-config.xml on the Solr server? If so, this would seem to have quite a lot of administrative overhead - how do you manage a system with 5000+ users? How are the accounts/passwords maintained? Are the passwords stored in plain text?

Do you wish all users' mails to be indexed into a single index? It is possible by passing the username and password as request parameters.
Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
Hi Paul

Yes, I was thinking that emails for all users would be indexed into a single index, at least conceptually. I'm thinking of a corporate/organisational repository that any user could search for relevant information, be that email or some other kind of document (e.g. MS Office, OpenOffice, PDF, etc.). An example usage would be government organisations in the United Kingdom that need to respond to Freedom of Information (FOI) requests and are therefore required by law to produce all information regarding a particular subject if requested (sensitive information excluded).

I haven't looked into architectural options for the indexes - I don't know if it is possible/desirable to split indexes up and use some sort of federated search to produce results, but at least conceptually I was thinking of a single source for the indexing information.

Regards
Ben

--
From: Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com
Sent: Friday, January 02, 2009 11:21 AM
To: solr-dev@lucene.apache.org
Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Do you wish all users mails to be indexed into single index ? it is possible by passing on the username password as request parameters .
Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
Quick clarifications:
- Droids: http://incubator.apache.org/droids/index.html
- DIH: http://wiki.apache.org/solr/DataImportHandler
- Solr + Tika: http://wiki.apache.org/solr/ExtractingRequestHandler

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ben Johnson ben.john...@jandpconsulting.co.uk
To: solr-dev@lucene.apache.org
Sent: Thursday, January 1, 2009 6:00:43 PM
Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

I know nothing about DIH (Distributed Index Handler?), so I'm not sure what role it plays with Solr. Is it a vendor-specific technology (from Autonomy)? What does it do?

Reading a bit about DROID (from TNA, The National Archives), it seems like it is used to capture information about the digital content of objects stored in a content repository. How does this fit with Solr? I thought Solr with Tika just did the indexing of text-based objects, but the actual storage of the objects would be elsewhere (probably in the file system).
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660384#action_12660384 ]

patrick o'leary commented on SOLR-773:

Lucene uses a static sort comparator cache, getCachedComparator, in Lucene's FieldSortedHitQueue.java. The assumption, I guess, is that the sort comparator would never have any data in it. As the distances in the geo sort are a hashmap produced by the distance query, the ScoreDocComparator creates a memory leak unless the scope of the distance query is within the process block. It's messy, but it is the only workaround I could find. Putting the distance query in the response builder could make this leak again.

Incorporate Local Lucene/Solr

Key: SOLR-773
URL: https://issues.apache.org/jira/browse/SOLR-773
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Priority: Minor
Attachments: SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, spatial-solr.tar.gz

Local Lucene has been donated to the Lucene project. It has some Solr components, but we should evaluate how best to incorporate it into Solr.
See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660387#action_12660387 ]

Ryan McKinley commented on SOLR-773:

Hmmm. I don't follow. Is the problem that the HashMap stays in static memory for each request? If so, could we put the map in the request context? Is this an issue with the Lucene Sort Comparator interface or with how the Solr implementation passes the results around?
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660389#action_12660389 ] patrick o'leary commented on SOLR-773: -- It's because of the FieldSortedHitQueue in lucene, even though sorts are generally created as new objects, the FieldSortedHitQueue maintains a static cache of them- Somebody actually had another work around http://mail-archives.apache.org/mod_mbox/lucene-java-user/200806.mbox/%3c571296.22735...@web50301.mail.re2.yahoo.com%3e I haven't tried it, but it might be an option.
Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.
On Fri, Jan 2, 2009 at 6:24 PM, Ben Johnson ben.john...@jandpconsulting.co.uk wrote:

Hi Paul

Yes, I was thinking that emails for all users would be indexed into a single index, at least conceptually. I'm thinking of a corporate/organisational repository that any user could search for relevant information, be that email or some other kind of document (e.g. MS Office, OpenOffice, PDF, etc...). An example usage would be for government organisations in the United Kingdom that need to respond to Freedom of Information (FOI) requests and are therefore required by law to produce all information regarding a particular subject if requested (sensitive information excluded).

I haven't looked into architectural options for the indexes - I don't know if it is possible/desirable to split indexes up and use some sort of federated search to produce results, but at least conceptually I was thinking of a single source for the indexing information.

A very simple solution is to keep all users in a single index and use 'fq' to limit the search to that user.

Regards
Ben

--
From: Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com
Sent: Friday, January 02, 2009 11:21 AM
To: solr-dev@lucene.apache.org
Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

On Fri, Jan 2, 2009 at 3:42 PM, Ben Johnson ben.john...@jandpconsulting.co.uk wrote:

Thanks Paul and Preetam. A couple of further things:

- How do you envisage this functionality being used? I can see indexing all emails for all users as part of a one-off system setup/migration process, but also as a core feature to ensure all emails received by a company/organisation are indexed (and stored). This could be done either by the end-user, who controls what should be indexed (i.e. certain work-related emails only) or directly from the mail server, where all emails would be indexed (including personal emails, which could later be deleted from the index if desired) to ensure no important emails get missed.
Is this the sort of thing you had in mind? There is also the issue of not indexing/storing the same email from multiple users' mailboxes (haven't worked that one out yet, possibly via a hash).

- Is the mailbox 'configuration' (entity tag) stored in data-config.xml on the Solr server? If so, this would seem to have quite a lot of

Do you wish all users' mails to be indexed into a single index? It is possible by passing the username and password as request parameters.

administrative overhead - how do you manage a system with 5000+ users? How are the accounts/passwords maintained? Are the passwords stored in plain text?

- Minor typo: *conectTimeout* should be *connectTimeout*

- A few real-world scenarios I've encountered are:
  - be able to handle an email sent to over 5000 recipients (in the 'To:' field)
  - be able to handle an email with a 'long' subject line (240+ characters)
  - be able to handle an email with 100 attachments
  - be able to handle an email with attachments with 'long' names (240+ characters)

This caused several problems in the software I was using at the time (a proprietary system, not Solr-based), either memory-related issues or file system errors when running on Windows where the file system or its API limited file names to 255 characters, including the path.

Thanks very much!
Ben

--
From: Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com
Sent: Friday, January 02, 2009 5:02 AM
To: solr-dev@lucene.apache.org
Subject: Re: [jira] Commented: (SOLR-934) Enable importing of mails into a solr index through DIH.

Hi Ben,

You can take a look at the wiki page for DIH http://wiki.apache.org/solr/DataImportHandler

It helps you index mostly structured data into Solr from db, xml etc. It can be considered as an ETL tool (http://en.wikipedia.org/wiki/Extract,_transform,_load) for Solr.
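Two ideas raised in this thread, de-duplicating the same email across mailboxes "possibly via a hash" and restricting searches in a single shared index with 'fq', could be sketched roughly as follows. This is a minimal Python illustration under assumptions and is not part of the SOLR-934 patch: the helper names, the `owner` field, the choice of hashed fields, and the base URL are all hypothetical. Only the `fq` request parameter itself comes from the discussion.

```python
import hashlib
from urllib.parse import urlencode


def message_fingerprint(message_id, sender, subject, body):
    """Return a stable SHA-256 hex digest for one email.

    Assumption: these four fields are enough to treat two copies of a
    message found in different mailboxes as the same document.
    """
    digest = hashlib.sha256()
    for part in (message_id, sender, subject, body):
        encoded = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def user_filtered_query(base_url, text, user):
    """Build a select URL whose 'fq' limits results to one user's mail.

    'owner' is a hypothetical index field holding the mailbox owner.
    """
    # Escape backslashes and quotes so the user name stays inside the phrase.
    escaped = user.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": text, "fq": f'owner:"{escaped}"'}
    return f"{base_url}/select?{urlencode(params)}"
```

The fingerprint could serve as the document's unique key so that a second copy of the same message overwrites the first instead of being stored twice. Whether that fits depends on how per-user ownership is recorded, which the thread leaves open.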
Adding mail support means you can index your emails into Solr with a few lines of configuration.

--Noble

On Fri, Jan 2, 2009 at 4:30 AM, Ben Johnson ben.john...@jandpconsulting.co.uk wrote:

I'm watching this issue with interest, but I'm having trouble understanding the bigger picture. I am prototyping a system that uses Restlet to store and index objects (mainly MS Office and OpenOffice documents and emails), so I am planning to use Solr with Tika to index the objects.

I know nothing about DIH (Distributed Index Handler?), so I'm not sure what role it plays with Solr. Is it a vendor-specific technology (from Autonomy)? What does it do? Do you give it objects to index and it handles them by passing it to one or more Solr/Tika indexing servers?

And are you thinking that this would therefore be a good place to not only index the objects, but also pass the information about the digital content to DROID? Reading a bit about DROID (from TNA,