[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution
[ https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920568#action_12920568 ]

Jack Krupansky commented on CONNECTORS-116:
---

It would be nice to see a comment about what would be required to add Memex support back. I note the following statement in the original incubation submission:

"It is unlikely that EMC, OpenText, Memex, or IBM would grant Apache-license-compatible use of these client libraries. Thus, the expectation is that users of these connectors obtain the necessary client libraries from the owners prior to building or using the corresponding connector. An alternative would be to undertake a clean-room implementation of the client APIs, which may well yield suitable results in some cases (LiveLink, Memex, FileNet), while being out of reach in others (Documentum). Conditional compilation, for the short term, is thus likely to be a necessity."

Is it only the Memex connector that now has this problem? Do we need to do a clean-room implementation for Memex? For any of the others? FWIW, I don't see a Google Connector for Memex.

> Possibly remove memex connector depending upon legal resolution
> ---
>           Key: CONNECTORS-116
>           URL: https://issues.apache.org/jira/browse/CONNECTORS-116
>       Project: ManifoldCF
>    Issue Type: Task
>    Components: Memex connector
>      Reporter: Robert Muir
>      Assignee: Robert Muir
>
> Apparently there is an IP problem with the memex connector code. Depending upon what apache legal says, we will take any action under this issue publicly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
Crawled archive files should be expanded into their constituent files
---
          Key: CONNECTORS-118
          URL: https://issues.apache.org/jira/browse/CONNECTORS-118
      Project: ManifoldCF
   Issue Type: New Feature
   Components: Framework crawler agent
     Reporter: Jack Krupansky

Archive files such as zip, mbox, tar, etc. should be expanded into their constituent files during crawling of repositories so that any output connector would output the flattened archive. This could be an option, defaulted to ON, since someone may want to implement a copy connector that maintains crawled files as-is.
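The flattening described above can be sketched with java.util.zip alone; the class and method names here are illustrative, not part of any MCF API, and a real connector would also need handlers for the tar, bzip2, and mbox formats discussed later in this thread:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveExpander {
    /** Expand a zip stream into entryName -> bytes, as a crawler might before indexing. */
    public static Map<String, byte[]> expandZip(InputStream in) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) > 0) out.write(buf, 0, n);
                files.put(entry.getName(), out.toByteArray());
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory zip, then flatten it.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(raw)) {
            zos.putNextEntry(new ZipEntry("docs/a.txt"));
            zos.write("hello".getBytes("UTF-8"));
            zos.closeEntry();
        }
        System.out.println(expandZip(new ByteArrayInputStream(raw.toByteArray())).keySet());
    }
}
```

Each flattened entry would then be handed to the output connector as its own document; the "option defaulted to ON" from the description would simply gate whether this expansion runs at all.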
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920730#action_12920730 ]

Jack Krupansky commented on CONNECTORS-118:
---

Support within the file system connector is obviously the higher priority. Windows shares as well. And FTP/SFTP.
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920787#action_12920787 ]

Jack Krupansky commented on CONNECTORS-118:
---

Aperture's approach was just a starting point for discussion for how to form an id for a file in an archive file. As long as the MCF rules are functionally equivalent to the Apache VFS rules, we should be okay. In short, my proposal does not have a requirement for what an id should look like, just a suggestion.
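For the id question, a Commons-VFS-style layered URI (scheme:outerUri!/entryPath) is one concrete convention that is functionally equivalent to what the comment suggests. This helper is just a sketch of that idea; the class and method names are hypothetical:

```java
public class ArchiveIds {
    /** Compose a Commons-VFS-style id: scheme:outerUri!/entryPath (illustrative convention only). */
    public static String compose(String scheme, String outerUri, String entryPath) {
        return scheme + ":" + outerUri + "!/" + entryPath;
    }

    /** Split such an id back into {outer, entry}; the "!/" separator marks the archive boundary. */
    public static String[] split(String id) {
        int bang = id.indexOf("!/");
        if (bang < 0) throw new IllegalArgumentException("not an archive id: " + id);
        return new String[] { id.substring(0, bang), id.substring(bang + 2) };
    }

    public static void main(String[] args) {
        String id = compose("zip", "file:///data/reports.zip", "2010/q3.txt");
        System.out.println(id); // zip:file:///data/reports.zip!/2010/q3.txt
        String[] parts = split(id);
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Because the "!/" separator never appears in an ordinary URI path, the composed id stays uniquely parseable even when the entry path itself contains slashes.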
[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801 ]

Jack Krupansky commented on CONNECTORS-118:
---

One of those VFS links points to all the Java packages used to access the list of archive formats I listed. I have personally written unit tests that generated most of those formats which Aperture then extracted.
[jira] Issue Comment Edited: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files
[ https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801 ]

Jack Krupansky edited comment on CONNECTORS-118 at 10/13/10 7:35 PM:
---

I have personally written unit tests that generated most of those formats which Aperture then extracted. See: http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

org.apache.tools.bzip2 - BZIP2 archives.
java.util.zip.GZIPInputStream - GZIP archives.
javax.mail - message/rfc822-style messages and mbox files.
org.apache.tools.tar - tar archives.

was (Author: jkrupan):
One of those VFS links points to all the Java packages used to access the list of archive formats I listed. I have personally written unit tests that generated most of those formats which Aperture then extracted.
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909036#action_12909036 ]

Jack Krupansky commented on CONNECTORS-98:
--

Looks good. This meets my expectations. Any further tweaks that might arise would be distinct Jira issues.

> API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
> ---
>  Affects Versions: LCF Release 0.5
>               Key: CONNECTORS-98
>               URL: https://issues.apache.org/jira/browse/CONNECTORS-98
>           Project: Apache Connectors Framework
>        Issue Type: Improvement
>        Components: API
>          Reporter: Jack Krupansky
>           Fix For: LCF Release 0.5
>
> (This was originally a comment on CONNECTORS-56 dated 7/16/2010.)
>
> It has come to my attention that the API would be more pure RESTful if the API verb was represented using the HTTP GET/PUT/POST/DELETE methods and the input argument identifier represented in the context path. So,
>
> GET outputconnection/get {connection_name:_connection_name_}
>   would be
> GET outputconnections/connection_name
>
> and
>
> GET outputconnection/delete {connection_name:_connection_name_}
>   would be
> DELETE outputconnections/connection_name
>
> and
>
> GET outputconnection/list
>   would be
> GET outputconnections
>
> and
>
> PUT outputconnection/save {outputconnection:_output_connection_object_}
>   would be
> PUT outputconnections/connection_name {outputconnection:_output_connection_object_}
>
> What we have today is certainly workable, but just not as pure as some might desire. It would be better to take care of this before the initial release so that we never have to answer the question of why it wasn't done as a proper RESTful API.
>
> BTW, I did check to verify that an HttpServlet running under Jetty can process the DELETE and PUT methods (using the doDelete and doPut method overrides). Also, POST should be usable as an alternative to PUT for API calls that have large volumes of data.
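The mapping proposed in the issue description amounts to a small dispatch on the (HTTP method, path) pair. A minimal sketch, assuming the context path has already been stripped; the legacy command strings returned here are illustrative, not the actual ACF internals:

```java
public class ApiRouter {
    /** Map an HTTP method + resource path onto the old verb-in-path style command
        (hypothetical mapping for illustration). Returns null for an unknown route (404). */
    public static String route(String method, String path) {
        String[] seg = path.split("/");
        if (seg.length >= 1 && seg[0].equals("outputconnections")) {
            if (method.equals("GET") && seg.length == 1) return "outputconnection/list";
            if (method.equals("GET") && seg.length == 2) return "outputconnection/get " + seg[1];
            if (method.equals("DELETE") && seg.length == 2) return "outputconnection/delete " + seg[1];
            // POST accepted as an alternative to PUT for large payloads, per the description.
            if ((method.equals("PUT") || method.equals("POST")) && seg.length == 2)
                return "outputconnection/save " + seg[1];
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(route("GET", "outputconnections/myconn")); // outputconnection/get myconn
    }
}
```

Under Jetty this dispatch would live in the servlet's doGet/doPut/doPost/doDelete overrides, which the description confirms are available.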
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908581#action_12908581 ]

Jack Krupansky commented on CONNECTORS-98:
--

Just to confirm, as requested, that I am comfortable sticking with connection name (and job name, etc.) in API paths as opposed to using a more abstract id, since we seem to have an encoding convention to deal with slash so that an ACF object name can always be represented using a single HTTP path segment. Names clearly feel more natural and will be easier to use, both for app code using the ACF API and for curl and other scripting tools.
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908148#action_12908148 ]

Jack Krupansky commented on CONNECTORS-98:
--

I am still pondering this embedded slash issue and checking into some things related to it. Maybe Monday I'll have something more concrete to say. For example, I want to make sure I understand the rules for what a path can have in it in a URI, and whether simply placing a name at the tail of the path means it can have slashes or other reserved characters in it. My model is that a name should occupy only a single path component.
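One plausible encoding convention for keeping a name inside a single path segment is ordinary percent-encoding. This sketch leans on java.net.URLEncoder, which targets HTML form encoding, so its '+' for space is post-fixed to %20; the helper names are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class SegmentCodec {
    /** Encode an object name so it always fits in one path segment:
        '/' becomes %2F, space becomes %20, etc. */
    public static String encodeSegment(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    /** Reverse of encodeSegment. */
    public static String decodeSegment(String segment) {
        try {
            return URLDecoder.decode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("my/conn name")); // my%2Fconn%20name
    }
}
```

With a convention like this, a name containing a slash still maps to exactly one path component, which is the invariant the comment asks for.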
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907614#action_12907614 ]

Jack Krupansky commented on CONNECTORS-98:
--

I have looked at the code a bit but not made any actual progress toward a patch, so you can go ahead and take a crack at it. Yes, I'll do the transformation table. As far as updating the wiki, do I have privileges to do that?
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907712#action_12907712 ]

Jack Krupansky commented on CONNECTORS-98:
--

Some RESTful resource doc:

http://en.wikipedia.org/wiki/Representational_State_Transfer
http://www.xfront.com/REST-Web-Services.html
http://www.oracle.com/technetwork/articles/javase/table3-138001.html

The idea of using a plural is that it is the name of the collection and the qualifier (name or argument object) provides the specificity.
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907735#action_12907735 ]

Jack Krupansky commented on CONNECTORS-98:
--

I think status is probably technically okay since it is disambiguated by the number of path elements, but it could be moved to the end:

GET outputconnections/connection_name/status ()
  vs.
GET outputconnections/status/connection_name ()

Same for execute/request:

GET outputconnections/connection_name/request/command (arguments)
  vs.
GET outputconnections/request/connection_name/command (arguments)

That way the connection name is always in the same position. So, I'd revise my counter-proposal that way.
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907736#action_12907736 ]

Jack Krupansky commented on CONNECTORS-98:
--

re: "We could not pass (arguments) except as part of the path."

Sure, we could go that route, and list the arguments as path elements, but I think a JSON object (array list of arguments) is acceptable. So, I'd go with the latter (JSON).
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907758#action_12907758 ]

Jack Krupansky commented on CONNECTORS-98:
--

re: "the command cannot itself contain / characters, or it won't be uniquely parseable"

Elsewhere I noted that URI-reserved characters need to be encoded with the % notation, so this is not a fatal problem.

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907875#action_12907875 ]

Jack Krupansky commented on CONNECTORS-98:
--

It makes sense that getPathInfo would have removed escapes from the URL. So, either we don't use % escaping or we bypass getPathInfo and manually decode. Maybe we could use backslash for escaping. I'm not sure whether it needs to be % escaped as well. This is only needed if the user has one of the reserved special characters in a name. It would be an issue if it were something that users commonly needed, but it seems like more of an edge case rather than a common case. Encourage people to use alphanumerics, -, and _ for names and it won't be an issue for them. And, the real point of the API is access from code. We can provide helper functions for working with names and building API paths.
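Since getPathInfo hands back an already-decoded path (a %2F in a name becomes a literal slash, so the segment boundary is lost), the "bypass and manually decode" route would split the raw, still-encoded URI first and decode each segment afterwards. A sketch, assuming the raw path string is available (e.g. from getRequestURI):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;

public class RawPathParser {
    /** Split a RAW (still percent-encoded) path, then decode each segment,
        so an encoded slash inside a name survives as a literal slash. */
    public static List<String> segments(String rawPath) throws UnsupportedEncodingException {
        List<String> out = new ArrayList<>();
        for (String seg : rawPath.split("/")) {
            if (!seg.isEmpty()) out.add(URLDecoder.decode(seg, "UTF-8"));
        }
        return out;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(segments("/outputconnections/my%2Fconn")); // [outputconnections, my/conn]
    }
}
```

Routing on these decoded segments keeps one name per path component, which sidesteps the backslash-escaping question entirely for code that uses the helper.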
[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site
[ https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907201#action_12907201 ]

Jack Krupansky commented on CONNECTORS-104:
---

Simple works best. This enhancement is primarily for the simple use case where a novice user tries to do what they think is obvious (crawl the web pages at this URL), but without considering all of the potential nuances or how to fully specify the details of their goal.

One nuance is whether subdomains are considered part of the domain. I would say no if a subdomain was specified by the user and yes if no subdomain was specified.

Another nuance is whether a path is specified to select a subset of a domain. It would be nice to handle that and (optionally) limit the crawl to that path (or sub-paths below it). An example would be to crawl the news archive for a site.

> Make it easier to limit a web crawl to a single site
> ---
>           Key: CONNECTORS-104
>           URL: https://issues.apache.org/jira/browse/CONNECTORS-104
>       Project: Apache Connectors Framework
>    Issue Type: Improvement
>    Components: Web connector
>      Reporter: Jack Krupansky
>      Priority: Minor
>
> Unless the user explicitly enters an include regex carefully, a web crawl can quickly get out of control and start crawling the entire web when all the user may really want is to crawl just a single web site or portion thereof. So, it would be preferable if either by default or with a simple button the crawl could be limited to the seed web site(s).
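The subdomain rule proposed above (exact match when the user names a specific host, subdomains allowed when they name a bare domain) might look like the following. The class is hypothetical, and the two-label test for "bare domain" is a deliberate simplification that ignores multi-part TLDs such as co.uk:

```java
public class SeedHostFilter {
    /** Is candidateHost in scope for a crawl seeded at seedHost?
        Bare domain seed (example.com): the host itself and any subdomain qualify.
        Specific host seed (www.example.com): only an exact match qualifies. */
    public static boolean inScope(String seedHost, String candidateHost) {
        if (candidateHost.equalsIgnoreCase(seedHost)) return true;
        // Naive heuristic: exactly one dot means "no subdomain was specified".
        boolean bareDomain = seedHost.chars().filter(c -> c == '.').count() == 1;
        return bareDomain && candidateHost.toLowerCase().endsWith("." + seedHost.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(inScope("example.com", "news.example.com"));     // true
        System.out.println(inScope("www.example.com", "news.example.com")); // false
    }
}
```

The optional path restriction from the comment would be a second check of the same shape, testing that the candidate URL's path starts with the seed's path.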
[jira] Created: (CONNECTORS-104) Make it easier to limit a web crawl to a single site
Make it easier to limit a web crawl to a single site
---
 Affects Versions: LCF Release 0.5
              Key: CONNECTORS-104
              URL: https://issues.apache.org/jira/browse/CONNECTORS-104
          Project: Apache Connectors Framework
       Issue Type: Improvement
       Components: Web connector
         Reporter: Jack Krupansky
         Priority: Minor
          Fix For: LCF Release 0.5
[jira] Commented: (CONNECTORS-57) Solr output connector option to commit at end of job, by default
[ https://issues.apache.org/jira/browse/CONNECTORS-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904746#action_12904746 ]

Jack Krupansky commented on CONNECTORS-57:
------------------------------------------

This looks fine so far and should work for me. If I understand the code, the Connector.noteJobComplete method is called when the job completes or is aborted, and the SolrConnector.noteJobComplete implementation unconditionally does a commit. That's fine for my use case, but we probably still want a connection option to disable that commit if the user has some other commit strategy in mind.

Solr output connector option to commit at end of job, by default
Key: CONNECTORS-57
URL: https://issues.apache.org/jira/browse/CONNECTORS-57
Project: Apache Connectors Framework
Issue Type: Sub-task
Components: Lucene/SOLR connector
Reporter: Jack Krupansky

By default, Solr will eventually commit documents that have been submitted to the Solr Cell interface, but the time lag can confuse and annoy people. Although commit strategy is a difficult issue in general, an option in LCF to automatically commit at the end of a job, by default, would eliminate a lot of potential confusion and generally be close to what the user needs. The desired feature is that there be an option to commit for each job that uses the Solr output connector. This option would default to on (or a different setting based on some global configuration setting), but the user may turn it off if commit is only desired upon completion of some jobs.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
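The end-of-job commit option proposed in the comment can be sketched in isolation. This is illustrative only (not the actual ManifoldCF SolrConnector): the class name is invented, and the commit itself is stubbed out where a real implementation would post a `<commit/>` request to Solr.

```java
// Hypothetical sketch of the proposed behavior: an end-of-job hook that
// issues a Solr commit unless the connection was configured to skip it.
public class CommitPolicy {
    private final boolean commitOnJobEnd; // the proposed connection option
    private boolean committed = false;

    public CommitPolicy(boolean commitOnJobEnd) {
        this.commitOnJobEnd = commitOnJobEnd;
    }

    // Called when a job completes or is aborted, mirroring the
    // noteJobComplete hook described in the comment.
    public void noteJobComplete() {
        if (commitOnJobEnd) {
            committed = true; // stand-in for posting a commit to Solr
        }
    }

    public boolean didCommit() {
        return committed;
    }
}
```

Defaulting `commitOnJobEnd` to true matches the issue's request, while still letting a user with their own commit strategy turn it off.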
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903559#action_12903559 ]

Jack Krupansky commented on CONNECTORS-98:
------------------------------------------

I'll mostly be looking through code, thinking it through, and looking at the API string changes first, so I may not touch any code for another week, if not longer. Feel free to rename or refactor code at will. I'll probably let you know in advance what changes I expect to make in the code.

API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
Key: CONNECTORS-98
URL: https://issues.apache.org/jira/browse/CONNECTORS-98
Project: Apache Connectors Framework
Issue Type: Improvement
Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
Fix For: LCF Release 0.5

(This was originally a comment on CONNECTORS-56 dated 7/16/2010.)

It has come to my attention that the API would be more purely RESTful if the API verb were represented using the HTTP GET/PUT/POST/DELETE methods and the input argument identifier were represented in the context path. So:

  GET outputconnection/get {connection_name: <connection_name>} would be GET outputconnections/<connection_name>
  GET outputconnection/delete {connection_name: <connection_name>} would be DELETE outputconnections/<connection_name>
  GET outputconnection/list would be GET outputconnections
  PUT outputconnection/save {outputconnection: <output_connection_object>} would be PUT outputconnections/<connection_name> {outputconnection: <output_connection_object>}

What we have today is certainly workable, but just not as pure as some might desire. It would be better to take care of this before the initial release so that we never have to answer the question of why it wasn't done as a proper RESTful API. BTW, I did check to verify that an HttpServlet running under Jetty can process the DELETE and PUT methods (using the doDelete and doPut method overrides).

Also, POST should be usable as an alternative to PUT for API calls that have large volumes of data.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
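The method-plus-path mapping proposed above can be sketched as a small routing function that translates an incoming HTTP method and context path into the legacy verb form. This is a hypothetical illustration, not ManifoldCF code; in a real servlet the doGet/doPut/doDelete overrides mentioned in the comment would call something like this.

```java
// Minimal sketch of the proposed REST routing: map (HTTP method, path)
// onto the existing verb-style command names.
public class RestRouter {
    public static String route(String method, String path) {
        // path looks like "outputconnections" or "outputconnections/myconn"
        String[] parts = path.split("/", 2);
        String resource = parts[0];
        // "outputconnections" (plural path) -> "outputconnection" (verb prefix)
        String singular = resource.endsWith("s")
            ? resource.substring(0, resource.length() - 1) : resource;
        boolean hasName = parts.length == 2;
        switch (method) {
            case "GET":    return hasName ? singular + "/get" : singular + "/list";
            case "DELETE": return singular + "/delete";
            case "PUT":
            case "POST":   return singular + "/save"; // POST as PUT alternative
            default:       return null;
        }
    }
}
```

Note how the comment's suggestion that POST serve as an alternative to PUT for large payloads falls out naturally: both methods route to the same save operation.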
[jira] Created: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
Key: CONNECTORS-98
URL: https://issues.apache.org/jira/browse/CONNECTORS-98
Project: Apache Connectors Framework
Issue Type: Improvement
Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
Fix For: LCF Release 0.5

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902982#action_12902982 ]

Jack Krupansky commented on CONNECTORS-98:
------------------------------------------

Karl asks "what do you plan to do for the list and execute verbs?" List would be a GET and execute would be a PUT.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods
[ https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902983#action_12902983 ]

Jack Krupansky commented on CONNECTORS-98:
------------------------------------------

Karl says "I await your patch." Point well made. There is a great starting point with the current code. A bit of refactoring is required.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-56) All features should be accessible through an API
[ https://issues.apache.org/jira/browse/CONNECTORS-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888377#action_12888377 ]

Jack Krupansky commented on CONNECTORS-56:
------------------------------------------

Some cURL and/or Perl test scripts to illustrate use of the API would be helpful.

All features should be accessible through an API
Key: CONNECTORS-56
URL: https://issues.apache.org/jira/browse/CONNECTORS-56
Project: Lucene Connector Framework
Issue Type: Sub-task
Components: Framework core
Reporter: Jack Krupansky

LCF consists of a full-featured crawling engine and a full-featured user interface to access the features of that engine, but some applications are better served by a full API that lets the application control the crawling engine, including creation and editing of connections and creation, editing, and control of jobs. Put simply, everything that a user can accomplish via the LCF UI should be doable through an LCF API. All LCF objects should be queryable through the API.

A primary use case is Solr applications which currently use Aperture for crawling but would prefer the full-featured capabilities of LCF as a crawling engine over Aperture.

I do not wish to over-specify the API in this initial description, but I think the LCF API should probably be a traditional REST API, with some of the API elements specified via the context path, some parameters via URL query parameters, and complex, detailed structures as JSON (or similar). The precise details of the API are beyond the scope of this initial description and will be added incrementally once the high-level approach to the API becomes reasonably settled.

A job status and event reporting scheme is also needed in conjunction with the LCF API. That requirement has already been captured as CONNECTORS-41.

The intention for the API is to create, edit, access, and control all of the objects managed by LCF. The main focus is on repositories, jobs, and status, and less on document-specific crawling information, but there may be some benefit to querying crawling status for individual documents as well.

Nothing in this proposal should in any way limit or constrain the features that will be available in the LCF UI. The intent is that LCF should continue to have a full-featured UI in addition to a full-featured API.

Note: This issue is part of Phase 2 of the CONNECTORS-50 umbrella issue.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-60) Agent process should be started automatically
[ https://issues.apache.org/jira/browse/CONNECTORS-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888000#action_12888000 ]

Jack Krupansky commented on CONNECTORS-60:
------------------------------------------

Unless I am mistaken, the Jetty integration is for QuickStart (single process) only. The issue is for non-QuickStart, multi-process execution.

Agent process should be started automatically
Key: CONNECTORS-60
URL: https://issues.apache.org/jira/browse/CONNECTORS-60
Project: Lucene Connector Framework
Issue Type: Sub-task
Reporter: Jack Krupansky

LCF as it exists today is a bit too complex to run for an average user, especially with a separate agent process for crawling. LCF should be as easy to run as Solr is today. QuickStart is a good move in this direction, but the same user-visible simplicity is needed for full LCF. The separate agent process is a reasonable design for execution, but a little too cumbersome for the average user to manage. Unfortunately, it is expected that starting up a multi-process application will require platform-specific scripting.

Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-55) Bundle database server with LCF packaged product
[ https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886720#action_12886720 ]

Jack Krupansky commented on CONNECTORS-55:
------------------------------------------

Karl notes that "we've had to mess with the stuffer query on pretty near every point release of Postgresql." Letting (or forcing) the user to pick the right, acceptable release of PostgreSQL to install is error prone and a support headache. I would argue that it is better for the LCF team to bundle the right/best release of PostgreSQL with LCF.

Bundle database server with LCF packaged product
Key: CONNECTORS-55
URL: https://issues.apache.org/jira/browse/CONNECTORS-55
Project: Lucene Connector Framework
Issue Type: Improvement
Components: Framework core
Reporter: Jack Krupansky

The current requirement that the user install and deploy a PostgreSQL server complicates the installation and deployment of LCF for the user. Installation and deployment of LCF should be as simple as Solr itself. QuickStart is great for the low end and basic evaluation, but a comparable level of simplified installation and deployment is still needed for full-blown, high-end environments that need the full performance of a PostgreSQL-class database server.

So, PostgreSQL should be bundled with the packaged release of LCF so that installation and deployment of LCF will automatically install and deploy a subset of the full PostgreSQL distribution that is sufficient for the needs of LCF. Starting LCF, with or without the LCF UI, should automatically start the database server. Shutting down LCF should also shut down the database server process.

A typical use case would be a non-developer who is comfortable with Solr and simply wants to crawl documents from, for example, a SharePoint repository and feed them into Solr. QuickStart should work well for the low end or in the early stages of evaluation, but the user would prefer to evaluate the real thing with something resembling a production crawl of thousands of documents. Such a user might not be a hard-core developer or be comfortable fiddling with a lot of software components simply to do one conceptually simple operation.

It should still be possible for the user to supply database server settings to override the defaults, but the LCF package should have all of the best-practice settings deemed appropriate for use with LCF.

One downside is that installation and deployment will be platform-specific, since there are multiple processes and PostgreSQL itself requires a platform-specific installation.

This proposal presumes that PostgreSQL is the best option for the foreseeable future, but nothing here is intended to preclude support for other database servers in future releases. This proposal should not have any impact on QuickStart packaging or deployment.

Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
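The proposed lifecycle coupling (LCF startup brings up the bundled database, shutdown stops it) can be sketched as building the control commands for a PostgreSQL subset living in a subdirectory of the unzipped distribution. This is a hypothetical sketch: the `database/` layout and class name are invented, while `pg_ctl` is PostgreSQL's standard server-control utility.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: command lines LCF startup/shutdown might run to
// control a PostgreSQL bundle in <lcfHome>/database (assumed layout).
public class BundledDb {
    public static List<String> startCommand(String lcfHome) {
        return Arrays.asList(
            lcfHome + "/database/bin/pg_ctl", "start",
            "-D", lcfHome + "/database/data", // data directory in the bundle
            "-w");                            // wait until the server is ready
    }

    public static List<String> stopCommand(String lcfHome) {
        return Arrays.asList(
            lcfHome + "/database/bin/pg_ctl", "stop",
            "-D", lcfHome + "/database/data",
            "-m", "fast");                    // disconnect clients, then stop
    }
    // In real use: new ProcessBuilder(startCommand(home)).inheritIO().start();
}
```

Keeping the commands as data makes the platform-specific part (how the process is actually launched) a single, swappable step, which matches the issue's concern about multi-process, platform-specific deployment.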
[jira] Commented: (CONNECTORS-55) Bundle database server with LCF packaged product
[ https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886724#action_12886724 ]

Jack Krupansky commented on CONNECTORS-55:
------------------------------------------

When Karl says "It *does* limit your ability to use other commands simultaneously" (referring to use of embedded Derby), he is referring to commands executed using the executecommand shell script, such as registering and unregistering connectors. That is something typically done once before starting the UI, or once in a blue moon when you want to support a new type of repository, but not done on as regular a basis as editing connections and jobs and running jobs. The Java classes to execute those commands would be, by definition, outside of the LCF process.

Bundle database server with LCF packaged product
Key: CONNECTORS-55
URL: https://issues.apache.org/jira/browse/CONNECTORS-55

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-55) Bundle database server with LCF packaged product
[ https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886490#action_12886490 ]

Jack Krupansky commented on CONNECTORS-55:
------------------------------------------

I was using the term "install" loosely: not so much the way a typical package has a GUI wizard and lots of stuff going on, but more in the sense of raw Solr, where you download, unzip, and the files are in subdirectories right where they need to be. In that sense, the theory is that a subset of PostgreSQL could live in a subdirectory. Some enterprising vendor, such as Lucid Imagination, might want to have a fancy GUI install, but that would be beyond the scope of what I intended here.

Bundle database server with LCF packaged product
Key: CONNECTORS-55
URL: https://issues.apache.org/jira/browse/CONNECTORS-55

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Created: (CONNECTORS-56) All features should be accessible through an API
All features should be accessible through an API
Key: CONNECTORS-56
URL: https://issues.apache.org/jira/browse/CONNECTORS-56
Project: Lucene Connector Framework
Issue Type: Improvement
Components: Framework core
Reporter: Jack Krupansky

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Created: (CONNECTORS-50) Proposal for initial two releases of LCF, including packaged product and full API
Proposal for initial two releases of LCF, including packaged product and full API
Key: CONNECTORS-50
URL: https://issues.apache.org/jira/browse/CONNECTORS-50
Project: Lucene Connector Framework
Issue Type: New Feature
Components: Framework core
Reporter: Jack Krupansky

Currently, LCF has a relatively high bar for evaluation and use, requiring developer expertise. Also, although LCF has a comprehensive UI, it is not currently packaged for use as a crawling engine for advanced applications. A small set of individual feature requests is needed to address these issues. They are summarized briefly here to show how they fit together for two initial releases of LCF, but will be broken out into individual LCF Jira issues.

Goals:
1. LCF as a standalone, downloadable, usable-out-of-the-box product (much as Solr is today)
2. LCF as a toolkit for developers needing customized crawling and repository access
3. An API-based crawling engine that can be integrated with applications (as Aperture is today)

Larger goals:
1. Make it very easy for users to evaluate LCF.
2. Make it very easy for developers to customize LCF.
3. Make it very easy for applications to fully manage and control LCF in operation.

Two phases:
1) A standalone, packaged app that is super-easy to evaluate and deploy. Call it LCF 0.5.
2) An API-based crawling engine for applications for which the UI might not be appropriate. Call it LCF 1.0.

Phase 1
-------
LCF 0.5 right out of the box would interface loosely with Solr 1.4 or later. It would contain roughly the features that are currently in place or currently underway, plus a little more. Specifically, LCF 0.5 would contain these additional capabilities:
1. Plug-in architecture for connectors (already underway)
2. Packaged app ready to run with embedded Jetty app server (I think this has been agreed to)
3. Bundled with a database - PostgreSQL or Derby - ready to run without additional manual setup
4. Mini-API to initially configure default connections and example jobs for file system and web crawl
5. Agent process started automatically (platform-specific startup required)
6. Solr output connector option to commit at end of job, by default

Installation and basic evaluation of LCF would be essentially as simple as Solr is today. The example connections and jobs would permit the user to initiate example crawls of a file system example directory and an example web on the LCF web site with just a couple of clicks (as opposed to the detailed manual setup required today to create repository and output connections and jobs). It is worth considering whether the SharePoint connector could also be included as part of the default package. Users could then add additional connectors and repositories and jobs as desired.

Timeframe for release? Level of effort?

Phase 2
-------
The essence of Phase 2 is that LCF would be split to allow direct, full API access to LCF as a crawling engine, in addition to the full LCF UI. Call this LCF 1.0. Specifically, LCF 1.0 would contain these additional capabilities:
1. Full API for LCF as a crawling engine
2. LCF can be bundled within an app (such as the default LCF package itself with its UI)
3. LCF event and activity notification for full control by an application (already a Jira request)

Overall, LCF will offer roughly the same crawling capabilities as LCF 0.5, plus whatever bug fixes and minor enhancements might also be added.

Timeframe for release? Level of effort?

Issues:
- Can we package PostgreSQL with LCF so LCF can set it up?
- Or do we need Derby for that purpose?
- Managing multiple processes (UI, database, agent, app processes)
- What exactly would the API look like? (URL, XML, JSON, YAML?)

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (CONNECTORS-37) LCF should use an XML configuration file, not the simple name/value config file it currently has
[ https://issues.apache.org/jira/browse/CONNECTORS-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874029#action_12874029 ]

Jack Krupansky commented on CONNECTORS-37:
------------------------------------------

I'll defer to the community on the logging issue, other than to simply say that it should be as standard as possible and relatively compatible with how Solr does logging, so that it will not surprise people.

I don't have a problem with the LCF .properties file per se, other than the fact that, since it is restricted to strictly keyword/value pairs, it cannot contain more complex, structured configuration information.

The main thing I'd like to see is that the current executecommand configuration setup, such as which output connectors and crawlers to register, be done using descriptions in a config file rather than discrete shell commands to execute manually. The default config file from svn checkout should have a default set of connectors, crawlers, etc., and have commented-out entries for other connectors that people can un-comment and edit as desired. A key advantage of having such a config file is that when people report problems here, we can ask them to provide their config file rather than ask them to re-type whatever commands they remember intending to type.

Whether connections and jobs can be initially created from a config file is a larger discussion. The main point here is simply that it should be easy to get LCF initialized and configured for the really basic stuff needed for a typical initial evaluation (comparable to what occurs in a Solr tutorial). The proverbial zero-hour experience.

LCF should use an XML configuration file, not the simple name/value config file it currently has
Key: CONNECTORS-37
URL: https://issues.apache.org/jira/browse/CONNECTORS-37
Project: Lucene Connector Framework
Issue Type: Improvement
Components: Framework core
Reporter: Karl Wright

LCF's configuration file is limited in what it can specify, and XML configuration files seem to offer more flexibility and are the modern norm. Before backwards compatibility becomes an issue, it may therefore be worth converting the property file reader to use XML rather than name/value format. It would also be nice to be able to fold the logging configuration into the same file, if this seems possible.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
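The kind of declarative registration the comment argues for can be sketched with the JDK's built-in DOM parser. This is purely illustrative: the element names, attributes, and connector class names below are invented, not the format ManifoldCF adopted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: connector registrations declared in an XML config
// file instead of issued as discrete executecommand shell invocations.
public class XmlConfig {
    // Invented sample of such a config file.
    static final String SAMPLE =
        "<lcf-configuration>"
      + "<repositoryconnector name='File system' class='org.example.FileConnector'/>"
      + "<outputconnector name='Solr' class='org.example.SolrConnector'/>"
      + "</lcf-configuration>";

    // Collect the connector classes a startup routine would register.
    public static List<String> registeredClasses(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> classes = new ArrayList<>();
            for (String tag : new String[] {"repositoryconnector", "outputconnector"}) {
                NodeList nodes = doc.getElementsByTagName(tag);
                for (int i = 0; i < nodes.getLength(); i++) {
                    classes.add(((Element) nodes.item(i)).getAttribute("class"));
                }
            }
            return classes;
        } catch (Exception e) {
            throw new RuntimeException("malformed configuration", e);
        }
    }
}
```

A side benefit the comment mentions falls out of this design: a user reporting a problem can attach this one file, which fully reproduces their registration state.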