[jira] Commented: (CONNECTORS-116) Possibly remove memex connector depending upon legal resolution

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920568#action_12920568
 ] 

Jack Krupansky commented on CONNECTORS-116:
---

It would be nice to see a comment about what would be required to add Memex 
support back.

I note the following statement in the original incubation submission:

It is unlikely that EMC, OpenText, Memex, or IBM would grant 
Apache-license-compatible use of these client libraries. Thus, the expectation 
is that users of these connectors obtain the necessary client libraries from 
the owners prior to building or using the corresponding connector. An 
alternative would be to undertake a clean-room implementation of the client 
API's, which may well yield suitable results in some cases (LiveLink, Memex, 
FileNet), while being out of reach in others (Documentum). Conditional 
compilation, for the short term, is thus likely to be a necessity.

Is it only the Memex connector that now has this problem?

Do we need to do a clean-room implementation for Memex? For any of the others?

FWIW, I don't see a Google Connector for Memex.


 Possibly remove memex connector depending upon legal resolution
 ---

 Key: CONNECTORS-116
 URL: https://issues.apache.org/jira/browse/CONNECTORS-116
 Project: ManifoldCF
  Issue Type: Task
  Components: Memex connector
Reporter: Robert Muir
Assignee: Robert Muir

 Apparently there is an IP problem with the memex connector code.
 Depending upon what apache legal says, we will take any action under this 
 issue publicly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)
Crawled archive files should be expanded into their constituent files
-

 Key: CONNECTORS-118
 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Framework crawler agent
Reporter: Jack Krupansky


Archive files such as zip, mbox, tar, etc. should be expanded into their 
constituent files during crawling of repositories so that any output connector 
would output the flattened archive.

This could be an option, defaulted to ON, since someone may want to implement a 
copy connector that maintains crawled files as-is.
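
A minimal sketch of the idea in Python, not ManifoldCF code. The function name `expand_archive`, the `expand` flag, and the `archive!/member` naming are all hypothetical, and only zip and tar are handled, via the standard library.

```python
import io
import tarfile
import zipfile


def expand_archive(name, data, expand=True):
    """Yield (document_name, bytes) pairs for one crawled file.

    Hypothetical helper: with expand=True (the proposed default) a zip or
    tar archive is flattened into its member files; otherwise, or for any
    other kind of file, the file is passed through as-is.
    """
    buf = io.BytesIO(data)
    if expand and zipfile.is_zipfile(buf):
        buf.seek(0)
        with zipfile.ZipFile(buf) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield f"{name}!/{info.filename}", zf.read(info)
        return
    buf.seek(0)
    if expand:
        try:
            # Mode "r" also reads gzip/bzip2/xz-compressed tar files.
            with tarfile.open(fileobj=buf) as tf:
                members = [(m.name, tf.extractfile(m).read())
                           for m in tf.getmembers() if m.isfile()]
        except tarfile.TarError:
            members = None  # not a tar file
        if members is not None:
            for member_name, content in members:
                yield f"{name}!/{member_name}", content
            return
    # Not an archive, or expansion is switched off: keep the file as-is.
    yield name, data
```

With `expand=False` the archive reaches the output connector unchanged, which is the "copy connector" case mentioned above.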





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920730#action_12920730
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Support within the file system connector is obviously the higher priority. 
Windows shares as well. And FTP/SFTP.





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920787#action_12920787
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Aperture's approach was just a starting point for discussion for how to form an 
id for a file in an archive file. As long as the MCF rules are functionally 
equivalent to the Apache VFS rules, we should be okay.

In short, my proposal does not have a requirement for what an id should look 
like, just a suggestion.
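
To make the suggestion concrete, here is one possible id scheme sketched in Python. It is loosely modeled on the layered `scheme:archive-uri!/member-path` style of URL; the function names and the exact format are assumptions for illustration, not an MCF or Aperture rule.

```python
def member_id(archive_id, member_path, scheme="zip"):
    """Build an id for a file inside an archive (hypothetical scheme).

    The archive's own id is kept intact and the member path is appended
    after a "!/" separator, so nested archives simply repeat the pattern.
    """
    return f"{scheme}:{archive_id}!/{member_path.lstrip('/')}"


def split_member_id(doc_id):
    """Split an id built by member_id into (scheme, archive_id, member_path).

    Assumes the innermost member path does not itself contain "!/".
    """
    scheme, rest = doc_id.split(":", 1)
    archive_id, member_path = rest.rsplit("!/", 1)
    return scheme, archive_id, member_path
```

For example, `member_id("file:///data/mail.zip", "2010/inbox.mbox")` gives `zip:file:///data/mail.zip!/2010/inbox.mbox`, and that id can in turn serve as the archive id for a message inside the mbox file.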





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

One of those VFS links points to all the Java packages used to access the list 
of archive formats I listed. I have personally written unit tests that 
generated most of those formats which Aperture then extracted.





[jira] Issue Comment Edited: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920801#action_12920801
 ] 

Jack Krupansky edited comment on CONNECTORS-118 at 10/13/10 7:35 PM:
-

I have personally written unit tests that generated most of those formats which 
Aperture then extracted.

See:
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

org.apache.tools.bzip2 - BZIP2 archives.
java.util.zip.GZIPInputStream - GZIP archives.
javax.mail   - message/rfc822-style messages and mbox files.
org.apache.tools.tar - tar archives.
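
Before any of these libraries can be applied, a crawler has to decide which format it is looking at. The sketch below, in Python rather than Java, guesses the format from well-known leading bytes. The function name is hypothetical and the check is deliberately crude (for instance, an empty zip file is not recognised).

```python
def sniff_archive_format(header):
    """Guess an archive format from the first bytes of a file.

    Returns "gzip", "bzip2", "zip", "mbox", "tar", or None. At least 262
    bytes of header are needed to recognise a ustar-style tar file.
    """
    if header.startswith(b"\x1f\x8b"):
        return "gzip"
    if header.startswith(b"BZh"):
        return "bzip2"
    if header.startswith(b"PK\x03\x04"):
        return "zip"
    if header.startswith(b"From "):
        return "mbox"
    if header[257:262] == b"ustar":
        return "tar"
    return None
```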



  was (Author: jkrupan):
One of those VFS links points to all the Java packages used to access the 
list of archive formats I listed. I have personally written unit tests that 
generated most of those formats which Aperture then extracted.

  



[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909036#action_12909036
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

Looks good. This meets my expectations. Any further tweaks that might 
arise would be distinct Jira issues.

 API should be pure RESTful with the API verb represented using the HTTP 
 GET/PUT/POST/DELETE methods
 -

 Key: CONNECTORS-98
 URL: https://issues.apache.org/jira/browse/CONNECTORS-98
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
 Fix For: LCF Release 0.5


 (This was originally a comment on CONNECTORS-56 dated 7/16/2010.)
 It has come to my attention that the API would be more pure RESTful if the 
 API verb was represented using the HTTP GET/PUT/POST/DELETE methods and the 
 input argument identifier represented in the context path.
 So:
   GET outputconnection/get {connection_name: connection_name}
     would be GET outputconnections/connection_name
   GET outputconnection/delete {connection_name: connection_name}
     would be DELETE outputconnections/connection_name
   GET outputconnection/list
     would be GET outputconnections
   PUT outputconnection/save {outputconnection: output_connection_object}
     would be PUT outputconnections/connection_name
     {outputconnection: output_connection_object}
 What we have today is certainly workable, but just not as pure as some 
 might desire. It would be better to take care of this before the initial 
 release so that we never have to answer the question of why it wasn't done as 
 a proper RESTful API.
 BTW, I did check to verify that an HttpServlet running under Jetty can 
 process the DELETE and PUT methods (using the doDelete and doPut method 
 overrides.)
 Also, POST should be usable as an alternative to PUT for API calls that have 
 large volumes of data.
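
The four mappings in the description can be written down as a small translation function. This is an illustrative Python sketch only; `to_restful` and the connection name `solr` in the usage note are hypothetical, and nothing here reflects the actual servlet code.

```python
def to_restful(path, name=None):
    """Translate an old verb-in-path call into the proposed RESTful form.

    `path` is e.g. "outputconnection/get"; `name` is the connection name
    for get, delete, and save, and is unused for list.
    Returns an (http_method, path) tuple.
    """
    resource, verb = path.split("/", 1)
    collection = resource + "s"  # the plural names the collection
    if verb == "list":
        return "GET", collection
    if name is None:
        raise ValueError(f"{verb} needs a connection name")
    method = {"get": "GET", "delete": "DELETE", "save": "PUT"}[verb]
    return method, f"{collection}/{name}"
```

For example, `to_restful("outputconnection/delete", "solr")` returns `("DELETE", "outputconnections/solr")`.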




[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-12 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908581#action_12908581
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

Just to confirm, as requested, that I am comfortable sticking with connection 
name (and job name, etc.) in API paths as opposed to using a more abstract id 
since we seem to have an encoding convention to deal with slash so that an ACF 
object name can always be represented using a single HTTP path segment. Names 
clearly feel more natural and will be easier to use, both for app code using 
the ACF API and for CURL and other scripting tools.







[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-10 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908148#action_12908148
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

I am still pondering this embedded slash issue and checking into some things 
related to it. Maybe Monday I'll have something more concrete to say.

For example, I want to make sure I understand the rules for what a path can 
have in it in a URI and whether simply placing a name at the tail of the path 
means it can have slashes or other reserved characters in it. My model is that 
a name should occupy only a single path component.
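
The two models can be contrasted in a few lines of Python. Both function names are hypothetical; the point is only the difference between "everything after the collection is the name" and "the name is exactly one segment".

```python
def parse_tail(path):
    """Treat everything after the collection as the name (slashes allowed)."""
    collection, _, name = path.partition("/")
    return collection, name


def parse_single_segment(path):
    """Require the name to be exactly one path segment."""
    parts = path.split("/")
    if len(parts) != 2:
        raise ValueError(f"expected collection/name, got {len(parts)} segments")
    return parts[0], parts[1]
```

The tail model accepts `outputconnections/a/b` as the name `a/b`, but then nothing else can follow the name in the path. The single-segment model leaves room for trailing elements, at the cost of requiring slashes in names to be escaped.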





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907614#action_12907614
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

I have looked at the code a bit but not made any actual progress at a patch, so 
you can go ahead and take a crack at it. Yes, I'll do the transformation table. 
As far as updating the wiki, do I have privileges to do that?





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907712#action_12907712
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

Some RESTful resource doc:

http://en.wikipedia.org/wiki/Representational_State_Transfer

http://www.xfront.com/REST-Web-Services.html

http://www.oracle.com/technetwork/articles/javase/table3-138001.html

The idea of using a plural is that it is the name of the collection and the 
qualifier (name or argument object) provides the specificity.





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907735#action_12907735
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

I think status is probably technically okay since it is disambiguated by the 
number of path elements, but it could be moved to the end:

 GET outputconnections/connection_name/status ()

vs.

 GET outputconnections/status/connection_name ()

Same for execute/request:

GET outputconnections/connection_name/request/command (arguments)

vs.

GET outputconnections/request/connection_name/command (arguments)


That way the connection name is always in the same position.

So, I'd revise my counter-proposal that way.
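
A rough Python sketch of how a parser might look once the connection name always sits in the second position. The function name, the returned dict, and the connection name `solr` in the usage note are assumptions for illustration only.

```python
def parse_connection_path(path):
    """Parse paths where the connection name is always the second segment.

    Recognises:
      outputconnections/<name>
      outputconnections/<name>/status
      outputconnections/<name>/request/<command>
    Returns a dict with the connection name, the action, and the command.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "outputconnections":
        raise ValueError(f"unrecognised path: {path!r}")
    result = {"connection": parts[1], "action": None, "command": None}
    rest = parts[2:]
    if rest == ["status"]:
        result["action"] = "status"
    elif len(rest) == 2 and rest[0] == "request":
        result["action"] = "request"
        result["command"] = rest[1]
    elif rest:
        raise ValueError(f"unrecognised path: {path!r}")
    return result
```

One side effect of the fixed position: a connection that happens to be named `status` is still parsed correctly, since `outputconnections/status` can only mean the connection itself.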





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907736#action_12907736
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

re: We could not pass (arguments) except as part of the path.

Sure, we could go that route, and list the arguments as path elements, but I 
think a JSON object (array list of arguments) is acceptable.

So, I'd go with the latter (JSON.)
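
As a sketch of the JSON option, assuming the arguments travel as a plain JSON array in the request body (the function names are hypothetical):

```python
import json


def encode_arguments(arguments):
    """Serialise command arguments as a JSON array for the request body."""
    return json.dumps(list(arguments))


def decode_arguments(body):
    """Parse a JSON array of arguments; reject anything that is not a list."""
    arguments = json.loads(body)
    if not isinstance(arguments, list):
        raise ValueError("arguments must be a JSON array")
    return arguments
```

An argument such as `a/b` needs no special treatment here, whereas it would have to be escaped if arguments were passed as path elements.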





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907758#action_12907758
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

re:  the command cannot itself contain / characters, or it won't be 
uniquely parseable

Elsewhere I noted that URI-reserved characters need to be encoded with the % 
notation, so this is not a fatal problem.


  reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
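
For illustration, here is how the RFC 2396 reserved set looks when percent-encoded with Python's standard library. `encode_segment` is a hypothetical helper name, not part of any API discussed here.

```python
from urllib.parse import quote, unquote

# Reserved characters as listed in RFC 2396, section 2.2.
RESERVED = ";/?:@&=+$,"


def encode_segment(text):
    """Percent-encode reserved (and other unsafe) characters in text."""
    return quote(text, safe="")


# A command containing "/" becomes a single, unambiguous path segment.
encoded_command = encode_segment("run/now")
```

Here `encoded_command` is `run%2Fnow`, and `unquote` restores the original text.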





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-09-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907875#action_12907875
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

It makes sense that getPathInfo would have removed escapes from the URL. So, 
either we don't use % escaping, or we bypass getPathInfo and decode manually.
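
The difference between the two orders of operation can be shown with a small Python simulation. The request path is made up, and `unquote` stands in for whatever decoding the servlet container performs.

```python
from urllib.parse import unquote

raw_path = "/outputconnections/a%2Fb/status"  # hypothetical request path

# Decode first, then split: the escape is already gone, so the name
# "a/b" is indistinguishable from two separate segments.
decoded_first = unquote(raw_path).strip("/").split("/")

# Manual decoding: split the raw path first, then decode each segment.
split_first = [unquote(s) for s in raw_path.strip("/").split("/")]
```

`decoded_first` ends up with four segments, while `split_first` keeps `a/b` together as one name.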

Maybe we could use backslash for escaping. I'm not sure whether it needs to be 
% escaped as well.

This is only needed if the user has one of the reserved special characters in a 
name. It would be an issue if it was something that users commonly needed, but 
it seems like more of an edge case rather than a common case.

Encourage people to use alphanumeric, -, and _ for names and it won't be an 
issue for them.

And, the real point of the API is access from code. We can provide helper 
functions for working with names and building API paths.
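
Such helpers might look roughly like the following Python sketch. The names `is_simple_name` and `api_path` are hypothetical, and percent-encoding is used only as a stand-in, since the escaping scheme is still open in this discussion.

```python
import re
from urllib.parse import quote

SIMPLE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_simple_name(name):
    """True if the name uses only alphanumerics, "-", and "_"."""
    return SIMPLE_NAME.fullmatch(name) is not None


def api_path(collection, name, *extra):
    """Join a collection, an object name, and optional trailing elements.

    Every element is escaped, which changes nothing for simple names.
    """
    return "/".join(quote(part, safe="") for part in (collection, name, *extra))
```

For a simple name the result is just the obvious path, for example `api_path("outputconnections", "my_job-1", "status")` gives `outputconnections/my_job-1/status`.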






[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907201#action_12907201
 ] 

Jack Krupansky commented on CONNECTORS-104:
---

Simple works best. This enhancement is primarily for the simple use case where 
a novice user tries to do what they think is obvious (crawl the web pages at 
this URL), but without considering all of the potential nuances or how to 
fully specify the details of their goal.

One nuance is whether subdomains are considered part of the domain. I would say 
no if a subdomain was specified by the user and yes if no subdomain was 
specified.

Another nuance is whether a path is specified to select a subset of a domain. 
It would be nice to handle that and (optionally) limit the crawl to that path 
(or sub-paths below it). An example would be to crawl the news archive for a 
site.
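
The two nuances above can be sketched as a single scope check. This is only an illustration in Python, not the web connector's code: the function name in_crawl_scope is hypothetical, and the test for "no subdomain specified" is a naive assumption (a host with exactly two labels, such as example.com) that would need a public-suffix list to be correct for hosts like example.co.uk.

```python
from urllib.parse import urlsplit


def in_crawl_scope(seed_url, candidate_url):
    """Return True if candidate_url falls within the scope implied by seed_url."""
    seed = urlsplit(seed_url)
    cand = urlsplit(candidate_url)
    seed_host = (seed.hostname or "").lower()
    cand_host = (cand.hostname or "").lower()

    # Naive assumption: a two-label host ("example.com") means the user did
    # not specify a subdomain, so subdomains are treated as part of the site.
    seed_is_bare_domain = seed_host.count(".") == 1

    if seed_is_bare_domain:
        host_ok = cand_host == seed_host or cand_host.endswith("." + seed_host)
    else:
        # A subdomain was specified, so only that exact host is in scope.
        host_ok = cand_host == seed_host

    # Treat the seed path as a directory prefix so that "/news" admits
    # "/news/2010/x.html" but not "/newsletter".
    seed_path = seed.path or "/"
    if not seed_path.endswith("/"):
        seed_path += "/"
    cand_path = cand.path or "/"
    path_ok = cand_path.startswith(seed_path) or cand_path + "/" == seed_path

    return host_ok and path_ok
```

With these rules a seed of http://example.com/news would admit http://example.com/news/2010/x.html and reject http://example.com/newsletter, which matches the news archive example.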


 Make it easier to limit a web crawl to a single site
 

 Key: CONNECTORS-104
 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: Web connector
Reporter: Jack Krupansky
Priority: Minor





[jira] Created: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-07 Thread Jack Krupansky (JIRA)
Make it easier to limit a web crawl to a single site


 Key: CONNECTORS-104
 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: Web connector
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
Priority: Minor
 Fix For: LCF Release 0.5


Unless the user explicitly enters an include regex carefully, a web crawl can 
quickly get out of control and start crawling the entire web when all the user 
may really want is to crawl just a single web site or portion thereof. So, it 
would be preferable if either by default or with a simple button the crawl 
could be limited to the seed web site(s).
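
As a rough illustration of what a "limit to seed sites" default could generate, the sketch below builds one include regex from the seed URLs. It is written in Python for brevity; the function name seed_include_regex is hypothetical, and it assumes the include rule is matched against the full URL with lower-case host names.

```python
import re
from urllib.parse import urlsplit


def seed_include_regex(seed_urls):
    """Build one include regex that matches only URLs on the seeds' hosts."""
    hosts = sorted({(urlsplit(u).hostname or "").lower() for u in seed_urls})
    hosts = [h for h in hosts if h]
    if not hosts:
        raise ValueError("no usable host found in the seed URLs")
    alternatives = "|".join(re.escape(h) for h in hosts)
    # Scheme, one of the seed hosts, an optional port, then the end of the
    # authority part, so "www.example.com.evil.org" cannot match.
    return r"^https?://(?:%s)(?::\d+)?(?:[/?#]|$)" % alternatives
```

A user who needs something more permissive could still edit the generated expression by hand.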





[jira] Commented: (CONNECTORS-57) Solr output connector option to commit at end of job, by default

2010-08-31 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904746#action_12904746
 ] 

Jack Krupansky commented on CONNECTORS-57:
--

This looks fine so far and should work for me.

If I understand the code, the Connector.noteJobComplete method is called when 
the job completes or is aborted and the SolrConnector.noteJobComplete 
implementation method unconditionally does a commit. That's fine for my use 
case, but we probably still want a connection option to disable that commit if 
the user has some other commit strategy in mind.
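
A minimal sketch of the option I have in mind, in Python rather than the connector's Java. The class name SolrOutputSketch, the commit_at_end flag, and the injected send_update callable are all hypothetical; only the <commit/> update message is Solr's own.

```python
class SolrOutputSketch:
    """Hypothetical stand-in for an output connector; not the LCF class."""

    def __init__(self, send_update, commit_at_end=True):
        # send_update is an injected callable that posts an XML body to
        # Solr's update handler; injecting it keeps the sketch testable.
        self._send_update = send_update
        # Defaults to on, as the issue proposes; the user may turn it off.
        self.commit_at_end = commit_at_end

    def note_job_complete(self):
        """Called when a job completes or is aborted; returns True if committed."""
        if self.commit_at_end:
            self._send_update("<commit/>")
            return True
        return False
```

With commit_at_end turned off, the end of the job is a no-op and the user's own commit strategy applies.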

 Solr output connector option to commit at end of job, by default
 

 Key: CONNECTORS-57
 URL: https://issues.apache.org/jira/browse/CONNECTORS-57
 Project: Apache Connectors Framework
  Issue Type: Sub-task
  Components: Lucene/SOLR connector
Reporter: Jack Krupansky

 By default, Solr will eventually commit documents that have been submitted to 
 the Solr Cell interface, but the time lag can confuse and annoy people. 
 Although commit strategy is a difficult issue in general, an option in LCF to 
 automatically commit at the end of a job, by default, would eliminate a lot 
 of potential confusion and generally be close to what the user needs.
 The desired feature is that there be an option to commit for each job that 
 uses the Solr output connector. This option would default to on (or a 
 different setting based on some global configuration setting), but the user 
 may turn it off if commit is only desired upon completion of some jobs.




[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-08-27 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903559#action_12903559
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

I'll be mostly looking through code and thinking it through and looking at the 
API string changes first, so I may not touch any code for another week, if not 
longer. Feel free to rename or refactor code at will. I'll probably let you 
know in advance what changes I expect to make in the code.

 API should be pure RESTful with the API verb represented using the HTTP 
 GET/PUT/POST/DELETE methods
 -

 Key: CONNECTORS-98
 URL: https://issues.apache.org/jira/browse/CONNECTORS-98
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
 Fix For: LCF Release 0.5






[jira] Created: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-08-26 Thread Jack Krupansky (JIRA)
API should be pure RESTful with the API verb represented using the HTTP 
GET/PUT/POST/DELETE methods
-

 Key: CONNECTORS-98
 URL: https://issues.apache.org/jira/browse/CONNECTORS-98
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
 Fix For: LCF Release 0.5


(This was originally a comment on CONNECTORS-56 dated 7/16/2010.)

It has come to my attention that the API would be more pure RESTful if the 
API verb was represented using the HTTP GET/PUT/POST/DELETE methods and the 
input argument identifier represented in the context path.

So, GET outputconnection/get {connection_name:_connection_name_} would 
be GET outputconnections/connection_name

and GET outputconnection/delete {connection_name:_connection_name_} would 
be DELETE outputconnections/connection_name

and GET outputconnection/list would be GET outputconnections

and PUT outputconnection/save 
{outputconnection:_output_connection_object_} would be PUT 
outputconnections/connection_name 
{outputconnection:_output_connection_object_}
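
The mapping above can be stated compactly. The following Python sketch is only an illustration of the proposed naming rule; the function to_restful is hypothetical, and pluralizing by appending "s" is an assumption that happens to fit outputconnection.

```python
from urllib.parse import quote

# Legacy verb -> HTTP method, as proposed in this issue.
_METHODS = {"get": "GET", "list": "GET", "save": "PUT", "delete": "DELETE"}


def to_restful(verb, resource, name=None):
    """Map a legacy (verb, resource, name) call to an (HTTP method, path) pair."""
    if verb not in _METHODS:
        raise ValueError("unsupported verb: %r" % verb)
    collection = resource + "s"  # outputconnection -> outputconnections
    if verb == "list":
        return (_METHODS[verb], collection)
    if not name:
        raise ValueError("verb %r needs a resource name" % verb)
    # The identifier moves from the JSON argument into the context path.
    return (_METHODS[verb], collection + "/" + quote(name, safe=""))
```

For example, to_restful("delete", "outputconnection", "solr1") yields ("DELETE", "outputconnections/solr1").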

What we have today is certainly workable, but just not as pure as some might 
desire. It would be better to take care of this before the initial release so 
that we never have to answer the question of why it wasn't done as a proper 
RESTful API.

BTW, I did check to verify that an HttpServlet running under Jetty can process 
the DELETE and PUT methods (using the doDelete and doPut method overrides.)

Also, POST should be usable as an alternative to PUT for API calls that have 
large volumes of data.
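
To make the intended method semantics concrete, here is an in-memory dispatch sketch in Python. It is not the servlet code and does not use Jetty; the class OutputConnectionStore and the status codes it returns are assumptions for illustration. POST is simply treated the same as PUT, per the note above.

```python
import json


class OutputConnectionStore:
    """In-memory sketch of RESTful dispatch; hypothetical, not LCF code."""

    def __init__(self):
        self._items = {}

    def handle(self, method, path, body=None):
        """Return (status_code, payload) for a request under outputconnections."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "outputconnections" or len(parts) > 2:
            return (404, None)
        if len(parts) == 1:
            # Collection level: only listing is supported.
            if method == "GET":
                return (200, sorted(self._items))
            return (405, None)
        name = parts[1]
        if method == "GET":
            if name in self._items:
                return (200, self._items[name])
            return (404, None)
        if method in ("PUT", "POST"):
            # POST is accepted as an alternative to PUT for saving.
            created = name not in self._items
            self._items[name] = json.loads(body)
            return (201 if created else 200, None)
        if method == "DELETE":
            if name in self._items:
                del self._items[name]
                return (200, None)
            return (404, None)
        return (405, None)
```

In a servlet, each branch on the method would correspond to one of the doGet, doPut, doPost, and doDelete overrides.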





[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-08-26 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902982#action_12902982
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

Karl asks, "What do you plan to do for the list and execute verbs?"

List would be a GET and execute would be PUT.


 API should be pure RESTful with the API verb represented using the HTTP 
 GET/PUT/POST/DELETE methods
 -

 Key: CONNECTORS-98
 URL: https://issues.apache.org/jira/browse/CONNECTORS-98
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
 Fix For: LCF Release 0.5






[jira] Commented: (CONNECTORS-98) API should be pure RESTful with the API verb represented using the HTTP GET/PUT/POST/DELETE methods

2010-08-26 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902983#action_12902983
 ] 

Jack Krupansky commented on CONNECTORS-98:
--

Karl says, "I await your patch."

Point well made. There is a great starting point with the current code. A bit 
of refactoring required.


 API should be pure RESTful with the API verb represented using the HTTP 
 GET/PUT/POST/DELETE methods
 -

 Key: CONNECTORS-98
 URL: https://issues.apache.org/jira/browse/CONNECTORS-98
 Project: Apache Connectors Framework
  Issue Type: Improvement
  Components: API
Affects Versions: LCF Release 0.5
Reporter: Jack Krupansky
 Fix For: LCF Release 0.5






[jira] Commented: (CONNECTORS-56) All features should be accessible through an API

2010-07-14 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888377#action_12888377
 ] 

Jack Krupansky commented on CONNECTORS-56:
--

Some cURL and/or Perl test scripts to illustrate use of the API would be 
helpful.

 All features should be accessible through an API
 

 Key: CONNECTORS-56
 URL: https://issues.apache.org/jira/browse/CONNECTORS-56
 Project: Lucene Connector Framework
  Issue Type: Sub-task
  Components: Framework core
Reporter: Jack Krupansky





[jira] Commented: (CONNECTORS-60) Agent process should be started automatically

2010-07-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888000#action_12888000
 ] 

Jack Krupansky commented on CONNECTORS-60:
--

Unless I am mistaken, the Jetty integration is for QuickStart (single process) 
only. The issue is for non-QuickStart, multi-process execution.


 Agent process should be started automatically
 -

 Key: CONNECTORS-60
 URL: https://issues.apache.org/jira/browse/CONNECTORS-60
 Project: Lucene Connector Framework
  Issue Type: Sub-task
Reporter: Jack Krupansky

 LCF as it exists today is a bit too complex to run for an average user, 
 especially with a separate agent process for crawling. LCF should be as easy 
 to run as Solr is today. QuickStart is a good move in this direction, but the 
 same user-visible simplicity is needed for full LCF. The separate agent 
 process is a reasonable design for execution, but a little too cumbersome for 
 the average user to manage.
 Unfortunately, it is expected that starting up a multi-process application 
 will require platform-specific scripting.
 Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.




[jira] Commented: (CONNECTORS-55) Bundle database server with LCF packaged product

2010-07-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886720#action_12886720
 ] 

Jack Krupansky commented on CONNECTORS-55:
--

Karl notes that "we've had to mess with the stuffer query on pretty near every 
point release of Postgresql." Letting/forcing the user to pick the 
right/acceptable release of PostgreSQL to install is error-prone and a support 
headache. I would argue that it is better for the LCF team to bundle the 
right/best release of PostgreSQL with LCF.

 Bundle database server with LCF packaged product
 

 Key: CONNECTORS-55
 URL: https://issues.apache.org/jira/browse/CONNECTORS-55
 Project: Lucene Connector Framework
  Issue Type: Improvement
  Components: Framework core
Reporter: Jack Krupansky

 The current requirement that the user install and deploy a PostgreSQL server 
 complicates the installation and deployment of LCF for the user. Installation 
 and deployment of LCF should be as simple as Solr itself. QuickStart is great 
 for the low-end and basic evaluation, but a comparable level of simplified 
 installation and deployment is still needed for full-blown, high-end 
 environments that need the full performance of a PostgreSQL-class database 
 server. So, PostgreSQL should be bundled with the packaged release of LCF so 
 that installation and deployment of LCF will automatically install and deploy 
 a subset of the full PostgreSQL distribution that is sufficient for the needs 
 of LCF. Starting LCF, with or without the LCF UI, should automatically start 
 the database server. Shutting down LCF should also shut down the database 
 server process.
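
 The start/stop coupling described above could be sketched as follows. This is 
 Python for illustration only, not a proposed LCF launcher: the function name 
 bundled_database is hypothetical, and it assumes the bundled PostgreSQL subset 
 includes pg_ctl on the PATH and an already initialized data directory.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def bundled_database(data_dir, run=subprocess.check_call):
    """Start the bundled PostgreSQL server and stop it when the block exits."""
    # -w makes pg_ctl wait until the server has finished starting.
    run(["pg_ctl", "start", "-D", data_dir, "-w"])
    try:
        yield
    finally:
        # Fast mode disconnects clients instead of waiting for them to leave.
        run(["pg_ctl", "stop", "-D", data_dir, "-m", "fast"])
```

 The LCF processes would run inside the with block, so the database is shut 
 down even if LCF exits with an error. The run parameter exists only so the 
 sketch can be exercised without a real server.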
 A typical use case would be for a non-developer who is comfortable with Solr 
 and simply wants to crawl documents from, for example, a SharePoint 
 repository and feed them into Solr. QuickStart should work well for the low 
 end or in the early stages of evaluation, but the user would prefer to 
 evaluate the real thing with something resembling a production crawl of 
 thousands of documents. Such a user might not be a hard-core developer or be 
 comfortable fiddling with a lot of software components simply to do one 
 conceptually simple operation.
 It should still be possible for the user to supply database server settings 
 to override the defaults, but the LCF package should have all of the 
 best-practice settings deemed appropriate for use with LCF.
 One downside is that installation and deployment will be platform-specific 
 since there are multiple processes and PostgreSQL itself requires a 
 platform-specific installation.
 This proposal presumes that PostgreSQL is the best option for the foreseeable 
 future, but nothing here is intended to preclude support for other database 
 servers in future releases.
 This proposal should not have any impact on QuickStart packaging or 
 deployment.
 Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.




[jira] Commented: (CONNECTORS-55) Bundle database server with LCF packaged product

2010-07-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886724#action_12886724
 ] 

Jack Krupansky commented on CONNECTORS-55:
--

When Karl says "It *does* limit your ability to use other commands 
simultaneously" (referring to use of embedded Derby), he is referring to 
commands executed using the executecommand shell script, such as registering 
and unregistering connectors, which is something typically done once before 
starting the UI or once in a blue moon when you want to support a new type of 
repository, but not done on as regular a basis as editing connections and jobs 
and running jobs. The Java classes to execute those commands would be, by 
definition, outside of the LCF process.

 Bundle database server with LCF packaged product
 

 Key: CONNECTORS-55
 URL: https://issues.apache.org/jira/browse/CONNECTORS-55
 Project: Lucene Connector Framework
  Issue Type: Improvement
  Components: Framework core
Reporter: Jack Krupansky





[jira] Commented: (CONNECTORS-55) Bundle database server with LCF packaged product

2010-07-08 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886490#action_12886490
 ] 

Jack Krupansky commented on CONNECTORS-55:
--

I was using the term "install" loosely, not so much the way a typical package 
has a GUI wizard and lots of stuff going on, but more in the sense of raw Solr 
where you download, unzip, and files are in subdirectories right where they 
need to be. In that sense, the theory is that a subset of PostgreSQL could be 
in a subdirectory.

Some enterprising vendor, such as Lucid Imagination, might want to have a fancy 
GUI install, but that would be beyond the scope of what I intended here.


 Bundle database server with LCF packaged product
 

 Key: CONNECTORS-55
 URL: https://issues.apache.org/jira/browse/CONNECTORS-55
 Project: Lucene Connector Framework
  Issue Type: Improvement
  Components: Framework core
Reporter: Jack Krupansky





[jira] Created: (CONNECTORS-56) All features should be accessible through an API

2010-07-08 Thread Jack Krupansky (JIRA)
All features should be accessible through an API


 Key: CONNECTORS-56
 URL: https://issues.apache.org/jira/browse/CONNECTORS-56
 Project: Lucene Connector Framework
  Issue Type: Improvement
  Components: Framework core
Reporter: Jack Krupansky


LCF consists of a full-featured crawling engine and a full-featured user 
interface to access the features of that engine, but some applications are 
better served with a full API that lets the application control the crawling 
engine, including creation and editing of connections and creation, editing, 
and control of jobs. Put simply, everything that a user can accomplish via the 
LCF UI should be doable through an LCF API. All LCF objects should be queryable 
through the API.

A primary use case is Solr applications which currently use Aperture for 
crawling, but would prefer the full-featured capabilities of LCF as a crawling 
engine over Aperture.

I do not wish to over-specify the API in this initial description, but I think 
the LCF API should probably be a traditional REST API, with some of the API 
elements specified via the context path, some parameters via URL query 
parameters, and complex, detailed structures as JSON (or similar). The precise 
details of the API are beyond the scope of this initial description and will be 
added incrementally once the high-level approach to the API becomes reasonably 
settled.
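
Purely to illustrate the three places a request could carry information (context path, query parameters, JSON body), here is a Python sketch that builds such a request without sending it. Every specific in it is hypothetical: the function build_api_request and any base URL, resource names, or parameter names passed to it are placeholders, since the actual API is not yet specified.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_api_request(base_url, method, path_parts, query=None, payload=None):
    """Build (but do not send) a request in the style described above."""
    # API elements such as the object type and its name go in the context path.
    path = "/".join(quote(str(p), safe="") for p in path_parts)
    url = base_url.rstrip("/") + "/" + path
    # Simple parameters go in the URL query string.
    if query:
        url += "?" + urlencode(query)
    # Complex, detailed structures travel as a JSON body.
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)
```

A caller would pass the result to urllib.request.urlopen once a real endpoint exists.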

A job status and event reporting scheme is also needed in conjunction with the 
LCF API. That requirement has already been captured as CONNECTORS-41.

The intention for the API is to create, edit, access, and control all of the 
objects managed by LCF. The main focus is on repositories, jobs, and status, 
and less about document-specific crawling information, but there may be some 
benefit to querying crawling status for individual documents as well.

Nothing in this proposal should in any way limit or constrain the features that 
will be available in the LCF UI. The intent is that LCF should continue to have 
a full-featured UI in addition to a full-featured API.

Note: This issue is part of Phase 2 of the CONNECTORS-50 umbrella issue.





[jira] Created: (CONNECTORS-50) Proposal for initial two releases of LCF, including packaged product and full API

2010-06-30 Thread Jack Krupansky (JIRA)
Proposal for initial two releases of LCF, including packaged product and full 
API
-

 Key: CONNECTORS-50
 URL: https://issues.apache.org/jira/browse/CONNECTORS-50
 Project: Lucene Connector Framework
  Issue Type: New Feature
  Components: Framework core
Reporter: Jack Krupansky


Currently, LCF has a relatively high bar for evaluation and use, requiring 
developer expertise. Also, although LCF has a comprehensive UI, it is not 
currently packaged for use as a crawling engine for advanced applications.

A small set of individual feature requests are needed to address these issues. 
They are summarized briefly to show how they fit together for two initial 
releases of LCF, but will be broken out into individual LCF Jira issues.

Goals:

1. LCF as a standalone, downloadable, usable-out-of-the-box product (much as 
Solr is today)
2. LCF as a toolkit for developers needing customized crawling and repository 
access
3. An API-based crawling engine that can be integrated with applications (as 
Aperture is today)

Larger goals:

1. Make it very easy for users to evaluate LCF.
2. Make it very easy for developers to customize LCF.
3. Make it very easy for applications to fully manage and control LCF in 
operation.

Two phases:

1) Standalone, packaged app that is super-easy to evaluate and deploy. Call it 
LCF 0.5.
2) API-based crawling engine for applications for which the UI might not be 
appropriate. Call it LCF 1.0.


Phase 1
---

LCF 0.5 right out of the box would interface loosely with Solr 1.4 or later.
It would contain roughly the features that are currently in place or currently 
underway, plus a little more.

Specifically, LCF 0.5 would contain these additional capabilities:

1. Plug-in architecture for connectors (already underway)
2. Packaged app ready to run with embedded Jetty app server (I think this has 
been agreed to)
3. Bundled with database - PostgreSQL or Derby - ready to run without 
additional manual setup
4. Mini-API to initially configure default connections and example jobs for 
file system and web crawl
5. Agent process started automatically (platform-specific startup required)
6. Solr output connector option to commit at end of job, by default

Installation and basic evaluation of LCF would be essentially as simple as Solr 
is today. The example connections and jobs would permit the user to initiate 
example crawls of a file system example directory and an example web on the LCF 
web site with just a couple of clicks (as opposed to the detailed manual setup 
required today to create repository and output connections and jobs).

It is worth considering whether the SharePoint connector could also be included 
as part of the default package.

Users could then add additional connectors and repositories and jobs as desired.

Timeframe for release? Level of effort?

Phase 2
---

The essence of Phase 2 is that LCF would be split to allow direct, full API 
access to LCF as a crawling engine, in addition to the full LCF UI. Call this 
LCF 1.0.

Specifically, LCF 1.0 would contain these additional capabilities:

1. Full API for LCF as a crawling engine
2. LCF can be bundled within an app (such as the default LCF package itself 
with its UI)
3. LCF event and activity notification for full control by an application 
(already a Jira request)
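To make item 3 a little more concrete, here is a minimal sketch of how an 
application might subscribe to crawl events. Everything in it is an 
assumption for discussion: the CrawlEvents class, the event names 
(job_started, document_ingested, job_finished), and the run_job function are 
hypothetical and do not describe any existing LCF interface.

```python
class CrawlEvents:
    """Hypothetical event hub; not part of any existing LCF API."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, event_name, callback):
        """Register callback(event_name, details) for one event type."""
        self._listeners.setdefault(event_name, []).append(callback)

    def publish(self, event_name, **details):
        """Notify every listener registered for this event type."""
        for callback in self._listeners.get(event_name, []):
            callback(event_name, details)


def run_job(job_name, document_uris, events):
    """Stand-in for a crawl: only emits the notifications, does no I/O."""
    events.publish("job_started", job=job_name)
    for uri in document_uris:
        events.publish("document_ingested", job=job_name, uri=uri)
    events.publish("job_finished", job=job_name, count=len(document_uris))
```

An embedding application would subscribe before starting a job and could use 
the job_finished notification to trigger follow-up work of its own.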

Overall, LCF 1.0 will offer roughly the same crawling capabilities as LCF 0.5, 
plus whatever bug fixes and minor enhancements might also be added.

Timeframe for release? Level of effort?

-

Issues:

- Can we package PostgreSQL with LCF so LCF can set it up?
  - Or do we need Derby for that purpose?
- Managing multiple processes (UI, database, agent, app processes)
- What exactly would the API look like? (URL, XML, JSON, YAML?)
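To help frame the last question, the sketch below renders one and the same 
job description as JSON and as XML, and derives a URL-style resource path for 
it. The field names, the /api/jobs/ path, and the helper functions are all 
hypothetical, chosen only to compare the formats; none of them is an agreed 
LCF design.

```python
import json
import urllib.parse
import xml.etree.ElementTree as ET

# Hypothetical job description; the field names are illustrative only.
EXAMPLE_JOB = {
    "name": "example files",
    "repository_connection": "filesystem",
    "output_connection": "solr",
    "start": "manual",
}


def job_to_json(job):
    """Serialize the job as a JSON object with stable key order."""
    return json.dumps(job, sort_keys=True)


def job_to_xml(job):
    """Serialize the same job as a flat <job> element with child elements."""
    root = ET.Element("job")
    for key in sorted(job):
        ET.SubElement(root, key).text = str(job[key])
    return ET.tostring(root, encoding="unicode")


def job_resource_path(job):
    """URL-style addressing: one resource per job (path is an assumption)."""
    return "/api/jobs/" + urllib.parse.quote(job["name"], safe="")
```

Both serializations carry the same information here; the formats would start 
to differ once job specifications become deeply nested.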





[jira] Commented: (CONNECTORS-37) LCF should use an XML configuration file, not the simple name/value config file it currently has

2010-06-01 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874029#action_12874029
 ] 

Jack Krupansky commented on CONNECTORS-37:
--

I'll defer to the community on the logging issue, other than to simply say that 
it should be as standard as possible and relatively compatible with how Solr 
does logging so that it will not surprise people.

I don't have a problem with the LCF .properties file per se, other than that, 
being restricted to strict keyword/value pairs, it cannot contain more 
complex, structured configuration information.

The main thing I'd like to see is that the current executecommand 
configuration setup, such as which output connectors and crawlers to register, 
be done using descriptions in a config file rather than discrete shell commands 
that must be executed manually. The default config file from svn checkout 
should have a default set of connectors, crawlers, etc., and have commented-out 
entries for other connectors that people can un-comment and edit as desired.
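As a sketch of what such a file might look like, the example below embeds a 
small XML config with two registered connectors and one commented-out entry, 
and reads back the list of connectors to register. The element names, 
attribute names, and org.example class names are hypothetical, not LCF's 
actual format or classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout; element and class names are illustrative only.
EXAMPLE_CONFIG = """\
<lcf-config>
  <output-connector name="Solr" class="org.example.SolrOutputConnector"/>
  <repository-connector name="File system"
                        class="org.example.FileSystemConnector"/>
  <!-- Un-comment and edit to enable:
  <repository-connector name="SharePoint"
                        class="org.example.SharePointConnector"/>
  -->
</lcf-config>
"""


def connectors_to_register(config_text):
    """Return (kind, name, class) for every active connector entry.

    The default ElementTree parser drops XML comments, so commented-out
    entries are skipped without any extra handling.
    """
    root = ET.fromstring(config_text)
    return [
        (element.tag, element.get("name"), element.get("class"))
        for element in root
        if element.tag.endswith("-connector")
    ]
```

A startup routine could loop over this list and perform the registrations 
that are typed as separate commands today.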

A key advantage of having such a config file is that when people do report 
problems here we can ask them to provide their config file rather than asking 
them to recall and re-type whatever commands they believe they ran.

Whether connections and jobs can be initially created from a config file is a 
larger discussion. The main point here is simply that it be easy to get LCF 
initialized and configured for the really basic stuff needed for a typical 
initial evaluation (comparable to what occurs in a Solr tutorial). The 
proverbial zero-hour experience.


 LCF should use an XML configuration file, not the simple name/value config 
 file it currently has
 

 Key: CONNECTORS-37
 URL: https://issues.apache.org/jira/browse/CONNECTORS-37
 Project: Lucene Connector Framework
  Issue Type: Improvement
  Components: Framework core
Reporter: Karl Wright

 LCF's configuration file is limited in what it can specify, and XML 
 configuration files seem to offer more flexibility and are the modern norm.  
 Before backwards compatibility becomes an issue, it may therefore be worth 
 converting the property file reader to use XML rather than name/value format. 
  It would also be nice to be able to fold the logging configuration into the 
 same file, if this seems possible.
