Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
The current LCF standard deployment model requires a number of moving parts, 
which are probably necessary in some cases, but simply introduce complexity in 
others.  It has occurred to me that it may be possible to provide an alternate 
deployment model involving Jetty, which would reduce the number of moving parts 
by one (by eliminating Tomcat).  A simple LCF deployment could then, in 
principle, look pretty much like Solr's.

In order for this to work, the following has to be true:

(1) Jetty's basic JSP support must be comparable to Tomcat's.
(2) the class loader that Jetty uses for webapps must provide class isolation 
similar to Tomcat's.  If this condition is not met, we'd need to build both a 
Tomcat and a Jetty version of each webapp.

The overall set of changes that would be required would be the following:
(a) An alternative start entry point would need to be coded, which would 
start Jetty running the lcf-crawler-ui and lcf-authority-service webapps  
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the database, and 
should also autoregister all connectors.  This will require a list, somewhere, 
of the connectors and authorities that are included, and their preferred UI 
names for that installation.  This could come from the configuration 
information, or from some other place.  Any ideas?
(c) There would need to be an additional jar produced by the build process, 
which would be the equivalent of the Solr start.jar, so as to make running the 
whole stack trivial.
(d) An LCF API web application, which provides access to all of the current 
LCF commands, would also be an obvious requirement to go forward with this 
model.
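Item (b) asks where the list of bundled connectors could live. One low-tech possibility, sketched below under the assumption of a simple properties-style file (the file format, UI names, and class names are all invented for illustration, not anything LCF defines today), is a name-to-class map loaded at startup:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Sketch: parse a hypothetical connectors list mapping preferred UI
// names to connector implementation class names. The file format and
// class names below are illustrative, not part of LCF today.
public class ConnectorList {

    /** Parse "UI Name: fully.qualified.ClassName" lines. */
    public static Map<String, String> parse(String text) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(text));
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : props.stringPropertyNames()) {
            result.put(name, props.getProperty(name).trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "FileSystem: org.example.lcf.FileConnector\n" +
            "WebCrawler: org.example.lcf.WebConnector\n";
        // At startup, each entry would be registered with the database,
        // replacing today's manual registration commands.
        parse(sample).forEach((ui, cls) ->
            System.out.println("register " + cls + " as \"" + ui + "\""));
    }
}
```

The alternative entry point would walk this map and perform the same registration the current command-line scripts do.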

What are the disadvantages?  Well, I think that the main problem would be 
security.  This deployment model, though simple, does not control access to LCF 
in any way.  You'd need to introduce another moving part to do that.

Bear in mind that this change would still not allow LCF to run using only one 
process.  There are still separate RMI-based processes needed for some 
connectors (Documentum and FileNet).  Although these could in theory be started 
up using Java Activation, a main reason for a separate process in Documentum's 
case is that DFC randomly crashes the JVM under which it runs, and thus needs 
to be independently restarted if and when it dies.  If anyone has experience 
with Java Activation and wants to contribute their time to develop 
infrastructure that can deal with that problem, please let me know.
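The restart requirement can be expressed as a small supervision loop. The sketch below is illustrative, not LCF code: the worker stands in for a child JVM, and a real watchdog would wrap ProcessBuilder and wait for the child process to exit.

```java
import java.util.concurrent.Callable;

// Sketch of the restart-on-crash supervision loop a Documentum-style
// registry process needs. The worker's call() blocks until the
// "process" dies; a crash surfaces as an exception. A real version
// would wrap ProcessBuilder/Process.waitFor on a child JVM.
public class Watchdog {

    /** Run the worker repeatedly, restarting after each death, up to maxRuns. */
    public static int supervise(Callable<Void> worker, int maxRuns) {
        int runs = 0;
        while (runs < maxRuns) {   // bounded here; a real watchdog loops forever
            runs++;
            try {
                worker.call();     // blocks until the worker dies
            } catch (Exception crash) {
                // e.g. DFC took the JVM down: fall through and restart
            }
        }
        return runs;
    }
}
```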

Finally, there is no way around the fact that LCF requires a well-performing 
database, which constitutes an independent moving part of its own.  This 
proposal does nothing to change that at all.

Please note that I'm not proposing that the current model go away, but rather 
that we support both.

Thoughts?
Karl


Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
(b) The alternative starting point should probably autocreate the database, 
and should also autoregister all connectors.  This will require a list, 
somewhere, of the connectors and authorities that are included, and their 
preferred UI names for that installation.  This could come from the 
configuration information, or from some other place.  Any ideas?


I would like to see two things: 1) A way to request LCF to dump all 
configuration parameters, including parameters for all output connections, 
repositories,  jobs, et al to an LCF config file, and 2) The ability to 
start from scratch with a fresh deployment of LCF and feed it that config 
file to then create all of the output connections, repository connections, 
and jobs to match the LCF configuration state desired.
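This dump/restore round trip can be sketched using the JDK's XML property format as a stand-in (the keys below are invented; a real LCF config schema would be richer than a flat key/value list):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Properties;

// Sketch of dump-and-restore via XML, using java.util.Properties'
// built-in XML form as a stand-in for a real LCF config schema.
// Keys and values here are illustrative, not an actual LCF format.
public class ConfigRoundTrip {

    /** Dump a flat set of connection parameters to XML bytes. */
    public static byte[] dump(Properties config) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        config.storeToXML(out, "LCF connection definitions (sketch)");
        return out.toByteArray();
    }

    /** Restore the same parameters from the XML produced by dump(). */
    public static Properties restore(byte[] xml) throws Exception {
        Properties config = new Properties();
        config.loadFromXML(new ByteArrayInputStream(xml));
        return config;
    }
}
```

A fresh deployment would call restore() on a dumped file and then create the corresponding output connections, repository connections, and jobs.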


Now, whether that config file is simple XML ala solrconfig.xml can be a 
matter for debate. Whether it is a separate file from the current config 
file can also be a matter for debate.


But, in short, the answer to your question would be that there would be an 
LCF config file (not just the simple keyword/value file that LCF has for 
global configuration settings) to seed the initial output connections, 
repository connections, et al.


Maybe this config file is a little closer to the Solr schema file. I think 
it feels that way. OTOH, the list of registered connectors, as opposed to 
the user-created connections that use those connectors, seems more like Solr 
request handlers that are in solrconfig.xml, so maybe the initial 
configuration would be split into two separate files as in Solr. Or, 
maybe, the Solr guys have a better proposal for how they would have managed 
that split in Solr if they had it to do all over again. My preference would 
be one file for the whole configuration.


Another advantage of such a config file is that it is easier for people to 
post problem reports that show exactly how they set up LCF.


-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Friday, May 28, 2010 5:48 AM
To: connectors-dev@incubator.apache.org
Subject: Proposal for simple LCF deployment model


Re: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
You forget that building LCF in its entirety requires that you supply 
proprietary client components from third-party vendors.  So I think it is 
unrealistic to expect canned builds that contain everything that you just 
deploy.  For LCF, I think the build cycle will thus be very common.

Getting rid of the database requirement is also obviously not an option.

Karl

--- original message ---
From: ext Jack Krupansky jack.krupan...@lucidimagination.com
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:42:17  AM


A simple deployment ala Solr is a good goal. Integrating Jetty with the LCF
deployment will go a long way towards that goal. The database software
deployment (PostgreSQL) is the other half of the hassle with deploying LCF.

I think there are three distinct goals here: 1) A super-easy Solr-style
deployment for initial evaluation of LCF, 2) deployment of the LCF
components for full-blown application development where app server and
database might need to be different from the initial evaluation, and 3)
deployment of LCF components for production deployment of the full
application.

Right now, evaluation of LCF requires deployment of the source code and
building artifacts - Solr evaluation does not require that step. Eliminating
the source and build step will certainly help simplify the evaluation
process.

Another possible consideration is that although some of us are especially
interested in integration with Solr and doing so easily and robustly, Solr
is just one of the output connections and LCF could be deployed for
applications that do not involve Solr at all. So, maybe there should be an
extra deployment wiki page for Solr guys that focuses on use of LCF with
Solr and related issues. Whether that should be the default presentation in
the doc is a matter for debate. Right now, I see no harm with a Solr bias.
At least it is a convenient way to demonstrate end-to-end use of LCF.

-- Jack Krupansky


Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
But for a basic, early-evaluation test drive, just the file system and 
web repository connectors should be sufficient. And if there is a clean 
database abstraction, a basic database package (e.g., Derby) should be 
sufficient for such a basic evaluation.


Are there technical reasons why third-party repository connectors cannot be 
supported using a Solr-style plug-in approach? Or, worst case, as separate 
processes with a clean inter-process API? Maybe not in the near-term, but as 
a longer-term vision.


-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Friday, May 28, 2010 11:10 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model


RE: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
I've been fighting with Derby for two days.  It's missing a significant amount 
of important functionality, and its user and database model are radically 
different from all other databases I know of.  (I'm also getting nonsense 
exceptions from it, but that's another matter.)  So regardless of how good the 
database abstraction layer is, expecting all databases to have sufficient 
functionality to get anything done is ridiculous.  If I get Derby working, I 
will let you know whether it is feasible at all to run LCF on it under any 
circumstances, but that *cannot* be the primary database people use with 
this project.  I'm also still waiting for a use case from you as to how getting 
rid of the PostgreSQL database makes your life easier at all - and if your use 
case involves using Derby for anything serious, I'll have to say that I don't 
think that's realistic.

LCF has a very clean connector abstraction today.  So all we're really talking 
about is the build process here - whether it is possible to separate the build 
and deployment of the framework and some connectors from the builds of other 
connectors.  Having each connector run as a separate process seems like 
overkill and would also impact performance pretty dramatically, as well as 
requiring quite a bit of additional configuration.  The Solr plug-in model is 
a bit better and requires only the addition of a custom classloader that 
explicitly loads any plugin classes and any classes that those use.  The 
required system-property defines that some libraries need would still have to 
be dealt with, but that needs doing anyway and I think I can have individual 
connectors set these as needed.
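The custom classloader described here can be sketched as a child-first URLClassLoader (illustrative only; LCF's actual loader, package layout, and delegation rules would differ):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch of a plug-in classloader: child-first, so connector classes
// (and the classes they use) are loaded from the plug-in's own jars
// before consulting the parent. This mirrors Solr's lib-directory
// approach; the details here are illustrative, not LCF's design.
public class ConnectorClassLoader extends URLClassLoader {

    public ConnectorClassLoader(URL[] connectorJars, ClassLoader parent) {
        super(connectorJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null && !name.startsWith("java.")) {
                try {
                    c = findClass(name);   // child-first: try the connector jars
                } catch (ClassNotFoundException ignored) {
                    // not in the plug-in: fall through to the parent
                }
            }
            if (c == null) {
                c = super.loadClass(name, false);  // normal parent delegation
            }
            if (resolve) resolveClass(c);
            return c;
        }
    }
}
```

Each connector would get its own instance of such a loader, which is what gives the Tomcat-style isolation that condition (2) of the original proposal asks for.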

Karl



-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Friday, May 28, 2010 1:49 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model


Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky

The use cases I was considering for database issues are:

1) Desire for a very simple evaluation install process; see the Solr 
tutorial.
2) Desire for a less complex and faster application deployment install 
process; PostgreSQL has a reputation for having a large footprint.


Now, as machines and software evolve, it is not completely clear to me how 
bad PostgreSQL is these days, but having a separate deployment step to 
accommodate PostgreSQL interferes with use case #1.


That said, I am not sure that I would hold up getting the first official 
release of LCF out the door. After all, leading-edge (bleeding-edge) users 
are used to more than a little inconvenience. Still, a Solr-simple 
evaluation install would be... sweet.


-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Friday, May 28, 2010 2:17 PM
To: connectors-dev@incubator.apache.org
Subject: RE: Proposal for simple LCF deployment model


Re: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
Dump and restore functionality already exists, but the format is not XML.

Providing an XML dump and restore is straightforward.  Making such a file 
operate like a true config file is not.

This, by the way, has nothing to do with registering connectors, which is a 
database initialization operation.

Karl

--- original message ---
From: ext Jack Krupansky jack.krupan...@lucidimagination.com
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:33:34  AM


 (b) The alternative starting point should probably autocreate the
 database,
 and should also autoregister all connectors.  This will require a list,
 somewhere,
 of the connectors and authorities that are included, and their preferred
 UI
 names for that installation.  This could come from the configuration
 information, or from some other place.  Any ideas?

I would like to see two things: 1) A way to request LCF to dump all
configuration parameters, including parameters for all output connections,
repositories,  jobs, et al to an LCF config file, and 2) The ability to
start from scratch with a fresh deployment of LCF and feed it that config
file to then create all of the output connections, repository connections,
and jobs to match the LCF configuration state desired.

Now, whether that config file is simple XML ala solrconfig.xml can be a
matter for debate. Whether it is a separate file from the current config
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an
LCF config file (not just the simple keyword/value file that LCF has for
global configuration settings) to see the initial output connections,
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think
it feels that way. OTOH, the list of registered connectors, as opposed to
the user-created connections that use those connectors, seems more like Solr
request handlers that are in solrconfig.xml, so maybe the initial
configuration would be split into two separate files as in Solr. Or,
maybe, the Solr guys have a better proposal for how they would have managed
that split in Solr if they had it to do all over again. My preference would
be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Friday, May 28, 2010 5:48 AM
To: connectors-dev@incubator.apache.org
Subject: Proposal for simple LCF deployment model

 The current LCF standard deployment model requires a number of moving
 parts, which are probably necessary in some cases, but simply introduce
 complexity in others.  It has occurred to me that it may be possible to
 provide an alternate deployment model involving Jetty, which would reduce
 the number of moving parts by one (by eliminating Tomcat).  A simple LCF
 deployment could then, in principle, look pretty much like Solr's.

 In order for this to work, the following has to be true:

 (1) jetty's basic JSP support must be comparable to Tomcat's.
 (2) the class loader that jetty uses for webapp's must provide class
 isolation similar to Tomcat's.  If this condition is not met, we'd need to
 build both a Tomcat and a Jetty version of each webapp.

 The overall set of changes that would be required would be the following:
 (a) An alternative start entry point would need to be coded, which would
 start Jetty running the lcf-crawler-ui and lcf-authority-service webapps
 before bringing up the agents engine.
 (b) The alternative starting point should probably autocreate the
 database, and should also autoregister all connectors.  This will require
 a list, somewhere, of the connectors and authorities that are included,
 and their preferred UI names for that installation.  This could come from
 the configuration information, or from some other place.  Any ideas?
 (c) There would need to be an additional jar produced by the build process,
 which would be the equivalent of the Solr start.jar, so as to make running
 the whole stack trivial.
 (d) An LCF API web application, which provides access to all of the
 current LCF commands, would also be an obvious requirement to go forward
 with this model.

 What are the disadvantages?  Well, I think that the main problem would be
 security.  This deployment model, though simple, does not control access
 to LCF in any way.  You'd need to introduce another moving part to do
 that.

 Bear in mind that this change would still not allow LCF to run using only
 one process.  There are still separate RMI-based processes needed for some
 connectors (Documentum and FileNet).  Although these could in theory be
 started up using Java Activation, a main reason for a separate process in
 Documentum's case is that DFC randomly crashes the JVM under which it
 runs, and thus needs to be independently restarted if 

RE: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
I already posted a response to this, but since it didn't seem to appear, I'm
going to try again.

LCF already has dump and restore commands, but they don't currently write
XML; they write binary data.  Providing a way to write and read XML would be
relatively straightforward.  But this is *not* the same thing as defining
everything in a global configuration file.  LCF's connection, authority, and
job definitions belong in the database.
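To make that concrete, an XML rendering of the dump might look roughly like
the sketch below.  No XML format has actually been defined, so every element,
attribute, and class name here is invented for illustration; only the kinds
of objects involved (output connections, repository connections, jobs) come
from the discussion itself:

```xml
<!-- Hypothetical XML dump format; all names are invented for illustration -->
<lcf-dump>
  <outputconnection name="MySolr">
    <classname>org.apache.lcf.agents.output.solr.SolrConnector</classname>
    <parameter name="serverlocation" value="http://localhost:8983/solr/"/>
  </outputconnection>
  <repositoryconnection name="LocalFiles">
    <classname>org.apache.lcf.crawler.connectors.filesystem.FileConnector</classname>
  </repositoryconnection>
  <job name="Crawl local files"
       repositoryconnection="LocalFiles"
       outputconnection="MySolr"/>
</lcf-dump>
```

Restore would then read the same document back and recreate each object in
the database, which is exactly why it behaves as a snapshot rather than as a
live config file.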

Another proposal that would be much more Solr-like would be to allow you to
define such things through a servlet API.  This is the approach I'd thought
would work best for most people.
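As a rough sketch of the shape such a servlet API could take (no such servlet
exists yet, and every path and operation below is invented for illustration):

```
POST /lcf-api/outputconnections        create or update an output connection
POST /lcf-api/repositoryconnections    create or update a repository connection
POST /lcf-api/authorityconnections     create or update an authority connection
POST /lcf-api/jobs                     create or update a job definition
GET  /lcf-api/jobs                     list job definitions and statuses
```

The appeal of this shape is that the database remains the single source of
truth, while scripts and tools can still drive the full setup remotely.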

Note that this is a very different question from that of registering
connectors and authorities.  The latter operation is more akin to database
initialization, and would be done in lieu of the current series of connector
registration commands that you must run to install connectors into LCF.  It
may even be that the proper answer is still not to do this step at all in the
quick start, although I personally think the ideal would be some kind of
automatic registration of all connectors and authorities that had been built
during the last build step.
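If that registration list did come from configuration information, as the
original proposal suggested, it might look something like the following
sketch.  The element names, attribute names, and connector class names here
are all invented for illustration; no such file exists:

```xml
<!-- Hypothetical connector registration list; all names are illustrative -->
<connectors>
  <outputconnector uiname="Solr"
      class="org.apache.lcf.agents.output.solr.SolrConnector"/>
  <repositoryconnector uiname="Filesystem"
      class="org.apache.lcf.crawler.connectors.filesystem.FileConnector"/>
  <authorityconnector uiname="Active Directory"
      class="org.apache.lcf.authorities.activedirectory.ActiveDirectoryAuthority"/>
</connectors>
```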

Given this analysis, can you clarify your request?  I'd also like to see use
cases, because without them we're just shooting the breeze.

Karl

-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Friday, May 28, 2010 10:33 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model

 (b) The alternative starting point should probably autocreate the
 database, and should also autoregister all connectors.  This will require
 a list, somewhere, of the connectors and authorities that are included,
 and their preferred UI names for that installation.  This could come from
 the configuration information, or from some other place.  Any ideas?

I would like to see two things: 1) a way to request LCF to dump all
configuration parameters, including parameters for all output connections,
repositories, jobs, et al., to an LCF config file, and 2) the ability to
start from scratch with a fresh deployment of LCF and feed it that config
file to then create all of the output connections, repository connections,
and jobs to match the desired LCF configuration state.

Now, whether that config file is simple XML à la solrconfig.xml can be a
matter for debate. Whether it is a separate file from the current config
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an
LCF config file (not just the simple keyword/value file that LCF has for
global configuration settings) to seed the initial output connections,
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think 
it feels that way. OTOH, the list of registered connectors, as opposed to 
the user-created connections that use those connectors, seems more like Solr 
request handlers that are in solrconfig.xml, so maybe the initial 
configuration would be split into two separate files as in Solr. Or, 
maybe, the Solr guys have a better proposal for how they would have managed 
that split in Solr if they had it to do all over again. My preference would 
be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to 
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky


Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
I meant the lcf.agents.RegisterOutput org.apache.lcf.agents.output.* and
lcf.crawler.Register org.apache.lcf.crawler.connectors.* types of operations
that are currently executed as standalone commands, as well as the
connections created using the UI. So, you would have config file entries for
both the registration of connector classes and the definition of the
actual connections, in some new form of config file. Sure, the connector
registration initializes the database, but it is all part of the collection
of operations that somebody has to perform to go from scratch to an LCF
configuration that is ready to start a crawl. Better to have one config file
(or two or three if necessary) that encompasses the entire configuration
setup rather than separate manual steps.
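To make those two levels concrete, a single hypothetical config file covering
both might look like the sketch below.  Every element, attribute, and class
name here is invented for illustration; nothing of this shape exists in LCF
today:

```xml
<!-- Hypothetical single config file; all names invented for illustration -->
<lcf-config>
  <!-- Level 1: connector class registration (today's standalone commands) -->
  <registrations>
    <outputconnector uiname="Solr"
        class="org.apache.lcf.agents.output.solr.SolrConnector"/>
    <repositoryconnector uiname="Filesystem"
        class="org.apache.lcf.crawler.connectors.filesystem.FileConnector"/>
  </registrations>
  <!-- Level 2: user-created connections and jobs (today created via the UI) -->
  <connections>
    <outputconnection name="MySolr" connector="Solr">
      <parameter name="serverlocation" value="http://localhost:8983/solr/"/>
    </outputconnection>
    <repositoryconnection name="LocalFiles" connector="Filesystem"/>
  </connections>
  <jobs>
    <job name="Crawl local files"
         repositoryconnection="LocalFiles"
         outputconnection="MySolr"/>
  </jobs>
</lcf-config>
```

Feeding such a file to a fresh deployment would cover the whole path from
scratch to a crawl-ready configuration in one step.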


Whether it is high enough priority for the first release is a matter for 
debate.


-- Jack Krupansky

--
From: karl.wri...@nokia.com
Sent: Friday, May 28, 2010 11:16 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model


Dump and restore functionality already exists, but the format is not XML.

Providing an XML dump and restore is straightforward.  Making such a file
operate like a true config file is not.


This, by the way, has nothing to do with registering connectors, which is
a database initialization operation.


Karl
