I meant the lcf.agents.RegisterOutput (org.apache.lcf.agents.output.*) and lcf.crawler.Register (org.apache.lcf.crawler.connectors.*) kinds of operations that are currently executed as standalone commands, as well as the connections created using the UI. So, you would have config file entries both for the registration of connector "classes" and for the definition of the actual connections, in some new form of "config" file. Sure, the connector registration initializes the database, but it is all part of the collection of operations that somebody has to perform to go from scratch to an LCF configuration that is ready to "Start" a crawl. Better to have one "config" file (or two or three if necessary) that encompasses the entire configuration setup than separate manual steps.
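
To make the idea concrete, a single hypothetical config file covering both kinds of entries might look something like this. This is only a sketch; every element and attribute name is invented for illustration, and the connector class names are just examples in the style of the existing packages:

```xml
<!-- Hypothetical unified LCF configuration; all names invented -->
<lcf-configuration>

  <!-- Connector "class" registration, replacing the standalone
       lcf.crawler.Register / lcf.agents.RegisterOutput commands -->
  <repositoryconnector name="File System"
      class="org.apache.lcf.crawler.connectors.filesystem.FileConnector"/>
  <outputconnector name="Solr"
      class="org.apache.lcf.agents.output.solr.SolrConnector"/>

  <!-- Actual connections, normally created through the UI -->
  <repositoryconnection name="Local Docs" connector="File System">
    <param name="rootpath" value="/var/docs"/>
  </repositoryconnection>

</lcf-configuration>
```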

Whether it is high enough priority for the first release is a matter for debate.

-- Jack Krupansky

--------------------------------------------------
From: <karl.wri...@nokia.com>
Sent: Friday, May 28, 2010 11:16 AM
To: <connectors-dev@incubator.apache.org>
Subject: Re: Proposal for simple LCF deployment model

Dump and restore functionality already exists, but the format is not XML.

Providing an XML dump and restore is straightforward. Making such a file operate like a true config file is not.

This, by the way, has nothing to do with registering connectors, which is a database initialization operation.

Karl

--- original message ---
From: "ext Jack Krupansky" <jack.krupan...@lucidimagination.com>
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:33:34 AM


(b) The alternative starting point should probably autocreate the database,
and should also autoregister all connectors.  This will require a list,
somewhere, of the connectors and authorities that are included, and their
preferred UI names for that installation.  This could come from the
configuration information, or from some other place.  Any ideas?

I would like to see two things: 1) A way to request LCF to "dump" all
configuration parameters, including parameters for all output connections,
repositories, jobs, et al to an "LCF config file", and 2) The ability to
start from scratch with a fresh deployment of LCF and feed it that config
file to then create all of the output connections, repository connections,
and jobs to match the LCF configuration state desired.

Now, whether that config file is simple XML a la solrconfig.xml can be a
matter for debate. Whether it is a separate file from the current config
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an
LCF config file (not just the simple keyword/value file that LCF has for
global configuration settings) to seed the initial output connections,
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think
it feels that way. OTOH, the list of registered connectors, as opposed to
the user-created connections that use those connectors, seems more like Solr
request handlers that are in solrconfig.xml, so maybe the initial
"configuration" would be split into two separate files as in Solr. Or,
maybe, the Solr guys have a better proposal for how they would have managed that split in Solr if they had it to do all over again. My preference would
be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky

--------------------------------------------------
From: <karl.wri...@nokia.com>
Sent: Friday, May 28, 2010 5:48 AM
To: <connectors-dev@incubator.apache.org>
Subject: Proposal for simple LCF deployment model

The current LCF standard deployment model requires a number of moving
parts, which are probably necessary in some cases, but simply introduce
complexity in others.  It has occurred to me that it may be possible to
provide an alternate deployment model involving Jetty, which would reduce
the number of moving parts by one (by eliminating Tomcat).  A simple LCF
deployment could then, in principle, look pretty much like Solr's.

In order for this to work, the following has to be true:

(1) Jetty's basic JSP support must be comparable to Tomcat's.
(2) The class loader that Jetty uses for webapps must provide class
isolation similar to Tomcat's. If this condition is not met, we'd need to
build both a Tomcat and a Jetty version of each webapp.

The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which would
start Jetty running the lcf-crawler-ui and lcf-authority-service webapps
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the
database, and should also autoregister all connectors.  This will require
a list, somewhere, of the connectors and authorities that are included,
and their preferred UI names for that installation.  This could come from
the configuration information, or from some other place.  Any ideas?
(c) There would need to be an additional jar produced by the build process,
which would be the equivalent of the solr start.jar, so as to make running
the whole stack trivial.
(d) An "LCF API" web application, which provides access to all of the
current LCF commands, would also be an obvious requirement to go forward
with this model.
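
To illustrate point (b), here is a rough Java sketch of how the alternative entry point might read a connector list and loop over it to register each one. The list format, the class names, and the ConnectorAutoRegistration class itself are all invented for illustration; in a real implementation the loop body would invoke the same database registration that lcf.crawler.Register and lcf.agents.RegisterOutput perform today:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: parse a hypothetical "class name = UI name" connector
 *  list and walk it, as an autoregistration step might. */
public class ConnectorAutoRegistration {

    /** Parses the list, skipping blank lines and # comments.
     *  Returns class name -> preferred UI name, in file order. */
    static Map<String, String> parseConnectorList(String config) {
        Map<String, String> connectors = new LinkedHashMap<>();
        for (String line : config.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int eq = line.indexOf('=');
            if (eq < 0) continue;
            connectors.put(line.substring(0, eq).trim(),
                           line.substring(eq + 1).trim());
        }
        return connectors;
    }

    public static void main(String[] args) {
        // Hypothetical list; class names merely follow the style of the
        // existing org.apache.lcf.crawler.connectors.* packages, and the
        // UI names are per-installation choices.
        String config =
            "org.apache.lcf.crawler.connectors.filesystem.FileConnector=File System\n" +
            "org.apache.lcf.crawler.connectors.webcrawler.WebcrawlerConnector=Web\n";
        for (Map.Entry<String, String> e : parseConnectorList(config).entrySet()) {
            // Real code would perform the database registration here.
            System.out.println("Registering " + e.getKey()
                + " as \"" + e.getValue() + "\"");
        }
    }
}
```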

What are the disadvantages?  Well, I think that the main problem would be
security.  This deployment model, though simple, does not control access
to LCF in any way.  You'd need to introduce another moving part to do
that.

Bear in mind that this change would still not allow LCF to run using only
one process. There are still separate RMI-based processes needed for some
connectors (Documentum and FileNet).  Although these could in theory be
started up using Java Activation, a main reason for a separate process in
Documentum's case is that DFC randomly crashes the JVM under which it
runs, and thus needs to be independently restarted if and when it dies.
If anyone has experience with Java Activation and wants to contribute
their time to develop infrastructure that can deal with that problem,
please let me know.
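
For what it's worth, the restart requirement itself is simple to state. A minimal watchdog sketch (not LCF's actual mechanism, and ignoring the Java Activation angle entirely) would just restart the child process whenever it exits abnormally:

```java
import java.io.IOException;

/** Illustrative watchdog loop: keep an unreliable child process alive
 *  by restarting it whenever it dies with a nonzero exit code. */
public class ProcessWatchdog {

    /** Runs the child until it exits cleanly or the restart budget is
     *  spent.  Returns the number of times the child was started. */
    public static int supervise(ProcessBuilder builder, int maxRestarts)
            throws IOException, InterruptedException {
        int starts = 0;
        while (starts <= maxRestarts) {
            Process child = builder.inheritIO().start();
            starts++;
            int exitCode = child.waitFor();
            if (exitCode == 0) {
                return starts; // clean shutdown; stop supervising
            }
            System.err.println("Child exited with code " + exitCode
                + "; restarting");
        }
        return starts;
    }

    public static void main(String[] args) throws Exception {
        // "java -version" exits 0 immediately, so it is started only once.
        int starts = supervise(new ProcessBuilder("java", "-version"), 3);
        System.out.println("Child was started " + starts + " time(s)");
    }
}
```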

Finally, there is no way around the fact that LCF requires a
well-performing database, which constitutes an independent moving part of
its own.  This proposal does nothing to change that at all.

Please note that I'm not proposing that the current model go away, but
rather that we support both.

Thoughts?
Karl
