I looked into inserting a formal validation step in o.a.solr.core.Config and ran some simple preliminary tests. The code is fairly simple; there are just a few gotchas:

1) To use the RNC validation language (my preference), we would need to pull in a couple of new jars, one of which is over 600K. Also, support for RNC in the XML world is not very widespread: it has drawn more interest from researchers than uptake among users at large, so it might not be the best choice, even if it is aesthetically superior, IMO.

2) The other alternatives are XML Schema and DTD. I think DTD is a non-starter since it just can't allow things like arbitrary attributes on an element (you have to list them explicitly). XML Schema is probably the best choice, all things considered: support for it is built into the XML tools already in use, and it is widely adopted. The drawback is that it's a baroque and unwieldy syntax, designed by an indecisive committee that loaded it down with features, and someone will end up having to maintain this: every time you add a new configuration option to the schema (or solrconfig, etc.), the schema-schema (validation schema?) will have to be updated to reflect that.

3) Finally, to get good error reporting it's important to show the file name and line number where an error occurred. Although you can validate a constructed XML tree (a DOM), it's better to run validation on a stream so the line numbers are available. Therefore it will probably be necessary to run two passes (one to validate, and one to construct the DOM), which means buffering the config. That doesn't seem like a big deal, since these are small files that only get loaded once, but it is a cost of validation, I think.
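
To make the two-pass idea in 3) concrete, here is a rough sketch using the javax.xml.validation API that ships with the JDK. Everything named here is made up for illustration: the class ConfigValidationSketch, the toy schema (a <config> element holding <name> then <size>), and the file name passed in. It is not the actual Config code and not a real schema for solrconfig.xml; it only shows buffering the bytes, validating them as a stream so line numbers are available, and then building the DOM from the same buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ConfigValidationSketch {

  // Hypothetical toy schema (NOT Solr's): <config> must contain <name> then <size>.
  // xs:anyAttribute allows arbitrary attributes on <config>, which a DTD cannot express.
  static final String TOY_XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
    + "<xs:element name='config'><xs:complexType>"
    + "<xs:sequence>"
    + "<xs:element name='name' type='xs:string'/>"
    + "<xs:element name='size' type='xs:int'/>"
    + "</xs:sequence>"
    + "<xs:anyAttribute processContents='skip'/>"
    + "</xs:complexType></xs:element>"
    + "</xs:schema>";

  // Pass 1: validate the buffered bytes as a stream and collect
  // "file:line: message" strings instead of stopping at the first error.
  static List<String> validate(byte[] xml, String fileName) throws Exception {
    List<String> errors = new ArrayList<>();
    SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
    Validator validator = schema.newValidator();
    validator.setErrorHandler(new ErrorHandler() {
      public void warning(SAXParseException e) { /* ignored in this sketch */ }
      public void error(SAXParseException e) { errors.add(describe(fileName, e)); }
      public void fatalError(SAXParseException e) throws SAXException {
        errors.add(describe(fileName, e));
        throw e; // malformed XML: no point in continuing
      }
    });
    try {
      validator.validate(new StreamSource(new ByteArrayInputStream(xml), fileName));
    } catch (SAXException e) {
      // Already recorded by fatalError above.
    }
    return errors;
  }

  static String describe(String fileName, SAXParseException e) {
    return fileName + ":" + e.getLineNumber() + ": " + e.getMessage();
  }

  // Pass 2: build the DOM from the same buffer, but only if validation passed.
  static Document load(byte[] xml, String fileName) throws Exception {
    List<String> errors = validate(xml, fileName);
    if (!errors.isEmpty()) {
      throw new IllegalArgumentException(String.join("\n", errors));
    }
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    return dbf.newDocumentBuilder().parse(new ByteArrayInputStream(xml), fileName);
  }
}
```

With a schema like this, a misspelled element name (say <sizee>) or elements in the wrong order would be rejected in the first pass with the file name and line number attached, before any DOM is built.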

Of course the benefit is that users would actually get fast-failing, specific, and informative error messages covering a wide variety of misconfigurations: I would hope we could be restrictive enough to catch misspelled versions of known element and attribute names, or places where elements are out of order.

I'd be willing to work this up, develop a preliminary schema (of whichever sort we choose), and send in a patch. But other folks would probably end up having to maintain it from time to time if it's to have any value at all and not just get disabled, so I want to make sure this is something you all think is worthwhile before going any further.

-Mike



On 05/17/2011 09:04 AM, Michael McCandless wrote:
https://issues.apache.org/jira/browse/SOLR-2119 is a good example
where we are failing to catch mis-configuration on startup.

Is there some way we can baby step here?  EG use one of these XML
validation packages, incrementally, on only sub-strings from the XML?
(Or simpler is to just do the checking ourselves w/ custom code).

Mike

http://blog.mikemccandless.com

On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov<soko...@ifactory.com>  wrote:
I'm not sure you will find anyone wanting to put in this effort now, but
another suggestion for a general approach might be:

1 very basic static analysis to catch what you can - this should be a pretty
minimal effort only given what can reasonably be achieved

2 throw runtime errors as Hoss says (probably already doing this well
enough, but maybe some incremental improvements are needed?)

3 an option to run a "configtest" like httpd provides that preloads all
declared handlers/plugins/modules etc, instantiates them and gives them an
opportunity to read their config and throw whatever errors they find.  This
way you can set a standard (error on unrecognized parameter, say) in some
core areas, and distribute the effort.  This is a hugely useful sanity check
to be able to run when you want to make config changes and not have your
server fall over when it starts (or worse - later).

-Mike "kibitzer" Sokolov

On 5/4/2011 6:55 PM, Chris Hostetter wrote:
As i said: any improvements to help catch the mistakes we can identify
would be great, but we should maintain perspective of the effort/gain
tradeoff given that there is likely nothing we can do about the basic
problem of "a string that won't be evaluated until runtime"


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


