Re: Solr Config XML DTD's
I looked into inserting a formal validation step in o.a.solr.core.Config and ran some simple preliminary tests. The code is fairly simple; there are just a few gotchas:

1) To use the RNC validation language (my preference), we would need to pull in a couple of new jars, one of which is over 600K. Also, support for RNC in the XML world is not very widespread: it has drawn more interest from researchers than uptake elsewhere, so it might not be the best choice, even if it is aesthetically superior, IMO.

2) The other alternatives are XML Schema and DTD. I think DTD is a non-starter since it just can't allow things like arbitrary attributes on an element (you have to list them explicitly). Schema is probably the best choice, all things considered: support for it is built into the XML tools already in use, and it is widely adopted. The drawback is that it's a baroque and unwieldy syntax designed by an indecisive committee that loaded it down with excessive featuritis, and someone will end up having to maintain this: every time you add a new configuration option to the schema (or solrconfig, etc.), the schema-schema (validation schema?) will have to be updated to reflect that.

3) Finally, to get good error reporting it's important to show the file name and line number where an error occurred. Although you can validate a constructed XML tree (a DOM), it's better to run validation on a stream so the line numbers are available. Therefore it will probably be necessary to run two passes (one to validate, and one to construct the DOM), which means buffering the config. It doesn't seem like a big deal, since these are small files that only get loaded once, but this is a cost of validation, I think.
Of course the benefit is that users would actually get fast-failing, specific, and informative error messages covering a wide variety of misconfigurations: I would hope we could be restrictive enough to catch misspelled versions of known element and attribute names, or places where elements are out of order.

I'd be willing to work this up, develop a preliminary schema (of whichever sort we choose), and send in a patch, but other folks would probably end up having to maintain it from time to time if it's to have any value at all and not just get disabled, so I just want to make sure this is something you all think is worthwhile before going any further.

-Mike

On 05/17/2011 09:04 AM, Michael McCandless wrote:
> https://issues.apache.org/jira/browse/SOLR-2119 is a good example
> where we are failing to catch mis-configuration on startup.
>
> Is there some way we can baby step here? EG use one of these XML
> validation packages, incrementally, on only sub-strings from the XML?
> (Or simpler is to just do the checking ourselves w/ custom code).
>
> Mike
>
> http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
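The two-pass idea from the message above (validate from a buffered stream so that line numbers are available, then build the DOM from the same bytes) could look roughly like the sketch below. It uses the standard JAXP validation API with XML Schema. The class name TwoPassConfigLoader, the method name, and the error-message layout are made up for illustration; this is not what o.a.solr.core.Config does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Solr.
class TwoPassConfigLoader {

    /**
     * Pass 1 validates the buffered config as a stream, so the parser can
     * report line and column numbers. Pass 2 re-reads the same bytes to
     * build the DOM. fileName is only used for error messages.
     */
    static Document load(byte[] configBytes, String fileName, String xsdText)
            throws IOException, SAXException, ParserConfigurationException {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsdText)));
        Validator validator = schema.newValidator();
        try {
            // Pass 1: stream validation. With no ErrorHandler set, the
            // validator throws on the first error it finds.
            validator.validate(
                    new StreamSource(new ByteArrayInputStream(configBytes), fileName));
        } catch (SAXParseException e) {
            // Re-throw with a "file:line:column: message" prefix.
            throw new SAXException(fileName + ":" + e.getLineNumber() + ":"
                    + e.getColumnNumber() + ": " + e.getMessage(), e);
        }

        // Pass 2: build the DOM from the same buffered bytes.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(configBytes), fileName);
    }
}
```

The cost mentioned in the message shows up as the byte[] parameter: the whole config has to be held in memory so it can be read twice, which should be acceptable for small files that are loaded once.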
Re: Solr Config XML DTD's
https://issues.apache.org/jira/browse/SOLR-2119 is a good example where we are failing to catch mis-configuration on startup.

Is there some way we can baby step here? E.g., use one of these XML validation packages, incrementally, on only sub-strings from the XML? (Or simpler is to just do the checking ourselves w/ custom code.)

Mike

http://blog.mikemccandless.com

On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov wrote:
> I'm not sure you will find anyone wanting to put in this effort now, but
> another suggestion for a general approach might be:
>
> 1 very basic static analysis to catch what you can - this should be a pretty
> minimal effort only given what can reasonably be achieved
>
> 2 throw runtime errors as Hoss says (probably already doing this well
> enough, but maybe some incremental improvements are needed?)
>
> 3 an option to run a "configtest" like httpd provides that preloads all
> declared handlers/plugins/modules etc, instantiates them and gives them an
> opportunity to read their config and throw whatever errors they find. This
> way you can set a standard (error on unrecognized parameter, say) in some
> core areas, and distribute the effort. This is a hugely useful sanity check
> to be able to run when you want to make config changes and not have your
> server fall over when it starts (or worse - later).
>
> -Mike "kibitzer" Sokolov
Re: Solr Config XML DTD's
I'm not sure you will find anyone wanting to put in this effort now, but another suggestion for a general approach might be:

1. very basic static analysis to catch what you can - this should be a pretty minimal effort only, given what can reasonably be achieved

2. throw runtime errors as Hoss says (probably already doing this well enough, but maybe some incremental improvements are needed?)

3. an option to run a "configtest" like httpd provides that preloads all declared handlers/plugins/modules etc., instantiates them, and gives them an opportunity to read their config and throw whatever errors they find. This way you can set a standard (error on unrecognized parameter, say) in some core areas, and distribute the effort. This is a hugely useful sanity check to be able to run when you want to make config changes and not have your server fall over when it starts (or worse - later).

-Mike "kibitzer" Sokolov

On 5/4/2011 6:55 PM, Chris Hostetter wrote:
> As i said: any improvements to help catch the mistakes we can identify
> would be great, but we should maintain perspective of the effort/gain
> tradeoff given that there is likely nothing we can do about the basic
> problem of "a string that won't be evaluated until runtime"
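The "configtest" option in item 3 could be sketched as below: walk every declared plugin, instantiate it, hand it its parameters, and collect whatever it throws instead of stopping at the first failure. Everything here is hypothetical. ConfigurablePlugin, Declaration, and EchoHandler are stand-ins invented for the example and do not correspond to Solr's real plugin interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "configtest" run; none of these types exist in Solr.
class ConfigTest {

    /** Assumed plugin contract: read your own config, throw if it is bad. */
    interface ConfigurablePlugin {
        void init(Map<String, String> params);
    }

    /** One declared plugin: a display name, an implementation class, its params. */
    static final class Declaration {
        final String name;
        final String className;
        final Map<String, String> params;

        Declaration(String name, String className, Map<String, String> params) {
            this.name = name;
            this.className = className;
            this.params = params;
        }
    }

    /** Instantiates and initializes every declared plugin; returns all errors found. */
    static List<String> run(List<Declaration> declared) {
        List<String> errors = new ArrayList<>();
        for (Declaration d : declared) {
            try {
                ConfigurablePlugin plugin = Class.forName(d.className)
                        .asSubclass(ConfigurablePlugin.class)
                        .getDeclaredConstructor()
                        .newInstance();
                plugin.init(d.params);
            } catch (ReflectiveOperationException | RuntimeException e) {
                // Keep going so one run reports every broken declaration.
                errors.add(d.name + ": " + e);
            }
        }
        return errors;
    }

    /** Example plugin that applies the "error on unrecognized parameter" standard. */
    static class EchoHandler implements ConfigurablePlugin {
        @Override
        public void init(Map<String, String> params) {
            for (String key : params.keySet()) {
                if (!key.equals("echo")) {
                    throw new IllegalArgumentException(
                            "unrecognized parameter '" + key + "'");
                }
            }
        }
    }
}
```

An empty list from run means every declaration loaded cleanly. Because each plugin checks its own parameters, the effort is distributed in the way the message suggests.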
Re: Re: Solr Config XML DTD's
: This looks compelling! I'm also not sure what, specifically, we can
: validate in Solr's configuration... and I also don't know how much
: validation we do today. What hard errors does Solr produce on startup
: when configuration is wrong?

once upon a time solr would log some config errors, but then happily start up anyway. that "feature" has since been removed (as far as i know) and solr will now fail on any concrete error it encounters in the config.

what solr doesn't currently identify as an error is "unused" config (ie: someone types instead of ) ... although i seem to recall a patch committed not too long ago that started checking for unused field/fieldtype attributes when parsing schema.xml, along the lines of what McCandless described...

: do something like this: when a plugin "claims" a certain attr/element,
: this is recorded. If at the end of loading the config, there are
: unclaimed attrs/elements, then that's an error.

One potential problem with generalizing this approach is that we support "lazy" initialization of some plugins (it might just be RequestHandlers ... i don't remember off hand) so we'd need to watch out for that -- the whole point is to prevent the need for instantiating expensive plugins unless/until they are actually used, so you wouldn't want to force them to start up just to read/claim their configs.

As to the larger question...

: More generally, before we hash out an approach here, I'd like to know
: if anyone disagree that we should move Solr to more strict error
: checking of its configuration on startup. I think being silent on
: configuration errors is the wrong choice... and I think that's
: generally Solr's approach today (I think? Or do we catch
: configuration errors w/ a hard error and clear message?).

...as i mentioned: if solr sees an error, it should already fail hard and loud.
I would love to be able to do either static validation or more aggressive sanity checking of potential typos/unused configs on startup in a way that would catch the cases we currently miss -- w/o preventing plugins from having their own options (i regret not using something like xml namespaces for this from day 1) -- but i suspect that it could wind up being a lot of work for little gain.

Anecdotally, the most common "config mistakes" i see people making are along the lines of this:

explicit paramValueWithTypo ...

...i don't know of any static way we could validate the configs that would deal with what are ultimately going to be runtime params.

As i said: any improvements to help catch the mistakes we can identify would be great, but we should maintain perspective of the effort/gain tradeoff given that there is likely nothing we can do about the basic problem of "a string that won't be evaluated until runtime"

-Hoss
Re: Re: Solr Config XML DTD's
> if anyone disagree that we should move Solr to more strict error
> checking of its configuration on startup. I think being silent on
> configuration errors is the wrong choice... and I think that's

+1 for validation/warning/error messages from config files.

Excellent link, Michael (Sokolov), I didn't know about this at all.

Dawid
Re: Re: Solr Config XML DTD's
Hi Michael,

This looks compelling!

I'm also not sure what, specifically, we can validate in Solr's configuration... and I also don't know how much validation we do today. What hard errors does Solr produce on startup when configuration is wrong?

I know one challenge is the fact that plugins can reach in and claim attrs/elements, which makes validation more interesting. But we could do something like this: when a plugin "claims" a certain attr/element, this is recorded. If at the end of loading the config, there are unclaimed attrs/elements, then that's an error.

More generally, before we hash out an approach here, I'd like to know if anyone disagrees that we should move Solr to more strict error checking of its configuration on startup. I think being silent on configuration errors is the wrong choice... and I think that's generally Solr's approach today (I think? Or do we catch configuration errors w/ a hard error and clear message?).

Mike

http://blog.mikemccandless.com

On Sun, May 1, 2011 at 7:34 PM, Michael Sokolov wrote:
> My first post too - but if I can offer a suggestion - there are more modern
> XML validation technologies available than DTD. I would heartily recommend
> RelaxNG/Compact notation (see
> http://relaxng.org/compact-tutorial-20030326.html) - you can generate Relax
> from a DTD, but it is more expressive, while still being easy on the eyes
> (uses curly-brace syntax), and much simpler than XML schema.
>
> In particular it lets you express wildcard constraints like:
>
> start = anyElement
> anyElement =
>   element * {
>     (attribute * { text }
>      | text
>      | anyElement)*
>   }
>
> which matches absolutely anything.
>
> I'm not sure what kinds of constraints can actually be applied to solr's
> configuration in practice?
>
> But using a formal constraint language will give decent error reporting out
> of the box.
>
> Java-based tools for Relax validation and conversion are available here:
> http://code.google.com/p/jing-trang/
>
> -Mike S
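The "claim" bookkeeping proposed in the message above could be sketched as follows: every attribute read goes through a tracker, and once loading finishes, any attribute nobody asked for is reported. ClaimTracker and its method names are invented for this example. Only attributes are tracked here to keep it short; elements could be recorded the same way. As noted elsewhere in the thread, lazily initialized plugins would not have claimed anything by the end of startup, so a real version would need an exemption for them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; not Solr code.
class ClaimTracker {

    // Keyed by element identity, so two elements with equal content stay distinct.
    private final Map<Element, Set<String>> claimed = new IdentityHashMap<>();

    /** Records that a plugin asked for this attribute; returns its value or null. */
    String claimAttr(Element element, String name) {
        claimed.computeIfAbsent(element, k -> new HashSet<>()).add(name);
        return element.hasAttribute(name) ? element.getAttribute(name) : null;
    }

    /** Walks the tree and lists every attribute that was never claimed. */
    List<String> unclaimed(Element root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private void collect(Element element, List<String> out) {
        Set<String> names =
                claimed.getOrDefault(element, Collections.<String>emptySet());
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            String attrName = attrs.item(i).getNodeName();
            if (!names.contains(attrName)) {
                out.add("<" + element.getTagName() + "> attribute '" + attrName + "'");
            }
        }
        for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                collect((Element) c, out);
            }
        }
    }

    /** Convenience parser for trying the tracker on a small config string. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

A misspelled attribute is caught because the plugin claims the correct name, gets null back, and the misspelled one is left over in the unclaimed list at the end of loading.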
Re: Re: Solr Config XML DTD's
My first post too - but if I can offer a suggestion - there are more modern XML validation technologies available than DTD. I would heartily recommend RelaxNG/Compact notation (see http://relaxng.org/compact-tutorial-20030326.html) - you can generate Relax from a DTD, but it is more expressive, while still being easy on the eyes (uses curly-brace syntax), and much simpler than XML schema.

In particular it lets you express wildcard constraints like:

    start = anyElement
    anyElement =
      element * {
        (attribute * { text }
         | text
         | anyElement)*
      }

which matches absolutely anything.

I'm not sure what kinds of constraints can actually be applied to solr's configuration in practice?

But using a formal constraint language will give decent error reporting out of the box.

Java-based tools for Relax validation and conversion are available here: http://code.google.com/p/jing-trang/

-Mike S

On 2:59 PM, Michael McCandless wrote:
> If not a DTD, can we put some more "customized" form of validation for
> Solr's configuration?
>
> In general, I think servers should be anal on startup, refusing to
> start if there's anything off in their configuration.
>
> (Of course, along with this, the error messaging has to be *excellent*
> so you know precisely where the problem is, what's wrong, how to fix
> it).
>
> If you take the lenient/forgiving approach then you wind up with Solr
> instances in unknown states -- the app developer thinks they turned X
> on, everything starts fine, but then, silently, inexplicably, it's not
> working. This then leads to frustration, thinking Solr is buggy, not
> using this feature, blogging about problems, etc.
>
> Mike
>
> http://blog.mikemccandless.com
Re: Solr Config XML DTD's
If not a DTD, can we put some more "customized" form of validation for Solr's configuration?

In general, I think servers should be anal on startup, refusing to start if there's anything off in their configuration.

(Of course, along with this, the error messaging has to be *excellent* so you know precisely where the problem is, what's wrong, how to fix it).

If you take the lenient/forgiving approach then you wind up with Solr instances in unknown states -- the app developer thinks they turned X on, everything starts fine, but then, silently, inexplicably, it's not working. This then leads to frustration, thinking Solr is buggy, not using this feature, blogging about problems, etc.

Mike

http://blog.mikemccandless.com

On Tue, Mar 29, 2011 at 7:15 PM, Chris Hostetter wrote:
>
> : Hi, this is my first post to the mailing list. I'm working on a commercial
>
> Welcome!
>
> : My DTD works for our internal version of queryElevation.xml, but since the
> : ATTRIB name of the tag could be anything, I'm not sure how to write a
> : DTD that would validate any valid query elevation file.
>
> right .. this is one of the reasons why we've never tried to publish a DTD
> for the solrconfig.xml or schema.xml files either. there are lots of
> cases where plugins can define arbitrary attributes on the XML nodes.
>
> If i had the chance to do it all over again, and i better understood xml
> back when yonik first showed me what the configs would look like, i would
> have suggested using xml namespaces .. but that ship kind of sailed a
> while ago.
>
> we're getting a little better -- moving towards using the same type of
> "NamedList" backed XML for the initialization anytime new plugins are
> added, but i don't see it being feasible to have a config DTD anytime
> soon.
>
> -Hoss
Re: Solr Config XML DTD's
: Hi, this is my first post to the mailing list. I'm working on a commercial

Welcome!

: My DTD works for our internal version of queryElevation.xml, but since the
: ATTRIB name of the tag could be anything, I'm not sure how to write a
: DTD that would validate any valid query elevation file.

right .. this is one of the reasons why we've never tried to publish a DTD for the solrconfig.xml or schema.xml files either. there are lots of cases where plugins can define arbitrary attributes on the XML nodes.

If i had the chance to do it all over again, and i better understood xml back when yonik first showed me what the configs would look like, i would have suggested using xml namespaces .. but that ship kind of sailed a while ago.

we're getting a little better -- moving towards using the same type of "NamedList" backed XML for the initialization anytime new plugins are added, but i don't see it being feasible to have a config DTD anytime soon.

-Hoss
Solr Config XML DTD's
Hi, this is my first post to the mailing list. I'm working on a commercial implementation of a Solr project and would like to share some of my work, although it's not really much. I wrote a halting DTD for the Solr config file queryElevation.xml and would like to eventually write a DTD for the config file. Who do I need to talk to about reviewing my work and perhaps getting a little help?

My DTD works for our internal version of queryElevation.xml, but since the ATTRIB name of the tag could be anything, I'm not sure how to write a DTD that would validate any valid query elevation file.

Anyway, thanks. I put pressure on our company to redo our customer-facing search using Solr. It launches soon and I've impressed everyone all the way up to the CEO. Most of the credit goes to the Solr and Lucene devs for making it so easy on me.

Daniel