Solr Config XML DTD's

2011-03-16 Thread Daniel Talsky
Hi, this is my first post to the mailing list.  I'm working on a commercial
implementation of a Solr project and would like to share some of my work,
although it's not really much.

I wrote a halting DTD for the Solr config file queryElevation.xml and would
like to eventually write a DTD for the config file.  Who do I need to talk
to about reviewing my work and perhaps getting a little help?

My DTD works for our internal version of queryElevation.xml, but since the
ATTRIB name of the  tag could be anything, I'm not sure how to write a
DTD that would validate any valid query elevation file.

Anyway, thanks.  I put pressure on our company to redo our customer-facing
search using Solr.  It launches soon and I've impressed everyone all the way
up to the CEO.  Most of the credit goes to the Solr and Lucene devs for
making it so easy on me.

Daniel


Re: Solr Config XML DTD's

2011-03-29 Thread Chris Hostetter

: Hi, this is my first post to the mailing list.  I'm working on a commercial

Welcome!

: My DTD works for our internal version of queryElevation.xml, but since the
: ATTRIB name of the  tag could be anything, I'm not sure how to write a
: DTD that would validate any valid query elevation file.

right .. this is one of the reasons why we've never tried to publish a DTD 
for the solrconfig.xml or schema.xml files either.  there are lots of 
cases where plugins can define arbitrary attributes on the XML nodes.

If i had the chance to do it all over again, and i better understood xml 
back when yonik first showed me what the configs would look like, i would 
have suggested using xml namespaces .. but that ship kind of sailed a 
while ago.
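
To illustrate the namespace idea: if plugin-specific attributes lived in their own XML namespace, a loader could check the un-namespaced "core" attributes strictly and leave everything else to the plugin.  The following is only a sketch under stated assumptions: the NamespacedAttrs class, the CORE attribute set, and the urn:example:my-plugin namespace are made up for this example and are not anything Solr defines.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

final class NamespacedAttrs {
    // Hypothetical set of attributes the core loader understands.
    static final Set<String> CORE = new HashSet<>(Arrays.asList("name", "class"));

    /** Returns un-namespaced attributes on the root element that the core does not know. */
    static List<String> unknownCoreAttributes(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getNamespaceURI()/getLocalName()
        Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();

        List<String> unknown = new ArrayList<>();
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            String ns = attr.getNamespaceURI();
            if (XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(ns)) {
                continue; // xmlns:* declarations are not configuration
            }
            if (ns != null) {
                continue; // namespaced attribute: owned by a plugin, not checked here
            }
            if (!CORE.contains(attr.getLocalName())) {
                unknown.add(attr.getLocalName());
            }
        }
        return unknown;
    }
}
```

With this split, a misspelled core attribute such as clas='typo' is reported, while a plugin attribute such as my:boost='2' passes through untouched.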

we're getting a little better -- moving towards using the same type of 
"NamedList" backed XML for the initialization anytime new plugins are 
added, but i don't see it being feasible to have a config DTD anytime 
soon.

-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Solr Config XML DTD's

2011-05-01 Thread Michael McCandless
If not a DTD, can we put in place some more "customized" form of
validation for Solr's configuration?

In general, I think servers should be anal on startup, refusing to
start if there's anything off in their configuration.

(Of course, along with this, the error messaging has to be *excellent*
so you know precisely where the problem is, what's wrong, how to fix
it).

If you take the lenient/forgiving approach then you wind up with Solr
instances in unknown states -- the app developer thinks they turned X
on, everything starts fine, but then, silently, inexplicably, it's not
working.  This then leads to frustration, thinking Solr is buggy, not
using this feature, blogging about problems, etc.

Mike

http://blog.mikemccandless.com




Re: Solr Config XML DTD's

2011-05-04 Thread Michael Sokolov
I'm not sure you will find anyone wanting to put in this effort now, but 
another suggestion for a general approach might be:


1. very basic static analysis to catch what you can - this should be a 
pretty minimal effort only given what can reasonably be achieved

2. throw runtime errors as Hoss says (probably already doing this well 
enough, but maybe some incremental improvements are needed?)

3. an option to run a "configtest" like httpd provides that preloads all 
declared handlers/plugins/modules etc, instantiates them and gives them 
an opportunity to read their config and throw whatever errors they 
find.  This way you can set a standard (error on unrecognized parameter, 
say) in some core areas, and distribute the effort.  This is a hugely 
useful sanity check to be able to run when you want to make config 
changes and not have your server fall over when it starts (or worse - 
later).
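
A minimal sketch of the "configtest" idea in the third item, assuming a hypothetical Checkable contract that each plugin implements.  None of these names (ConfigTest, Checkable, EchoPlugin, Declaration) are Solr APIs; they only illustrate instantiating every declared plugin, letting it read its own config, and collecting all errors instead of stopping at the first one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ConfigTest {
    /** Hypothetical plugin contract: read your own params, throw on anything wrong. */
    interface Checkable {
        void init(Map<String, String> params);
    }

    /** Example plugin that enforces "error on unrecognized parameter". */
    static final class EchoPlugin implements Checkable {
        private static final Set<String> KNOWN =
                new HashSet<>(Arrays.asList("echoParams", "rows"));

        @Override
        public void init(Map<String, String> params) {
            for (String key : params.keySet()) {
                if (!KNOWN.contains(key)) {
                    throw new IllegalArgumentException("unrecognized parameter '" + key + "'");
                }
            }
        }
    }

    /** One declared plugin: a name, an implementation class, and its params. */
    static final class Declaration {
        final String name;
        final Class<? extends Checkable> type;
        final Map<String, String> params;

        Declaration(String name, Class<? extends Checkable> type, Map<String, String> params) {
            this.name = name;
            this.type = type;
            this.params = params;
        }
    }

    /** Instantiates and initializes every declared plugin, collecting all errors. */
    static List<String> run(List<Declaration> declared) {
        List<String> errors = new ArrayList<>();
        for (Declaration d : declared) {
            try {
                Checkable plugin = d.type.getDeclaredConstructor().newInstance();
                plugin.init(d.params);
            } catch (ReflectiveOperationException e) {
                errors.add(d.name + ": could not instantiate " + d.type.getName());
            } catch (RuntimeException e) {
                errors.add(d.name + ": " + e.getMessage());
            }
        }
        return errors;
    }
}
```

Because each plugin validates its own parameters, the core only has to define the contract, which is the "distribute the effort" part of the suggestion.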


-Mike "kibitzer" Sokolov

On 5/4/2011 6:55 PM, Chris Hostetter wrote:


> As i said: any improvements to help catch the mistakes we can identify
> would be great, but we should maintain perspective of the effort/gain
> tradeoff given that there is likely nothing we can do about the basic
> problem of "a string that won't be evaluated until runtime"







Re: Solr Config XML DTD's

2011-05-17 Thread Michael McCandless
https://issues.apache.org/jira/browse/SOLR-2119 is a good example
where we are failing to catch mis-configuration on startup.

Is there some way we can baby step here?  EG use one of these XML
validation packages, incrementally, on only sub-strings from the XML?
(Or simpler is to just do the checking ourselves w/ custom code).

Mike

http://blog.mikemccandless.com




Re: Solr Config XML DTD's

2011-05-18 Thread Mike Sokolov
I looked into inserting a formal validation step in o.a.solr.core.Config 
and ran some preliminary simple tests.  The code is fairly simple; just 
a couple of gotchas:


1) to use the RNC validation language (my preference), we would need to 
pull in a couple of new jars, one of which is over 600K.  Also, support 
for RNC in the XML world is not very widespread: it has attracted more 
interest from researchers than uptake more broadly, so it might not 
be the best choice, even if, aesthetically, it is superior IMO.


2) The other alternatives are XML Schema and DTD.  I think DTD is a 
non-starter since it just can't allow things like arbitrary attributes 
on an element (you have to list them explicitly).  Schema is probably 
the best choice all things considered: support for it is built into the 
XML tools already in use, and it is widely adopted.  The drawback is 
that it's a baroque and unwieldy syntax designed by an indecisive 
committee that loaded it down with featuritis, and someone 
will end up having to maintain this: every time you add a new 
configuration option to the schema (or solrconfig, etc), the 
schema-schema (validation schema?) will have to be updated to reflect that.


3) Finally, to get good error reporting it's important to show file name 
and line number where an error occurred.  Although you can validate a 
constructed XML tree (a DOM), it's better to run validation on a Stream 
so the line numbers are available.  Therefore it will probably be 
necessary to run two passes (one to validate, and one to construct the 
DOM), which means buffering the config.  Doesn't seem like a big deal: 
these are small files that only get loaded once, but this is a cost of 
validation, I think.
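
A minimal sketch of the two-pass approach using the javax.xml.validation API that ships with the JDK.  The schema below is a toy written for this example and is NOT Solr's real config grammar; the TwoPassConfigLoader class and the attribute names are assumptions.  It shows two things from the points above: xs:anyAttribute lets XML Schema accept arbitrary extra attributes (which a DTD cannot), and validating a stream rather than a DOM makes line numbers available to the error handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class TwoPassConfigLoader {
    // Hypothetical toy schema: 'name' and 'class' are required, and
    // xs:anyAttribute admits any additional plugin-specific attribute.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='config'><xs:complexType><xs:sequence>"
        + "<xs:element name='requestHandler' maxOccurs='unbounded'><xs:complexType>"
        + "<xs:attribute name='name' type='xs:string' use='required'/>"
        + "<xs:attribute name='class' type='xs:string' use='required'/>"
        + "<xs:anyAttribute processContents='skip'/>"
        + "</xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Pass 1: validate the buffered text as a stream so line numbers are known. */
    static List<String> validate(String fileName, String xml) throws Exception {
        Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
        final List<String> problems = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { add(e); }
            @Override public void error(SAXParseException e) { add(e); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // not well-formed: nothing more can be checked
            }
            private void add(SAXParseException e) {
                problems.add(fileName + ":" + e.getLineNumber() + ": " + e.getMessage());
            }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return problems;
    }

    /** Pass 2: build the DOM from the same buffered text, only if pass 1 was clean. */
    static Document load(String fileName, String xml) throws Exception {
        List<String> problems = validate(fileName, xml);
        if (!problems.isEmpty()) {
            throw new IllegalStateException(String.join("\n", problems));
        }
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

Holding the config in a String is the "buffering" cost mentioned above; for small files that are loaded once it should be negligible.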


Of course the benefit is that users would actually get fast-failing 
specific and informative error messages covering a wide variety of 
misconfigurations: I would hope we could be restrictive enough to catch 
mis-spelled versions of known element and attribute names, or places 
where elements are out of order.


I'd be willing to work this up, develop a preliminary schema (of 
whichever sort we choose), and send in a patch, but other folks would 
probably end up having to maintain it from time to time if it's to have 
any value at all and not just get disabled, so I just want to make sure 
this is something you all think is worthwhile before going any further.


-Mike






Re: Re: Solr Config XML DTD's

2011-05-01 Thread Michael Sokolov
My first post too - but if I can offer a suggestion - there are more 
modern XML validation technologies available than DTD.  I would heartily 
recommend RelaxNG/Compact notation (see 
http://relaxng.org/compact-tutorial-20030326.html) - you can generate 
Relax from a DTD, but it is more expressive, while still being easy on 
the eyes (uses curly-brace syntax), and much simpler than XML Schema.


In particular it lets you express wildcard constraints like:

start = anyElement
anyElement =
  element * {
    (attribute * { text }
     | text
     | anyElement)*
  }

which matches absolutely anything.

I'm not sure what kinds of constraints can actually be applied to solr's 
configuration in practice?

But using a formal constraint language will give decent error reporting out of 
the box.

Java-based tools for Relax validation and conversion are available here: 
http://code.google.com/p/jing-trang/

-Mike S




Re: Re: Solr Config XML DTD's

2011-05-04 Thread Michael McCandless
Hi Michael,

This looks compelling!  I'm also not sure what, specifically, we can
validate in Solr's configuration... and I also don't know how much
validation we do today.  What hard errors does Solr produce on startup
when configuration is wrong?

I know one challenge is the fact that plugins can reach in and claim
attrs/elements, which makes validation more interesting.  But we could
do something like this: when a plugin "claims" a certain attr/element,
this is recorded.  If at the end of loading the config, there are
unclaimed attrs/elements, then that's an error.
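
A minimal sketch of that claim-tracking idea, assuming a hypothetical ClaimTrackingAttrs wrapper around one element's attributes.  This is not an existing Solr class; it only shows the bookkeeping: every read is recorded as a claim, and whatever is left unclaimed after loading becomes an error.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ClaimTrackingAttrs {
    private final Map<String, String> attrs;
    private final Set<String> claimed = new LinkedHashSet<>();

    ClaimTrackingAttrs(Map<String, String> attrs) {
        this.attrs = attrs;
    }

    /** A plugin reads an attribute; the read itself is recorded as a claim. */
    String claim(String name) {
        claimed.add(name);
        return attrs.get(name); // null if the attribute is absent
    }

    /** Attribute names present in the config that nobody asked for, sorted. */
    Set<String> unclaimed() {
        Set<String> left = new TreeSet<>(attrs.keySet());
        left.removeAll(claimed);
        return left;
    }

    /** Call once loading is finished; fails hard if anything was never claimed. */
    void failOnUnclaimed(String elementPath) {
        Set<String> left = unclaimed();
        if (!left.isEmpty()) {
            throw new IllegalStateException(
                    "Unrecognized attribute(s) " + left + " on " + elementPath);
        }
    }
}
```

A typo such as defualt="true" is never claimed by the code that asks for "default", so it surfaces in failOnUnclaimed instead of being silently ignored.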

More generally, before we hash out an approach here, I'd like to know
if anyone disagrees that we should move Solr to more strict error
checking of its configuration on startup.  I think being silent on
configuration errors is the wrong choice... and I think that's
generally Solr's approach today (I think?  Or do we catch
configuration errors w/ a hard error and clear message?).

Mike

http://blog.mikemccandless.com




Re: Re: Solr Config XML DTD's

2011-05-04 Thread Dawid Weiss
>
> if anyone disagrees that we should move Solr to more strict error
> checking of its configuration on startup.  I think being silent on
> configuration errors is the wrong choice... and I think that's


+1 for validation/ warning/ error messages from config files. Excellent
link, Michael (Sokolov), I didn't know about this at all.

Dawid


Re: Re: Solr Config XML DTD's

2011-05-04 Thread Chris Hostetter
: This looks compelling!  I'm also not sure what, specifically, we can
: validate in Solr's configuration... and I also don't know how much
: validation we do today.  What hard errors does Solr produce on startup
: when configuration is wrong?

once upon a time solr would log some config errors, but then happily start 
up anyway.

that "feature" has since been removed (as far as i know) and solr will now 
fail on any concrete error it encounters in the config.

what solr doesn't currently identify as an error is "unused" config 
(ie: someone types  instead of 
) ... although i seem to recall a patch committed not too 
long ago that started checking for unused field/fieldtype attributes when 
parsing schema.xml, along the lines of what McCandless described...

: do something like this: when a plugin "claims" a certain attr/element,
: this is recorded.  If at the end of loading the config, there are
: unclaimed attrs/elements, then that's an error.

One potential problem with generalizing this approach is that we support 
"lazy" initialization of some plugins (it might just be RequestHandlers ... 
i don't remember off hand) so we'd need to watch out for that -- the whole 
point is to prevent the need for instantiating expensive plugins 
unless/until they are actually used, so you wouldn't want to force them 
to start up just to read/claim their configs.

As to the larger question...

: More generally, before we hash out an approach here, I'd like to know
: if anyone disagrees that we should move Solr to more strict error
: checking of its configuration on startup.  I think being silent on
: configuration errors is the wrong choice... and I think that's
: generally Solr's approach today (I think?  Or do we catch
: configuration errors w/ a hard error and clear message?).

...as i mentioned: if solr sees an error, it should already fail hard and 
loud.

I would love to be able to do either static validation or more aggressive 
sanity checking of potential typos/unused configs on startup in a way that 
would catch the cases we currently miss -- w/o preventing plugins from 
having their own options (i regret not using something like xml 
namespaces for this from day 1) -- but i suspect that it could wind up 
being a lot of work for little gain.

Anecdotally, the most common "config mistake" i see people making 
is along the lines of this:

  
 
   explicit
   paramValueWithTypo
...

...i don't know of any static way we could validate the configs that would 
deal with what are ultimately going to be runtime params.

As i said: any improvements to help catch the mistakes we can identify 
would be great, but we should maintain perspective of the effort/gain 
tradeoff given that there is likely nothing we can do about the basic 
problem of "a string that won't be evaluated until runtime"


-Hoss
