[ 
https://issues.apache.org/jira/browse/SOLR-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038355#comment-13038355
 ] 

Mike Sokolov commented on SOLR-1758:
------------------------------------

This was originally reported in the context of DIH, but as the OP said, it 
applies equally well to all configuration.

The config-validation.patch includes changes to Config that validate all XML 
configuration files loaded there.  The patch includes a schema with rules for 
<config/>, <schema/>, <solr/>, <elevate/> and <root/> (used in tests).  It could 
be extended for other files as well.  The change causes Config to look in 
solr.home for a file called config.xsd.  If found, it is loaded and used to 
validate whatever configuration file is being loaded.  If a validation error 
occurs, an exception is raised (and logged? this seemed to be the way it was 
done before, although it seemed odd to me - I'd have thought exception logging 
would want to be handled at an outermost layer).
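
As a rough illustration of the lookup described above (not the code in the 
patch), here is a minimal sketch using the JDK's javax.xml.validation API.  The 
class and method names are hypothetical; only the config.xsd file name and the 
"validate if found, otherwise skip" behaviour come from the description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration; not the class from the patch. */
final class ConfigValidation {
    /** The schema file name that Config is described as looking for in solr.home. */
    static final String SCHEMA_FILE = "config.xsd";

    /**
     * Validates configFile against solrHome/config.xsd when that schema exists.
     * Returns false when no schema is present (validation is skipped) and true
     * when the file validated cleanly.  An invalid file raises SAXException.
     */
    static boolean validateIfSchemaPresent(Path solrHome, Path configFile)
            throws IOException, SAXException {
        Path xsd = solrHome.resolve(SCHEMA_FILE);
        if (!Files.isRegularFile(xsd)) {
            return false; // no schema found: load the configuration unvalidated
        }
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsd.toFile());
        // With no ErrorHandler set, the validator throws on the first error.
        schema.newValidator().validate(new StreamSource(configFile.toFile()));
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ConfigValidation <solr.home> <config file>
        boolean validated = validateIfSchemaPresent(Path.of(args[0]), Path.of(args[1]));
        System.out.println(validated ? "validated" : "no config.xsd, skipped");
    }
}
```

The sketch leaves logging to the caller, which is the "outermost layer" 
arrangement mentioned above rather than what the existing code does.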

The Solr XML usage seems to be very flexible in practice.  Therefore the schema 
attempts to allow a fair amount of flexibility: for elements marked as 
"plugins" in the Wiki documentation, I've allowed pretty much arbitrary child 
content. The wildcards in the schema are "lax" which means that they allow any 
element, even unknown elements, but when known elements are found, they are 
validated against the model in the schema (eg: <str> is not allowed to have any 
child elements).
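
To make the "lax" behaviour concrete, here is a small self-contained sketch.  
The schema below is made up for illustration and is far smaller than the one in 
the patch: <plugin> and the unknown element names are hypothetical, and only 
the rule that <str> may not have child elements is taken from the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class LaxWildcardDemo {
    // Illustrative schema only: <plugin> accepts any children and attributes
    // through lax wildcards, while <str> is declared with text-only content.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='plugin'>"
      + "  <xs:complexType>"
      + "    <xs:sequence>"
      + "      <xs:any processContents='lax' minOccurs='0' maxOccurs='unbounded'/>"
      + "    </xs:sequence>"
      + "    <xs:anyAttribute processContents='lax'/>"
      + "  </xs:complexType>"
      + "</xs:element>"
      + "<xs:element name='str'>"
      + "  <xs:complexType><xs:simpleContent>"
      + "    <xs:extension base='xs:string'>"
      + "      <xs:attribute name='name' type='xs:string'/>"
      + "    </xs:extension>"
      + "  </xs:simpleContent></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    /** Returns true if xml validates against XSD, false on a validation error. */
    static boolean isValid(String xml) throws SAXException, IOException {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Unknown elements pass through the wildcard unvalidated.
        System.out.println(isValid(
            "<plugin class='x'><unknownThing foo='1'><nested/></unknownThing>"
          + "<str name='a'>ok</str></plugin>"));
        // A known element is still checked: <str> must not contain elements.
        System.out.println(isValid(
            "<plugin><str name='a'><child/></str></plugin>"));
    }
}
```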

All the Solr tests but one pass with the patch, which means that the 
configuration in the solr example, as well as the various test configurations 
in solr/src/test-files/solr/conf, are all valid according to the schema.  The 
exception is one solrconfig.xml with luceneMatchVersion=4.0; I think this 
should be LUCENE_40?  The patch also includes one new test of an invalid 
schema; it probably should have a few more.

However, my knowledge of Solr configuration options is far from encyclopedic 
(I spent a while with the documentation and examples), and there are almost 
certainly additional configuration options in use that should be accounted 
for in the "standard" schema, e.g. elements that should accept arbitrary 
attributes but currently don't.

In general I expect the schema could be evolved to be looser in some areas and 
perhaps tighter in others.

To help with that, I created some ant rules to convert the schema from Relax NG 
Compact syntax to XML Schema.  I find Relax easier to maintain, but including 
runtime validation support for Relax would require a large jar to be added to 
solr.  The patch adds dev-tools/schema, which contains config.rnc (the source 
schema) and a build.xml that compiles config.xsd from it using the trang.jar 
library and copies the result into a few places in the solr source tree.

Some TODOs:

It might be better to have separate schema files for separate configuration 
documents - this way the decision to validate could be made on a per-file 
basis, rather than globally for all configuration.
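
One possible shape for that per-file decision, sketched under an assumed naming 
convention (foo.xml is validated only if foo.xsd sits in solr.home).  Neither 
the convention nor the names below come from the patch; they are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class PerFileSchemaLookup {
    /**
     * Maps a configuration file name such as "solrconfig.xml" to
     * solrHome/solrconfig.xsd.  Returns an empty Optional when no such schema
     * exists, in which case the caller would skip validation for that file.
     */
    static Optional<Path> schemaFor(Path solrHome, String configFileName) {
        String suffix = ".xml";
        String base = configFileName.endsWith(suffix)
                ? configFileName.substring(0, configFileName.length() - suffix.length())
                : configFileName;
        Path candidate = solrHome.resolve(base + ".xsd");
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

With this arrangement, dropping a schema.xsd into solr.home would turn on 
validation for schema.xml alone, leaving the other files unvalidated.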

There is no model for <highlighting> in the schema - it's just a big wildcard 
right now.


> schema definition for configuration files
> -----------------------------------------
>
>                 Key: SOLR-1758
>                 URL: https://issues.apache.org/jira/browse/SOLR-1758
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Jorg Heymans
>         Attachments: config-validation-20110523.patch
>
>
> A schema definition would be able to spot the subtle error in the config below:
> {code}
>     <dataSource name="ora" driver="oracle.jdbc.OracleDriver" url="...." />
>     <datasource name="orablob" type="FieldStreamDataSource" />
>     <document name="mydoc">
>         <entity dataSource="ora" name="meta" query="select id, filename, 
> bytes from documents" >            
>             <field column="ID" name="id" />
>             <field column="FILENAME" name="filename" />
>             <entity dataSource="orablob" processor="TikaEntityProcessor" 
> url="bytes" dataField="meta.BYTES">
>               <field column="text" name="mainDocument"/>
>             </entity>
>          </entity>
>      </document>
> {code}
> Also, many XML editors support auto-completion based on a schema definition, so 
> it would be easier to create configuration without constantly having to refer 
> to javadoc or samples from the distribution.
> This applies equally to schema.xml and solrconfig.xml.
