Re: Dynamic schema design: feedback requested

2013-03-11 Thread Yonik Seeley
On Wed, Mar 6, 2013 at 7:50 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 2) If you wish to use the /schema REST API for read and write operations,
 then schema information will be persisted under the covers in a data store
 whose format is an implementation detail just like the index file format.

This really needs to be driven by costs and benefits...
There are clear benefits to having a simple human readable / editable
file for the schema (whether it's on the local filesystem or on ZK).

 The ability to say "my schema is a config file and I own it" should always 
 exist (remove it over my dead body).

There are clear benefits to this being the persistence mechanism for
the REST API.

Even if the REST API persisted its data in some binary format, for
example, there would still need to be import/export mechanisms
for the human readable/editable config file that should
always exist.  Why would we want any other intermediate format (i.e.
data that is not human readable)?  Seems like we should only
introduce that extra complexity if the benefits are great enough.
Actually, I just realized we already have this intermediate
representation - it's the in-memory IndexSchema object.
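
[Sketch, for illustration: if the in-memory IndexSchema is the pivot
representation, each persistence format is just one codec against it, and
import/export falls out of having two codecs. SchemaModel and SchemaCodec
below are illustrative stand-ins, not actual Solr classes.]

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Stand-in for the in-memory IndexSchema: fields, types, copyFields, etc.
    final class SchemaModel { }

    // One implementation per serialization format (schema.xml, JSON, ...);
    // converting between formats is read() with one codec, write() with another.
    interface SchemaCodec {
      SchemaModel read(InputStream in) throws IOException;
      void write(SchemaModel schema, OutputStream out) throws IOException;
    }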

-Yonik
http://lucidworks.com


Re: Dynamic schema design: feedback requested

2013-03-11 Thread Chris Hostetter

:  2) If you wish to use the /schema REST API for read and write operations,
:  then schema information will be persisted under the covers in a data store
:  whose format is an implementation detail just like the index file format.
: 
: This really needs to be driven by costs and benefits...
: There are clear benefits to having a simple human readable / editable
: file for the schema (whether it's on the local filesystem or on ZK).

The cost is the user complexity of understanding what changes are 
respected and when, and the implementation complexity of dealing with 
changes coming from multiple code paths (both files changed on disk and 
REST based request changes).

In the current model, the config file on disk is the authority: it is read 
in its entirety on core init/reload, and users have total ownership of 
that file -- changes are funneled through the user, into the config, and 
Solr is a read-only participant.  Since Solr knows the only way schema 
information will ever change is when it reads that file, it can make 
internal assumptions about the consistency of that data.

In a model where a public REST API might be modifying Solr's in-memory 
state, Solr can't necessarily make those same assumptions, and the system 
becomes a lot simpler if Solr is the authority on schema information, so 
that we don't have to worry about what happens if conflicts arise, e.g.: 
someone modifies the schema on disk, but hasn't (yet?) done a core reload, 
when a new REST request comes in to modify the schema data in some other way.



-Hoss


Re: Dynamic schema design: feedback requested

2013-03-11 Thread Chris Hostetter

To revisit sarowe's comment about how/when to decide if we are using the
"config file" version of schema info (and the API is read only) vs the
"internal managed state data" version of schema info (and the API is
read/write)...

On Wed, 6 Mar 2013, Steve Rowe wrote:

: Two possible approaches:
: 
: a. When schema.xml is present, ...
...
: b. Alternatively, the reverse: ...
...
: I like option a. better, since it provides a stable situation for users 
: who don't want the new dynamic schema modification feature, and who want 
: to continue to hand edit schema.xml.  Users who want the new feature 
: would use a command-line tool to convert their schema.xml to 
: schema.json, then remove schema.xml from conf/.


The more I think about it, the less I like either a or b because both 
are completely implicit.

I think, practically speaking, from a support standpoint, we should require 
a more explicit configuration of what *type* of schema management 
should be used, and then have code that sanity checks that and warns/fails 
if the configuration setting doesn't match what is found in the ./conf 
dir.

The situation I worry about is when a novice Solr user takes over 
maintenance of an existing setup that is using REST based schema management, 
and therefore has no schema.xml file.  The novice is reading 
docs/tutorials talking about how to achieve some goal, which make reference 
to "editing the schema.xml" or "adding XXX to the schema.xml" or, even 
worse in the case of some CMSs: "To upgrade to FooCMS vX.Y, replace your 
schema.xml with this file..." -- but they have no schema.xml, nor any clear 
and obvious indication, looking at what configs they do have, of *why* there 
is no schema.xml, so maybe they try to add one.

I think it would be better to add some new option in solrconfig.xml that 
requires the user to be explicit about what type of management they want 
to use, defaulting to schema.xml for back compat...

  <schema type="conf"
          [maybe an optional file="path/to/schema.xml" ?] />

...vs...

  <schema type="managed"
          [this is where the mutable="true|false" sarowe mentioned could live] />

Then on core load:

1) if the configured schema type is "conf" but there is no schema.xml 
file, ERROR loudly and fail fast.

2) if we see that the configured schema type is "conf" but we detect 
the existence of managed internal schema info (schema.json, zk nodes, 
whatever), then we should WARN that the managed internal data is being 
ignored.

3) if the configured schema type is "managed" but there is no managed 
internal schema info (schema.json, zk nodes, whatever), then ERROR loudly 
and fail fast (or maybe we create an empty schema for them?)

4) if we see that the configured schema type is "managed" but we 
also detect the existence of a schema.xml config file, then we should 
WARN that the schema.xml is being ignored.

...although I could easily be convinced that all of those WARN 
situations should really be hard failures to reduce confusion -- depends 
on how easy we can make it to let users delete all internally managed 
schema info before switching to a type="conf" schema.xml approach.
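
[Sketch of those four core-load checks, assuming a simple type flag plus two
existence tests -- the names below are illustrative, not actual Solr code:]

    import java.util.logging.Logger;

    // Sanity checks described above; "conf"/"managed" and the two existence
    // flags stand in for the real configuration and detection logic.
    class SchemaTypeCheck {
      private static final Logger log = Logger.getLogger("SchemaTypeCheck");

      static void validate(String type, boolean schemaXmlExists, boolean managedDataExists) {
        if ("conf".equals(type)) {
          if (!schemaXmlExists)   // case 1: ERROR loudly, fail fast
            throw new IllegalStateException("type=conf but no schema.xml found");
          if (managedDataExists)  // case 2: WARN, managed data ignored
            log.warning("type=conf: managed internal schema data will be ignored");
        } else if ("managed".equals(type)) {
          if (!managedDataExists) // case 3: ERROR loudly, fail fast
            throw new IllegalStateException("type=managed but no managed schema info found");
          if (schemaXmlExists)    // case 4: WARN, schema.xml ignored
            log.warning("type=managed: schema.xml will be ignored");
        }
      }
    }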


-Hoss


Re: Dynamic schema design: feedback requested

2013-03-11 Thread Yonik Seeley
On Mon, Mar 11, 2013 at 2:50 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :  2) If you wish to use the /schema REST API for read and write operations,
 :  then schema information will be persisted under the covers in a data store
 :  whose format is an implementation detail just like the index file format.
 :
 : This really needs to be driven by costs and benefits...
 : There are clear benefits to having a simple human readable / editable
 : file for the schema (whether it's on the local filesystem or on ZK).

 The cost is the user complexity of understanding what changes are
 respected and when

There is going to be a cost to understanding any feature.  This
doesn't answer the question "are we better off with or without this
feature?"

, and the implementation complexity of dealing with
 changes coming from multiple code paths (both files changed on disk and
 REST based request changes)

Right - and these should be quantifiable going forward.
In ZK mode, we need concurrency control anyway, so depending on the
design, there may be really no cost at all.
In local FS mode, it might be a very low cost (simply check the
timestamp on the file, for example).  Code to re-read the schema and
merge changes needs to be there anyway for cloud mode, it seems.  *If*
we needed to, we could just assert that the schema file is the
persistence mechanism, as opposed to the system of record; hence if
you hand edit it and then use the API to change it, your hand edit may
be lost.  Or we may decide to do away with local FS mode altogether.
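
[Sketch of that timestamp check for local FS mode -- illustrative names, not
actual Solr code; it only detects hand edits between REST operations, it
doesn't resolve them:]

    import java.io.File;

    // Before applying a REST modification, detect whether the schema file
    // changed on disk (e.g. a hand edit) since we last read it.
    class SchemaFileWatcher {
      private final File schemaFile;
      private long lastSeenModified;

      SchemaFileWatcher(File schemaFile) {
        this.schemaFile = schemaFile;
        this.lastSeenModified = schemaFile.lastModified();
      }

      // True if someone touched the file since our last read; the caller can
      // then re-read and merge before persisting its own change.
      boolean changedOnDisk() {
        return schemaFile.lastModified() != lastSeenModified;
      }

      void noteReloaded() {
        lastSeenModified = schemaFile.lastModified();
      }
    }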

I guess my main point is, we shouldn't decide a priori that using the
API means you can no longer hand edit.

My thoughts on this are probably heavily influenced by how I initially
envisioned the implementation working in cloud mode (which I thought about
first since it's harder).  A human readable file on ZK that represents
the system of record for the schema seemed best.  I never
even considered making it non-human readable (and thus non-editable by
hand).

-Yonik
http://lucidworks.com


Re: Dynamic schema design: feedback requested

2013-03-11 Thread Chris Hostetter

: we needed to, we could just assert that the schema file is the
: persistence mechanism, as opposed to the system of record, hence if
: you hand edit it and then use the API to change it, your hand edit may
: be lost.  Or we may decide to do away with local FS mode altogether.

Presuming that it's just a persistence mechanism, but also assuming that 
the user may edit directly, still creates burdens/complexity in when Solr 
reads/writes that file -- even if we say that user edits to that file 
might be overridden (ie: does Solr guarantee if/when the file will be 
written to if you use the REST API to modify things? -- that's going to be 
important if we let people read/edit that file).

: I guess my main point is, we shouldn't decide a priori that using the
: API means you can no longer hand edit.

And my point is: if we build a feature where Solr has the ability to 
read/write some piece of information, we should start with the assumption 
that it's OK for us to decide that a priori, and not walk into things 
assuming we have to support a lot of much more complicated use cases.  If 
at some point during the implementation we find that supporting a more lax 
"it's ok, you can edit this by hand" approach won't be a burden, then so 
be it -- we can relax that a priori assertion.

: My thoughts on this are probably heavily influenced on how I initially

My thoughts on this are based directly on:

A) the observations of the confusion & implementation complexity 
observed in the dual nature of solr.xml over the years.

B) having spent a lot of time maintaining code that did programmatic 
reading/writing of Solr schema.xml files while also trying to treat them as 
config files that users were allowed to hand edit -- it's a pain in the 
ass.

: envisioned implementation working in cloud mode (which I thought about
: first since it's harder).  A human readable file on ZK that represents
: the system of record for the schema seemed to be the best.  I never

1) I never said the data couldn't/shouldn't be human readable -- I said it 
should be an implementation detail (ie: subject to change automatically on 
upgrade, just like the index format), and that end users shouldn't be 
allowed to edit it arbitrarily.

2) Cloud mode, as I understand it, is actually much *easier* (if you want 
to allow arbitrary user edits to these files) because you can set ZK 
watches on those nodes, so any code that is maintaining internal state 
based on them (ie: REST API round trip serialization code that just read 
the file in to modify the DOM before writing it back out) can be notified 
if the file has changed.  I also believe I was told that writes to files 
in ZK are atomic, which also means you never have to worry about reading 
partial data in the middle of someone else's write.
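
[Sketch of that watch idea with the stock ZooKeeper client -- the node path
is hypothetical, and session/error handling is elided:]

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Re-read the schema node whenever it changes, so in-memory state never
    // goes stale against a hand edit.
    class SchemaNodeWatcher implements Watcher {
      private static final String SCHEMA_PATH = "/configs/configA/schema.xml"; // hypothetical
      private final ZooKeeper zk;

      SchemaNodeWatcher(ZooKeeper zk) { this.zk = zk; }

      byte[] readAndWatch() throws Exception {
        Stat stat = new Stat();
        // Passing "this" registers a one-shot watch; process() fires on the next change.
        return zk.getData(SCHEMA_PATH, this, stat);
      }

      public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
          // Rebuild the in-memory schema from the new bytes, then call
          // readAndWatch() again to re-register the watch.
        }
      }
    }

For writes, zk.setData(path, bytes, expectedVersion) gives compare-and-set
semantics: a concurrent change fails the version check instead of being
silently clobbered.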

In the general situation of config files on disk, we can't even try to 
enforce a lock-file type approach, because we shouldn't assume a user will 
remember to obey our locks before editing the file.

If you & sarowe & others feel that:

1) it's important to allow arbitrary user editing of schema.xml files in 
ZK mode even when REST read/writes are enabled,
2) allowing arbitrary user edits w/o risk of conflict or complexity 
in the REST read/write code is easy to implement in ZK mode, and
3) it's reasonable to require ZK mode in order to support read/write mode 
in the REST API,

...then that would certainly resolve my concerns stemming from "B" 
above.  I'm still worried about "A", but perhaps the ZK nature of things 
and the watches & atomicity provided there will reduce confusion.

But as long as we are talking about this REST API supporting reads & 
writes to schema info even when running in single node mode with files on 
disk -- I think it is a *HUGE* fucking mistake to start with the 
assumption that the serialization mechanism of the REST API needs to be 
able to play nicely with arbitrary user editing of schema.xml.


-Hoss


Re: Dynamic schema design: feedback requested

2013-03-11 Thread Yonik Seeley
On Mon, Mar 11, 2013 at 5:51 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 : I guess my main point is, we shouldn't decide a priori that using the
 : API means you can no longer hand edit.

 And my point is: if we build a feature where Solr has the ability to
 read/write some piece of information, we should start with the assumption
 that it's OK for us to decide that a priori, and not walk into things
 assuming we have to support a lot of much more complicated use cases.  If
 at some point during the implementation we find that supporting a more lax
 "it's ok, you can edit this by hand" approach won't be a burden, then so
 be it -- we can relax that a priori assertion.

I guess I like a more breadth-first method (or at least that's what it
feels like to me).
You keep both options in mind as you proceed, and don't start off with a
hard assertion either way.
It would be nice to support editing by hand... but if it becomes too
burdensome, c'est la vie.

If the persistence format we're going to use is nicely human readable,
then I'm good.  We can disagree on philosophies, but I'm not sure that
it amounts to much in the way of concrete differences at this point.
What concerned me was talk of starting to treat this as more of a
black box.

-Yonik
http://lucidworks.com


Re: Dynamic schema design: feedback requested

2013-03-08 Thread Steve Rowe
Hi Jan,

On Mar 6, 2013, at 4:50 PM, Jan Høydahl jan@cominvent.com wrote:
 Will ZK get pushed the serialized monolithic schema.xml / schema.json from 
 the node which changed it, and then trigger an update to the rest of the 
 cluster?

Yes.

 I was kind of hoping that once we have introduced ZK into the mix as our 
 centralized config server, we could start using it as such consistently. And 
 so instead of ZK storing a plain xml file, we split up the schema as native 
 ZK nodes […]

Erik Hatcher made the same suggestion on SOLR-3251: 
https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13571713&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571713

My response on the issue: 
https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13572774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13572774

In short, I'm not sure it's a good idea, and in any event, I don't want to 
implement this as part of the initial implementation - it could be added on 
later.

 multiple collections may share the same config set and thus schema, so what 
 happens if someone does not know this and hits PUT 
 localhost:8983/solr/collection1/schema and it also affects the schema for 
 collection2?

Hmm, that's a great question.  Querying against a named config rather than a 
collection/core would not be an improvement, though, since the relationship 
between the two wouldn't be represented in the request.

Maybe if there were requests that returned the collections using a particular 
named config, and vice versa, people could at least discover problematic 
dependencies before they send schema modification requests?  Or maybe such 
requests already exist?

Steve

Re: Dynamic schema design: feedback requested

2013-03-08 Thread Steve Rowe
On Mar 6, 2013, at 7:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
 I think it would make a lot of sense -- not just in terms of 
 implementation but also for end user clarity -- to have some simple, 
 straightforward to understand caveats about maintaining schema 
 information...
 
 1) If you want to keep schema information in an authoritative config file 
 that you can manually edit, then the /schema REST API will be read only. 
 
 2) If you wish to use the /schema REST API for read and write operations, 
 then schema information will be persisted under the covers in a data store 
 whose format is an implementation detail just like the index file format.
 
 3) If you are using a schema config file and you wish to switch to using 
 the /schema REST API for managing schema information, there is a 
 tool/command/API you can run to do so.
 
 4) If you are using the /schema REST API for managing schema information, 
 and you wish to switch to using a schema config file, there is a 
 tool/command/API you can run to export the schema info in a config file 
 format.

+1

 ...whether or not the "under the covers in a data store" used by the REST 
 API is JSON, or some binary data, or an XML file (just schema.xml w/o 
 whitespace/comments) should be an implementation detail.  Likewise the 
 question of whether some new config file formats are added -- it shouldn't 
 matter.
 
 If it's config, it's config and the user owns it.
 If it's data, it's data and the system owns it.

Calling the system-owned file 'schema.dat', rather than 'schema.json' (i.e., 
extension=format), would help to reinforce this black-box view.

Steve



Re: Dynamic schema design: feedback requested

2013-03-08 Thread Steve Rowe
On Mar 8, 2013, at 2:57 PM, Steve Rowe sar...@gmail.com wrote:
 multiple collections may share the same config set and thus schema, so what 
 happens if someone does not know this and hits PUT 
 localhost:8983/solr/collection1/schema and it also affects the schema for 
 collection2?
 
 Hmm, that's a great question.  Querying against a named config rather than a 
 collection/core would not be an improvement, though, since the relationship 
 between the two wouldn't be represented in the request.
 
 Maybe if there were requests that returned the collections using a particular 
 named config, and vice versa, people could at least discover problematic 
 dependencies before they send schema modification requests?  Or maybe such 
 requests already exist?

Also, this doesn't have to be either/or (collection/core vs. config) - we could 
have another API that's config-specific, e.g. for the fields resource:

collection-specific:  http://localhost:8983/solr/collection1/schema/fields

config-specific:      http://localhost:8983/solr/configs/configA/schema/fields

Steve

Dynamic schema design: feedback requested

2013-03-06 Thread Steve Rowe
I'm working on SOLR-3251 https://issues.apache.org/jira/browse/SOLR-3251, to 
dynamically add fields to the Solr schema.

I posted a rough outline of how I propose to do this: 
https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13572875&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13572875.
  

So far, I've finished the first item - schema information REST requests, along 
with Restlet integration (moved up from the last item in the outline) - in 
SOLR-4503 https://issues.apache.org/jira/browse/SOLR-4503.

There are two specific concerns that I'd like feedback on: 1) schema 
serialization format and 2) disabling/enabling runtime schema modifications via 
REST API calls.  (I'd also be happy to get feedback on other aspects of this 
feature!)


1) Item #2 on the outline ("Change Solr schema serialization from XML to JSON, 
and provide an XML->JSON conversion tool") seems like it might be 
controversial, in that using JSON as the serialization format implies that Solr 
owns the configuration, and that direct user modification would no longer be 
the standard way to change the schema.

For most users, if a change is to be made, the transition will be an issue.  I 
think a hard break is off the table: whatever else happens, Solr will need to 
continue to be able to parse schema.xml, at least for all of 4.X and maybe 5.X 
too.

Two possible approaches:

a. When schema.xml is present, schema.json (if any) will be ignored.  Users 
could in this way signal whether dynamic schema modification is enabled: the 
presence of schema.xml indicates that the dynamic schema modification feature 
will be disabled.

b. Alternatively, the reverse: when schema.json is present, schema.xml will be 
ignored.  The first time schema.xml is found but schema.json isn't, schema.xml 
is automatically converted to schema.json.

I like option a. better, since it provides a stable situation for users who 
don't want the new dynamic schema modification feature, and who want to 
continue to hand edit schema.xml.  Users who want the new feature would use a 
command-line tool to convert their schema.xml to schema.json, then remove 
schema.xml from conf/.


2) Since the REST APIs to modify the schema will not be registerable 
RequestHandlers, there is no plan (yet) to disable schema modification 
requests.  Three possibilities come to mind:

a. A configuration setting in solrconfig.xml - this would be changeable only 
after restarting a node, e.g. a top-level <schema mutable="true|false"/>

b. A REST API call that allows for runtime querying and setting of the mutable 
status: http://localhost:8983/solr/schema/status would return current status, 
and adding query ?mutable=true/false would change it.

c. A combination of the above two: a configuration item in solrconfig.xml to 
enable the REST API, e.g. <schema enableMutable="true|false"/>, and then a REST 
API to query current status and dis/allow modifications at runtime: 
/solr/schema/status for current mutable status, and with query 
?mutable=true/false to change it.  The mutable status would always be false 
at startup, so the flow to make modifications would involve first making a REST 
PUT to /solr/schema?mutable=true.

I like option c. the best, since it would address concerns of users who don't 
want the schema to be modifiable.
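
[Sketch of option c's two-level gate -- only the solrconfig.xml flag, the
runtime toggle, and the false-at-startup default come from the proposal above;
the names are illustrative, not actual Solr code:]

    // Gate for runtime schema modification: solrconfig.xml must enable the
    // toggle at all, and the runtime flag always starts false.
    class SchemaMutability {
      private final boolean enableMutable;      // from <schema enableMutable="..."/>
      private volatile boolean mutable = false; // always false at startup

      SchemaMutability(boolean enableMutable) {
        this.enableMutable = enableMutable;
      }

      // Backs PUT /solr/schema?mutable=true|false
      void setMutable(boolean requested) {
        if (!enableMutable)
          throw new IllegalStateException("runtime schema mutation disabled in solrconfig.xml");
        mutable = requested;
      }

      // Every schema-modifying REST call would check this first.
      boolean isMutable() {
        return enableMutable && mutable;
      }
    }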


I look forward to hearing others' thoughts on these and any other issues 
related to dynamic schema modification.

Thanks,
Steve



Re: Dynamic schema design: feedback requested

2013-03-06 Thread Mark Miller
bq. Change Solr schema serialization from XML to JSON, and provide an XML->JSON 
conversion tool.

What is the motivation for the change? I think if you are sitting down and 
looking to design a schema, working with the XML is fairly nice and fast. I 
picture that a lot of people would start by working with the XML file to get it 
how they want, and then perhaps do future changes with the rest API. When you 
are developing, starting with the rest API feels fairly cumbersome if you have 
to make a lot of changes/additions/removals.

So why not just keep the XML and add the rest API? Do we gain much by switching 
it to JSON? I like JSON when it comes to rest, but when I think about editing a 
large schema doc locally, XML seems much easier to deal with.

- Mark

Re: Dynamic schema design: feedback requested

2013-03-06 Thread Steve Rowe
In response to my thoughts about using DOM as an intermediate representation 
for schema elements, for use in lazy re-loading on schema change, Erik Hatcher 
argued against (solely) using XML for schema serialization 
(https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13571631&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571631):

IMO - The XMLness of the current Solr schema needs to be isolated
to only one optional way of constructing an IndexSchema instance.
We want less XML rather than more. (for example, it should be
possible to have a relational database that contains a model of
a schema and load it that way)

I was hoping to avoid dealing with round-tripping XML comments (of which there 
are many in schema.xml).  My thought was that an XML->JSON conversion tool 
would insert "description" properties on the enclosing/adjacent object when it 
encounters comments.  But I suppose the same process could be applied to 
schema.xml: XML comments could be converted to "description" elements, and then 
when serializing changes, any user-inserted comments would be stripped.

The other concern is about schema ownership: dealing with schemas that mix 
hand-editing with Solr modification/serialization would likely be harder than 
supporting just one of them.  But I suppose there is already a set of validity 
checks, so maybe this wouldn't be so bad? 

Steve

On Mar 6, 2013, at 1:35 PM, Mark Miller markrmil...@gmail.com wrote:

 bq. Change Solr schema serialization from XML to JSON, and provide an 
 XML->JSON conversion tool.
 
 What is the motivation for the change? I think if you are sitting down and 
 looking to design a schema, working with the XML is fairly nice and fast. I 
 picture that a lot of people would start by working with the XML file to get 
 it how they want, and then perhaps do future changes with the rest API. When 
 you are developing, starting with the rest API feels fairly cumbersome if you 
 have to make a lot of changes/additions/removals.
 
 So why not just keep the XML and add the rest API? Do we gain much by 
 switching it to JSON? I like JSON when it comes to rest, but when I think 
 about editing a large schema doc locally, XML seems much easier to deal with.
 
 - Mark



Re: Dynamic schema design: feedback requested

2013-03-06 Thread Mark Miller
Hmm…I think I'm missing some pieces.

I agree with Erik that you should be able to load a schema from any object - a 
DB, a file in ZooKeeper, you name it. But by default, having that 
object be schema.xml seems nicest to me. That doesn't mean you have to use DOM 
or XML internally - just that you have a serializer/deserializer for it. If you 
wanted to do it from a database, that would just be another 
serialize/deserialize impl. Internally, it could all be JSON or Java objects, or 
whatever.

As far as a user editing the file AND rest API access, I think that seems fine. 
Yes, the user is in trouble if they break the file, but that is the risk they 
take if they want to manually edit it - it's no different than today when you 
edit the file and do a Core reload or something. I think we can improve some 
validation stuff around that, but it doesn't seem like a show stopper to me.

At a minimum, I think the user should be able to start with a hand modified 
file. Many people *heavily* modify the example schema to fit their use case. If 
you have to start doing that by making 50 rest API calls, that's pretty rough. 
Once you get your schema nice and happy, you might script out those rest calls, 
but initially, it's much faster/easier to whack the schema into place in a text 
editor IMO.

Like I said though, I may be missing something…

- Mark

On Mar 6, 2013, at 11:17 AM, Steve Rowe sar...@gmail.com wrote:

 In response to my thoughts about using DOM as an intermediate representation 
 for schema elements, for use in lazy re-loading on schema change, Erik 
 Hatcher argued against (solely) using XML for schema serialization 
 (https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13571631&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571631):
 
   IMO - The XMLness of the current Solr schema needs to be isolated
   to only one optional way of constructing an IndexSchema instance.
   We want less XML rather than more. (for example, it should be
   possible to have a relational database that contains a model of
   a schema and load it that way)
 
 I was hoping to avoid dealing with round-tripping XML comments (of which 
 there are many in schema.xml).  My thought was that an XML->JSON conversion 
 tool would insert "description" properties on the enclosing/adjacent object 
 when it encounters comments.  But I suppose the same process could be applied 
 to schema.xml: XML comments could be converted to "description" elements, and 
 then when serializing changes, any user-inserted comments would be stripped.
 
 The other concern is about schema ownership: dealing with schemas that mix 
 hand-editing with Solr modification/serialization would likely be harder than 
 supporting just one of them.  But I suppose there is already a set of 
 validity checks, so maybe this wouldn't be so bad? 
 
 Steve
 
 On Mar 6, 2013, at 1:35 PM, Mark Miller markrmil...@gmail.com wrote:
 
 bq. Change Solr schema serialization from XML to JSON, and provide an 
 XML->JSON conversion tool.
 
 What is the motivation for the change? I think if you are sitting down and 
 looking to design a schema, working with the XML is fairly nice and fast. I 
 picture that a lot of people would start by working with the XML file to get 
 it how they want, and then perhaps do future changes with the rest API. When 
 you are developing, starting with the rest API feels fairly cumbersome if 
 you have to make a lot of changes/additions/removals.
 
 So why not just keep the XML and add the rest API? Do we gain much by 
 switching it to JSON? I like JSON when it comes to rest, but when I think 
 about editing a large schema doc locally, XML seems much easier to deal with.
 
 - Mark
 



Re: Dynamic schema design: feedback requested

2013-03-06 Thread Steve Rowe
I'm not sure what pieces you might be missing, sorry.

I had thought about adding a web UI for schema composition, but that would be a 
major effort, and not in scope here.

I agree, though, especially without a full schema modification REST API, that 
hand editing will have to be supported.

Steve

On Mar 6, 2013, at 2:49 PM, Mark Miller markrmil...@gmail.com wrote:

 Hmm…I think I'm missing some pieces.
 
 I agree with Erik that you should be able to load a schema from any object - 
 a DB, a file in ZooKeeper, you name it. But by default, having that 
 object be schema.xml seems nicest to me. That doesn't mean you have to use 
 DOM or XML internally - just that you have a serializer/deserializer for it. 
 If you wanted to do it from a database, that would just be another 
 serialize/deserialize impl. Internally, it could all be JSON or Java objects, 
 or whatever.
 
 As far as a user editing the file AND rest API access, I think that seems 
 fine. Yes, the user is in trouble if they break the file, but that is the 
 risk they take if they want to manually edit it - it's no different than 
 today when you edit the file and do a Core reload or something. I think we 
 can improve some validation stuff around that, but it doesn't seem like a 
 show stopper to me.
 
 At a minimum, I think the user should be able to start with a hand modified 
 file. Many people *heavily* modify the example schema to fit their use case. 
 If you have to start doing that by making 50 rest API calls, that's pretty 
 rough. Once you get your schema nice and happy, you might script out those 
 rest calls, but initially, it's much faster/easier to whack the schema into 
 place in a text editor IMO.
 
 Like I said though, I may be missing something…
 
 - Mark
 
 On Mar 6, 2013, at 11:17 AM, Steve Rowe sar...@gmail.com wrote:
 
 In response to my thoughts about using DOM as an intermediate representation 
 for schema elements, for use in lazy re-loading on schema change, Erik 
 Hatcher argued against (solely) using XML for schema serialization 
 (https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13571631&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571631):
 
  IMO - The XMLness of the current Solr schema needs to be isolated
  to only one optional way of constructing an IndexSchema instance.
  We want less XML rather than more. (for example, it should be
  possible to have a relational database that contains a model of
  a schema and load it that way)
 
 I was hoping to avoid dealing with round-tripping XML comments (of which 
 there are many in schema.xml).  My thought was that an XML->JSON conversion 
 tool would insert "description" properties on the enclosing/adjacent object 
 when it encounters comments.  But I suppose the same process could be 
 applied to schema.xml: XML comments could be converted to "description" 
 elements, and then when serializing changes, any user-inserted comments 
 would be stripped.
 
 The other concern is about schema ownership: dealing with schemas that mix 
 hand-editing with Solr modification/serialization would likely be harder 
 than supporting just one of them.  But I suppose there is already a set of 
 validity checks, so maybe this wouldn't be so bad? 
 
 Steve
 
 On Mar 6, 2013, at 1:35 PM, Mark Miller markrmil...@gmail.com wrote:
 
 bq. Change Solr schema serialization from XML to JSON, and provide an 
 XML->JSON conversion tool.
 
 What is the motivation for the change? I think if you are sitting down and 
 looking to design a schema, working with the XML is fairly nice and fast. I 
 picture that a lot of people would start by working with the XML file to 
 get it how they want, and then perhaps do future changes with the rest API. 
 When you are developing, starting with the rest API feels fairly cumbersome 
 if you have to make a lot of changes/additions/removals.
 
 So why not just keep the XML and add the rest API? Do we gain much by 
 switching it to JSON? I like JSON when it comes to rest, but when I think 
 about editing a large schema doc locally, XML seems much easier to deal 
 with.
 
 - Mark
 
 



Re: Dynamic schema design: feedback requested

2013-03-06 Thread Mark Miller

On Mar 6, 2013, at 12:08 PM, Steve Rowe sar...@gmail.com wrote:

 I'm not sure what pieces you might be missing, sorry.

My main confusion is around this:

bq. When schema.xml is present, schema.json (if any) will be ignored.

Basically, why have schema.json? Perhaps it's just me, but a JSON schema file 
seems a lot harder to deal with as a human than an XML schema file.

Hence the rest of my comments - just because we don't use the DOM or XML 
internally doesn't mean we need to do JSON through the entire pipeline 
(e.g. the serialized representation).

- Mark



Re: Dynamic schema design: feedback requested

2013-03-06 Thread Steve Rowe
On Mar 6, 2013, at 3:33 PM, Mark Miller markrmil...@gmail.com wrote:
 On Mar 6, 2013, at 12:08 PM, Steve Rowe sar...@gmail.com wrote:
 I'm not sure what pieces you might be missing, sorry.
 
 My main confusion is around this:
 
 bq. When schema.xml is present, schema.json (if any) will be ignored.
 
 Basically, why have schema.json? Perhaps it's just me, but a JSON schema file 
 seems a lot harder to deal with as a human than an XML schema file.

Right, absolutely, the existence of schema.json assumes no human editing for 
exactly this reason, so it's in direct conflict with the need to continue to 
allow hand editing.

 Hence the rest of my comments - just because we don't use the DOM or XML 
 internally doesn't mean we need to do JSON through the entire 
 pipeline (e.g. the serialized representation).

I agree.

This all revolves around whether the schema serialization is an implementation 
detail that users don't have to care about.  We're not there yet, obviously.

Steve




Re: Dynamic schema design: feedback requested

2013-03-06 Thread Jan Høydahl
How will this all work with ZooKeeper and cloud?

Will ZK get pushed the serialized monolithic schema.xml / schema.json from the 
node which changed it, and then trigger an update to the rest of the cluster?

I was kind of hoping that once we have introduced ZK into the mix as our 
centralized config server, we could start using it as such consistently. And so 
instead of ZK storing a plain xml file, we split up the schema as native ZK 
nodes:

configs
 +--configA
    +--schema
       +--version: 1.5
       +--fieldTypes
       |  +--text_en {tokenizer: foo, filters: [{name: foo, class: solr.StrField...}, {name: bar...}]}
       |  +--text_no {tokenizer: foo, filters: [{name: foo, class: solr.StrField...}, {name: bar...}]}
       +--fields
          +--title ...

Then we or 3rd parties can build various tools to interact with the schema. 
Your REST service would read and update these manageable chunks in ZK, and it 
will all be in sync. It is also more 1:1 with how things are wired: multiple 
collections may share the same config set and thus schema, so what happens if 
someone does not know this and hits PUT localhost:8983/solr/collection1/schema 
and it also affects the schema for collection2? These relationships are already 
maintained in ZK.

I imagine we can do the same with solrconfig too: split it up into small 
information pieces kept in ZK. Then SolrCloud can have a compat mode 
serializing this info as the old familiar files for those who need an export to 
plain singlenode, or the opposite. Perhaps we can use ZK to keep N revisions 
too, so you could roll back a series of changes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

6. mars 2013 kl. 19:35 skrev Mark Miller markrmil...@gmail.com:

 bq. Change Solr schema serialization from XML to JSON, and provide an 
 XML->JSON conversion tool.
 
 What is the motivation for the change? I think if you are sitting down and 
 looking to design a schema, working with the XML is fairly nice and fast. I 
 picture that a lot of people would start by working with the XML file to get 
 it how they want, and then perhaps do future changes with the rest API. When 
 you are developing, starting with the rest API feels fairly cumbersome if you 
 have to make a lot of changes/additions/removals.
 
 So why not just keep the XML and add the rest API? Do we gain much by 
 switching it to JSON? I like JSON when it comes to rest, but when I think 
 about editing a large schema doc locally, XML seems much easier to deal with.
 
 - Mark



Re: Dynamic schema design: feedback requested

2013-03-06 Thread Chris Hostetter

: As far as a user editing the file AND rest API access, I think that 
: seems fine. Yes, the user is in trouble if they break the file, but that 

Ignoring for a moment what format is used to persist schema information, I 
think it's important to have a conceptual distinction between "data" that 
is managed by applications and manipulated by a REST API, and "config" 
that is managed by the user and loaded by Solr on init -- or via an 
explicit "reload config" REST API.

Past experience with how users perceive(d) solr.xml has heavily reinforced 
this opinion: on one hand, it's a place users must specify some config 
information -- so people want to be able to keep it in version control 
with other config files.  On the other hand, it's a live data file that 
is rewritten by Solr when cores are added.  (God help you if you want to do a 
rolling deploy of a new version of solr.xml where you've edited some of the 
config values while clients are simultaneously creating new SolrCores.)

As we move forward towards having REST APIs that treat schema information 
as data that can be manipulated, I anticipate the same types of 
confusion, misunderstanding, and grumblings if we try to use the same 
pattern of treating the existing schema.xml (or some new schema.json) as a 
hybrid "configs & data" file.  "Edit it by hand if you want, the /schema/* 
REST API will too!"  ... Even assuming we don't make any of the same 
technical mistakes that have caused problems with solr.xml round tripping 
in the past (ie: losing comments, reading new config options that we 
forget to write back out, etc...), I'm fairly certain there is still going 
to be a lot of things that will look weird and confusing to people.

(XML may have been designed to be both human readable & writable and 
machine readable & writable, but practically speaking it's hard to have a 
single XML file be both machine and human readable & writable.)

I think it would make a lot of sense -- not just in terms of 
implementation but also for end user clarity -- to have some simple, 
straightforward to understand caveats about maintaining schema 
information...

1) If you want to keep schema information in an authoritative config file 
that you can manually edit, then the /schema REST API will be read only. 

2) If you wish to use the /schema REST API for read and write operations, 
then schema information will be persisted under the covers in a data store 
whose format is an implementation detail just like the index file format.

3) If you are using a schema config file and you wish to switch to using 
the /schema REST API for managing schema information, there is a 
tool/command/API you can run to do so.

4) If you are using the /schema REST API for managing schema information, 
and you wish to switch to using a schema config file, there is a 
tool/command/API you can run to export the schema info in a config file 
format.


...whether or not the "under the covers in a data store" used by the REST 
API is JSON, or some binary data, or an XML file (just schema.xml w/o 
whitespace/comments) should be an implementation detail.  Likewise the 
question of whether some new config file formats are added -- it shouldn't 
matter.

If it's config, it's config and the user owns it.
If it's data, it's data and the system owns it.

: is the risk they take if they want to manually edit it - it's no 
: different than today when you edit the file and do a Core reload or 
: something. I think we can improve some validation stuff around that, but 
: it doesn't seem like a show stopper to me.

The new risk is multiple actors (both the user, and Solr) editing the 
file concurrently, and info that might be lost due to Solr reading the 
file, manipulating internal state, and then writing the file back out.

Eg: User hand edits may be lost if they happen on disk during Solr's 
internal manipulation of data.  API edits may be reflected in the internal 
state, but lost if the user writes the file directly and then does a core 
reload, etc.

: At a minimum, I think the user should be able to start with a hand 
: modified file. Many people *heavily* modify the example schema to fit 
: their use case. If you have to start doing that by making 50 rest API 
: calls, that's pretty rough. Once you get your schema nice and happy, you 
: might script out those rest calls, but initially, it's much 
: faster/easier to whack the schema into place in a text editor IMO.

I don't think there is any disagreement about that.  The ability to say 
"my schema is a config file and I own it" should always exist (remove 
it over my dead body).

The question is what trade-offs to expect/require for people who would 
rather use an API to manipulate these things -- I don't think it's 
unreasonable to say "if you would like to manipulate the schema using an 
API, then you give up the ability to manipulate it as a config file on 
disk".

(If you want the /schema API to drive your car, you have to take your 
foot off the pedals and let go of the steering wheel.)

Re: Dynamic schema design: feedback requested

2013-03-06 Thread Mark Miller

On Mar 6, 2013, at 4:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 I don't think it's 
 unreasonable to say "if you would like to manipulate the schema using an 
 API, then you give up the ability to manipulate it as a config file on 
 disk".

As long as you can initially work with an easily editable file, I have no 
problem then requiring you to stick with hand editing or move to using the 
REST API.

- Mark