Putting this up top so people will read it ;)

Perhaps this is all just overthinking. Is the crux of the matter that schemaless is the default? Would it suffice to make it something that had to be explicitly enabled, rather than be something in solrconfig? In essence, flip the current way we do things where we can _disable_ schemaless via "bin/solr config -c mycollection -p 8983 -action set-user-property -property update.autoCreateFields -value false" and instead have it off by default and require that people _enable_ it when desired?
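For reference, as far as I can tell the same user property can also be set over HTTP through the Config API's set-user-property command rather than through bin/solr. Here is a minimal Python sketch of that request, assuming a local Solr at http://localhost:8983 and the collection name mycollection from the command above. The helper name build_autocreate_request is made up for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_autocreate_request(base_url, collection, enabled):
    """Build (but do not send) a Config API request that sets update.autoCreateFields.

    base_url and collection are assumptions for illustration; adjust for your install.
    """
    # The property value is passed as a string, mirroring "-value false" in the CLI form.
    payload = {
        "set-user-property": {
            "update.autoCreateFields": "true" if enabled else "false"
        }
    }
    url = f"{base_url}/solr/{collection}/config"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Build the "disable schemaless" request for the example collection.
req = build_autocreate_request("http://localhost:8983", "mycollection", enabled=False)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually apply it against a running Solr, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Flipping the default as suggested above would mean new users call something like this with enabled=True when they want field guessing, instead of having to know to call it with enabled=False before going to production.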
I think my antipathy is rooted in the fact that OOB, Solr enables schemaless. New users then have to find out, somewhere, that buried in the 1,500 pages of a ref guide they can’t search is a caution that you shouldn’t take Solr to production as it’s configured OOB. It’s far too easy to miss. At least if we required that people explicitly enable it, they’d have some incentive to look at https://lucene.apache.org/solr/guide/8_5/schemaless-mode.html where we call out not using it in production. Currently there isn’t any incentive to understand anything about schemaless before blithely going to production.

OK, on to my antipathy, some of which directly contradicts the above….

Having other “getting started” tools that aren’t recommended for production isn’t a justification for keeping something as problematic as schemaless. ExtractingRequestHandler is probably the closest in that it can unexpectedly blow up down the road. bin/post is reasonably safe, just inefficient.

Gus’s point about implementing something before removing it is well taken, but we can deprecate it immediately without removing it.

Gus’s point about dynamic fields not being found until later in the cycle is well taken, but not enough to persuade me.

I’m not enthusiastic about multiple getting started schemas. The whole motivation behind schemaless is that the user doesn’t need to know about schemas to get started. By providing multiple “getting started” schemas we require them to become aware of schemas again.

Sorry, Anshum, but “This feature isn't trappy unless people use it in ways it was not intended” is not persuasive at all. If we have such intentions, we should enforce them. How, I don’t quite know, however. How are users supposed to understand that some feature is or is not intended?

All that said, maybe we could rethink the approach.
My two objections are:
1> schemaless, by updating the schema based on a very small sample set, is very susceptible to failing early and often
2> Constantly updating the config in ZK and reloading the collections seems very hard to get right.

So I can imagine a “getting started” mode that indexed to the glob field while creating a schema. Ideally, it would be necessary to enable it specifically rather than have it be the default. I’d imagine this being coupled with some kind of “export schema” button. So the process would be

> start Solr with -Dsolr.learningmode.confg=some_config_name.
> index a bunch of documents, perhaps prototyping the search app on the dynamic glob field.
> The admin UI should have a big, intrusive banner saying “RUNNING IN LEARNING MODE” with instructions on what to do next.
> In that mode there’d need to be a “save schema” button or something. What I’d like that to do would be examine the index and write a new schema somewhere.
> If this was the mode, then you’d be able to run it any time.

> On Aug 3, 2020, at 2:39 PM, Gus Heck wrote:
>
> I almost never use schemaless mode (better named "schema guessing mode") and
> I would never recommend it for use beyond prototyping. The primary use I see
> for it is to throw a bunch of data at it to get a starting point for a
> schema... say for example you want to see what tika's going to produce for
> metadata before solidifying what you will and will not rely on. I think the
> ability to suggest a schema is valuable and shouldn't go away. I'm all for
> not having it be the default configuration however, and I really like the
> suggestions linked in the ticket for features that consider a number of
> documents before trying to guess the schema and if we implement one of those
> I'd be for deprecation and eventual removal, but not before.
>
> The ticket contains a suggestion of adding a catch all '*' dynamic field, but
> we should make sure to indicate that that ALSO is not typically good for
> production use because one garbage (or malicious) document can explode the
> number of fields in the index, or cause cases where forgetting to add a
> properly typed field makes it much further down the development cycle before
> getting caught. (i.e. not caught until a user tries to sort on it and gets 1,
> 10, 11, 2,... ), and dev churn due to data silently indexed into typo
> variants.... etc.
>
> Perhaps we should distribute more than one pre-baked config set and label
> none of them as "default"? I'd suggest maybe
> • guessing-proto --> our current _default possibly refined, for
>   prototyping
> • dynamic-proto --> a schema based on dynamic fields with a * default
>   to text-general as an alternative prototyping tool less dependent on data
>   order, but requiring more editing
> • managed-min --> A base on which to build a production quality managed
>   schema
> • static-min --> A base on which to build a production quality classic
>   (non-managed) schema
> Also +1 to renaming the feature away from "Schemaless" to "Schema Guessing"
>
> -Gus
>
> On Mon, Aug 3, 2020 at 11:33 AM Marcus Eagan wrote:
> Community,
>
> There are many of us that have had to deal with the pain of managing the
> schemaless mode of operation in Solr. I'm curious to get others' thoughts
> about how well it is working for them and if they would like to continue to
> use it.
>
> I for one don't think Schemaless works as intended and favor deprecating it
> and replacing it with something more usable but I am sure others have thoughts
> here.
>
> Is anyone on this list using schemaless mode in production or have you tried
> to?
>
> A preliminary discussion has occurred in this Jira ticket:
> https://issues.apache.org/jira/browse/SOLR-14701
>
> Thank you all,
>
> Marcus Eagan
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
