As David said, I did a lot of work breaking apart the default configuration, and it is a bit of a mess in there. (If anybody wants to review the breakdown for Solr 6: https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016, slide 19 is the kicker.)
I certainly agree with others who said that it is very hard for a user to
figure out what a 'production' schema should look like, so they just keep
the one we give, including the schemaless part and all. This seems to crop
up on the User list over and over again.

My +1 is for SOLR-11741 (Offline training mode) and for it being an explicit
configuration that lets users define their own chain/type-widening sequence.
So, the user would throw a subset (or all) of the data at a separate
end-point and receive back the suggested schema addition commands to support
the data.

Perhaps this learning mode should not live in a default schema either, but
in a kitchen-sink one that also has all the extra type definitions (separate
discussion, especially since DIH and the 5 DIH schemas are going away as
well).

Regards,
   Alex.

On Wed, 5 Aug 2020 at 01:01, David Smiley <dsmi...@apache.org> wrote:
>
> Thanks for starting this thread Marcus! For a historical note, the current
> _default configSet being "data driven" (aka "schemaless", a worse name) is
> largely because of SOLR-10272. Maybe I should have fought harder against it
> then. I threatened to veto but I was placated by it being easily disabled.
> And it's true; you can disable it, and there are some loud warnings on the
> CLI so... yeah.
>
> I think my views most align with Gus. The name "default" is suggestive of
> good settings you ought to change only if you know what you are doing.
> Perhaps there simply can be no reasonable "default" for a search platform.
> There might be "basic minimal blah blah" etc. that _is_ the default choice
> if you don't specify it, but naming the configSet itself as "default" gives
> too much blessing to it. I've seen too many configs with tons of stuff that
> were there because it was inherited, and then it's hard to guess what's
> _actually_ being used. Alexandre Rafalov had done some great work in
> figuring out how to minimize configs. There's more to do there.
>
> I'd be happy to see basically any change though; even a simple change from
> opt-out to opt-in to "data driven" URPs. I don't like the status quo.
>
> BTW I've also seen people try to take "bin/solr -e cloud" to production :-(
> "Hey look, this is how a tutorial told me to run SolrCloud" (so the logic
> goes).
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Aug 4, 2020 at 2:24 PM Jan Høydahl <jan....@cominvent.com> wrote:
>>
>> Learning mode won’t work if you have 10 existing collections and want to
>> create #11. We could rather have a SchemaLearningUpdateHandler so people
>> could explicitly post documents to, say, /schema-guess to modify the
>> schema. We could even have this implicit. Then the _default config would
>> have just _root_, id and a few more, and if you want guessing you first
>> send a number of docs to the /schema-guess endpoint and then inspect in the
>> schema browser what you got. That handler could support a param
>> &reset=true which would wipe the schema to start guessing from scratch.
>>
>> Jan Høydahl
>>
>> On 4 Aug 2020 at 15:30, Gus Heck <gus.h...@gmail.com> wrote:
>>
>> Interesting read. Might have changed now that we have authentication
>> capabilities... but let's not threadjack :)
>>
>> On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>>
>>> Having the admin UI allow uploads may not be secure. When I had a similar
>>> idea a long time ago it got shot down, see the discussion at:
>>> https://issues.apache.org/jira/browse/SOLR-5287.
>>>
>>> I _think_ this is a different issue if the configs have to be residing on
>>> the system, not coming in from outside, just FYI...
>>>
>>> > On Aug 3, 2020, at 7:03 PM, Gus Heck <gus.h...@gmail.com> wrote:
>>> >
>>> > On Mon, Aug 3, 2020 at 5:03 PM Erick Erickson <erickerick...@gmail.com>
>>> > wrote:
>>> > Gus’s point about implementing something before removing it is well
>>> > taken, but we can deprecate it immediately without removing it. Gus’s
>>> > point about dynamic fields not being found until later in the cycle is
>>> > well taken, but not enough to persuade me.
>>> >
>>> > Fair enough :)
>>> >
>>> > I’m not enthusiastic about multiple getting started schemas. The whole
>>> > motivation behind schemaless is that the user doesn’t need to know about
>>> > schemas to get started. By providing multiple “getting started” schemas
>>> > we require them to become aware of schemas again.
>>> >
>>> > Here's my theory (which may or may not be persuasive :) )
>>> >
>>> > My thinking in that suggestion is that the majority of the problem is due
>>> > to the fact that people new to a technology will tend to latch onto the
>>> > defaults that come with it as something to hold onto until they have a
>>> > good reason to change them. This is reasonable because changing things
>>> > you don't understand willy-nilly is often a road to pain. And people DO
>>> > want a safe starting point and we should give it to them because it
>>> > makes their life easier once they get a little further down the road,
>>> > but this is not compatible with the easy-start schemaless mode. Looking
>>> > at https://lucene.apache.org/solr/guide/8_5/solr-tutorial.html I see
>>> > that the initial tutorial experience is fully scripted, and the user
>>> > won't likely notice if they are told to ignore _default or
>>> > guessing-proto in favor of the techproducts config set... BUT when they
>>> > do get to the point of looking at the config name they'll see the more
>>> > descriptive name. So rather than seeing "_default" and thinking "Ah ha!
>>> > Here's something I can take as gospel and not change until I have a
>>> > reason!" they'll see "guessing-proto" or "dynamic-proto" and say "Hunh,
>>> > I wonder what that means?" which is a good question for them to ask I
>>> > think.
>>> >
>>> > The concept of a default carries a strong bias toward not touching it
>>> > (IMHO), which will be wrong most of the time no matter what we give them
>>> > as a default. If something must be a default I'd favor a non-managed,
>>> > non-dynamic, non-guessing minimal schema with the required fields, and an
>>> > id field, maybe a _text_ field, and a comment pointing to the section of
>>> > the ref guide where they can copy and paste in all the stuff that's
>>> > currently in our base schema as example (things like the text_ga type),
>>> > IF they want it. I get really tired of seeing mile-long schemas that have
>>> > a ton of unused stuff that is retained because people didn't know if they
>>> > needed it or not...
>>> >
>>> > Note that not having some default would break back compat on bin/solr,
>>> > but changing the default is also a break of sorts.
>>> >
>>> >
>>> > All that said, maybe we could rethink the approach. My two objections are:
>>> > 1> Schemaless, by updating the schema based on a very small sample set,
>>> > is very susceptible to failing early and often.
>>> > 2> Constantly updating the config in ZK and reloading the collections
>>> > seems very hard to get right.
>>> >
>>> > I have for some time thought the inability to upload and download a
>>> > config (or files within a config) via the web UI was a gap. But I found
>>> > it easier to write
>>> > https://plugins.gradle.org/plugin/com.needhamsoftware.solr-gradle than
>>> > add that feature to the UI :)
>>> >
>>> > So I can imagine a “getting started” mode that indexed to the glob field
>>> > while creating a schema. Ideally, it would be necessary to enable it
>>> > specifically rather than have it be the default.
>>> > I’d imagine this being coupled with some kind of “export schema” button.
>>> > So the process would be
>>> > > start Solr with -Dsolr.learningmode.confg=some_config_name.
>>> > > index a bunch of documents, perhaps prototyping the search app on the
>>> > > dynamic glob field.
>>> > > The admin UI should have a big, intrusive banner saying “RUNNING IN
>>> > > LEARNING MODE” with instructions on what to do next.
>>> > > In that mode there’d need to be a “save schema” button or something.
>>> > > What I’d like that to do would be examine the index and write a new
>>> > > schema somewhere. If this was the mode, then you’d be able to run it
>>> > > any time.
>>> >
>>> > +1 for anything that makes a round-trip of working with the schema
>>> > easier, but not really a fan of learning mode.
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)