Re: Deprecate Schemaless Mode?

Alexandre Rafalovitch Wed, 05 Aug 2020 09:18:26 -0700

As David said, I did a lot of breaking apart of default configuration
and it is a bit of a mess in there. (if anybody wants to review the
breakdown for Solr 6:
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016,
slide 19 is the kicker)


I certainly agree with others that said that it is very hard for a
user to figure out what a 'production' schema should look like and
they just keep the one we give, including the schemaless part and all.
This seems to crop up on the User list over and over again.

My +1 is for SOLR-11741 (Offline training mode) and for it being an
explicit configuration to let users define their own
chain/type-widening sequence. So, the user would throw a subset (or
all) of the data at a separate end-point and receive back the
suggested schema addition commands to support the data. Perhaps this
learning mode should not live in a default schema either but in a
kitchen sink one that also has all the extra type definitions
(separate discussion, especially since DIH and 5 DIH schemas are going
away as well).
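
To make the idea concrete, here is a rough Python sketch of what such a
learning step might do. It is not what SOLR-11741 implements: the function
names, the widening table and the skip list are all made up for
illustration. The only Solr-specific pieces are the shape of the Schema API
"add-field" command and the type names (plong, pdouble, pdate, boolean,
text_general), which I am assuming match the _default configset.

```python
import json
from datetime import datetime

# Hypothetical, user-editable widening table: when a field has been seen with
# both types in a pair, use the mapped type. Any pair not listed here falls
# back to the catch-all type.
WIDENING = {frozenset({"plong", "pdouble"}): "pdouble"}
FALLBACK = "text_general"


def guess_type(value):
    """Guess a field type name for a single sample value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "plong"
    if isinstance(value, float):
        return "pdouble"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return "pdate"
        except ValueError:
            return FALLBACK
    return FALLBACK


def widen(current, new, rules=WIDENING, fallback=FALLBACK):
    """Combine two guesses for the same field using the widening table."""
    if current == new:
        return current
    return rules.get(frozenset({current, new}), fallback)


def suggest_schema(docs, skip=("id",), rules=WIDENING):
    """Return a list of 'add-field' commands that would cover the sample docs."""
    types, multi = {}, set()
    for doc in docs:
        for name, value in doc.items():
            if name in skip:
                continue
            if isinstance(value, list):
                multi.add(name)
                values = value
            else:
                values = [value]
            for v in values:
                if v is None:
                    continue
                guess = guess_type(v)
                types[name] = widen(types[name], guess, rules) if name in types else guess
    return [
        {"add-field": {"name": name, "type": ftype, "multiValued": name in multi}}
        for name, ftype in sorted(types.items())
    ]


if __name__ == "__main__":
    sample = [
        {"id": "1", "price": 10, "title": "first"},
        {"id": "2", "price": 9.5, "added": "2020-08-05T09:18:26Z"},
        {"id": "3", "price": 3, "tags": ["a", "b"], "added": "yesterday"},
    ]
    for command in suggest_schema(sample):
        print(json.dumps(command))
```

The point of returning commands rather than applying them is that the user
gets to review (and edit) the suggestions before anything touches the
schema.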

Regards,
   Alex.

On Wed, 5 Aug 2020 at 01:01, David Smiley <dsmi...@apache.org> wrote:
>
> Thanks for starting this thread Marcus!  For a historical note, the current 
> _default configSet being "data driven" (aka "schemaless", a worse name) is 
> largely because of SOLR-10272.  Maybe I should have fought harder against it 
> then.  I threatened to veto but I was placated by it being easily disabled.  
> And it's true; you can disable it, and there are some loud warnings on the 
> CLI so... yeah.
>
> I think my views most align with Gus.  The name "default" is suggestive of 
> good settings you ought to change if you know what you are doing.  Perhaps 
> there simply can be no reasonable "default" for a search platform.  There 
> might be "basic minimal blah blah" etc. that _is_ the default choice if you 
> don't specify it but naming the configSet itself as "default" gives too much 
> blessing to it.  I've seen too many configs with tons of stuff that were 
> there because it was inherited, and then it's hard to guess what's _actually_ 
> being used.  Alexandre Rafalovitch had done some great work in figuring out how 
> to minimize configs.  There's more to do there.
>
> I'd be happy to see basically any change though; even a simple change from 
> opt-out to opt-in to "data driven" URPs.  I don't like the status quo.
>
> BTW I've also seen people try to take "bin/solr -e cloud" to production :-(   
> "Hey look, this is how a tutorial told me to run SolrCloud" (so the logic 
> goes).
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Aug 4, 2020 at 2:24 PM Jan Høydahl <jan....@cominvent.com> wrote:
>>
>> Learning mode won’t work if you have 10 existing collections and want to 
>> create #11. We could rather have a SchemaLearningUpdateHandler so people 
>> could explicitly post documents to say  /schema-guess to modify the schema. 
>> We could even have this implicit. Then the _default config would have just 
>> _root_, id and a few more, and if you want guessing you first send a number 
>> of docs to /schema-guess endpoint and then inspect in schema browser what 
>> you got. That handler could support a param &reset=true which would wipe the 
>> schema to start guessing from scratch.
>>
>> Jan Høydahl
>>
>> 4. aug. 2020 kl. 15:30 skrev Gus Heck <gus.h...@gmail.com>:
>>
>> 
>> Interesting read. Might have changed now that we have authentication 
>> capabilities... but let's not thread jack :)
>>
>> On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <erickerick...@gmail.com> 
>> wrote:
>>>
>>> Having the admin UI allow uploads may not be secure. When I had a similar 
>>> idea a long time ago it got shot down, see the discussion at: 
>>> https://issues.apache.org/jira/browse/SOLR-5287.
>>>
>>> I _think_ this is a different issue if the configs have to be residing on 
>>> the system, not coming in from outside, just FYI...
>>>
>>> > On Aug 3, 2020, at 7:03 PM, Gus Heck <gus.h...@gmail.com> wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Aug 3, 2020 at 5:03 PM Erick Erickson <erickerick...@gmail.com> 
>>> > wrote:
>>> > Gus’s point about implementing something before removing it is well 
>>> > taken, but we can deprecate it immediately without removing it. Gus’s 
>>> > point about dynamic fields not being found until later in the cycle is 
>>> > well taken, but not enough to persuade me.
>>> >
>>> > Fair enough :)
>>> >
>>> > I’m not enthusiastic about multiple getting started schemas. The whole 
>>> > motivation behind schemaless is that the user doesn’t need to know about 
>>> > schemas to get started. By providing multiple “getting started” schemas 
>>> > we require them to become aware of schemas again.
>>> >
>>> > Here's my theory (which may or may not be persuasive :) )
>>> >
>>> > My thinking in that suggestion is that the majority of the problem is due 
>>> > to the fact that people new to a technology will tend to latch onto the 
>>> > defaults that come with something as being something that should be held 
>>> > onto until you have a good reason to change it. This is reasonable 
>>> > because changing things you don't understand willy nilly is often a road 
>>> > to pain. And people DO want a safe starting point and we should give it 
>>> > to them because it makes their life easier once they get a little further 
>>> > down the road, but this is not compatible with the easy-start schemaless 
>>> > mode. Looking at 
>>> > https://lucene.apache.org/solr/guide/8_5/solr-tutorial.html I see that 
>>> > the initial tutorial experience is fully scripted, and the user won't 
>>> > likely notice if they are told to ignore _default or guessing-proto in 
>>> > favor of the tech products config set... BUT when they do get to the 
>>> > point of looking at the config name they'll see the more descriptive 
>>> > name. So rather than seeing "_default" and thinking "Ah ha! Here's 
>>> > something I can take as gospel and not change until I have a reason!" 
>>> > they'll see "guessing-proto" or "dynamic-proto" and say "Hunh, I wonder 
>>> > what that means?" which is a good question for them to ask I think.
>>> >
>>> > The concept of a default instills a strong bias toward not touching it (IMHO), 
>>> > which will be wrong most of the time no matter what we give them as a 
>>> > default. If something must be a default I'd favor a non-managed, 
>>> > non-dynamic, non-guessing minimal schema with the required fields, and an 
>>> > id field, maybe a _text_ field, and a comment pointing to the section of 
>>> > the ref guide where they can copy and paste in all the stuff that's 
>>> > currently in our base schema as example (things like the text_ga type), 
>>> > IF they want it. I get really tired of seeing mile long schemas that have 
>>> > a ton of unused stuff that is retained because people didn't know if they 
>>> > needed it or not...
>>> >
>>> > Note that not having some default would break back compat on bin/solr, 
>>> > but changing the default is also a break of sorts.
>>> >
>>> >
>>> > All that said, maybe we could rethink the approach. My two objections are:
>>> > 1> schemaless, by updating the schema based on a very small sample set is 
>>> > very susceptible to failing early and often
>>> > 2> Constantly updating the config in ZK and reloading the collections 
>>> > seems very hard to get right.
>>> >
>>> > I have for some time thought the inability to upload and download a 
>>> > config (or files within a config) via the web UI was a gap. But I found 
>>> > it easier to write 
>>> > https://plugins.gradle.org/plugin/com.needhamsoftware.solr-gradle than 
>>> > add that feature to the UI :)
>>> >
>>> > So I can imagine a “getting started” mode that indexed to the glob field 
>>> > while creating a schema. Ideally, it would be necessary to enable it 
>>> > specifically rather than have it be the default. I’d imagine this being 
>>> > coupled with some kind of “export schema” button. So the process would be
>>> > > start Solr with -Dsolr.learningmode.confg=some_config_name.
>>> > > index a bunch of documents, perhaps prototyping the search app on the 
>>> > > dynamic glob field.
>>> > > The admin UI should have a big, intrusive banner saying “RUNNING IN 
>>> > > LEARNING MODE” with instructions on what to do next.
>>> > > In that mode there’d need to be a “save schema” button or something. 
>>> > > What I’d like that to do would be examine the index and write a new 
>>> > > schema somewhere. If this was the mode, then you’d be able to run it any 
>>> > > time.
>>> >
>>> > +1 for anything that makes a round-trip of working with the schema 
>>> > easier, but not really a fan of learning mode.
>>> >
>>> >
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
