[jira] [Commented] (SOLR-14701) Deprecate Schemaless Mode (Discussion)

Alexandre Rafalovitch (Jira) Sun, 13 Sep 2020 07:09:19 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195048#comment-17195048
 ]


Alexandre Rafalovitch commented on SOLR-14701:
----------------------------------------------

Eric, these are great thoughts and I look forward to others commenting on 
those. This issue did start as a higher-level discussion actually, so we can 
still adjust/redo/drop the code part.

My thoughts specifically on what you are saying are:
* The crux of the issue. I think the real issue is that we chased somebody 
else's approach for the sake of "mindshare" and then paid the same price they 
did on the technical level. But now we cannot just rip out what we have, 
because it is kind of half-interesting. So we are cleaning it up and 
side-lining it into a learning schema, where it has a chance to be actually 
useful. That is the context for why I feel it was worth cleaning up. A use 
case I see is somebody throwing their data into it and being able to do some 
faceting/analytics/sorting on it to understand what they actually have. 
*  ??Anything is a multivalued string??. 
You can do that already with {{<dynamicField name="*"...>}}. I did [a 
presentation around 
that|https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory]
 a while ago. With that in place, you don't need this guessing chain, and the 
chain would actually not do anything, because it skips any existing field or 
dynamicField. But it could stop skipping dynamic fields and create specific 
fields while still leaving the fallback. Or it could have a section listing 
which dynamic fields to override, if you still want something_i to map 
directly but any others to be type-guessed. I think this could be an 
interesting approach even for a production schema.
* Additionally, the current implementation accumulates data for all the fields 
and only skips them at commit. Theoretically, it could compare the types and 
issue some advice like "currently this field matches a multiValued String 
glob, but it could actually be a single-valued date, as long as you keep the 
date-parsing URP in the default chain". It would be easy enough to write that 
to a log at Warning/Info level.
* Modify across multiple commits. We have a general issue with modifying field 
definitions, because the Lucene-level data structures cannot go from 
single-valued to multi-valued and hand-editing definitions can throw scary 
exceptions, and because the deleted values are still in the index until it is 
empty or the segment is purged. There has been a discussion about a 
"rewriting segment filter" that does some of that cleanup/refactoring, but I 
am not sure whether that has happened at the Lucene level yet. There are two 
ways this solution is a bit better than the existing one, however. The first 
is that it does not create definitions until commit, so you can feed it all 
your different spaCy outputs and then do a commit on the joint definitions. Of 
course, you then have to rerun the queries to get the data in, but that's the 
price. The second is the corollary of that awkwardness: after the schema is 
created, the data is not in yet, so you can go and review the configurations 
and adjust them as desired.
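
To make the "compare types and issue advice" idea above more concrete, here is 
a minimal Python sketch. It is not Solr code and does not reflect how the URP 
chain is implemented; the names guess_type, FieldObservations, observe and 
advise are hypothetical, and the type guessing is deliberately naive (mixed 
types simply fall back to string).

```python
from datetime import datetime


def guess_type(value):
    """Guess the narrowest type name for one raw string value (naive sketch)."""
    parsers = (("long", int), ("double", float), ("date", datetime.fromisoformat))
    for name, parse in parsers:
        try:
            parse(value)
            return name
        except ValueError:
            continue
    return "string"


class FieldObservations:
    """Hypothetical accumulator: what has been seen for one field so far."""

    def __init__(self):
        self.types = set()
        self.multi_valued = False

    def add(self, values):
        if len(values) > 1:
            self.multi_valued = True
        self.types.update(guess_type(v) for v in values)


def observe(docs, seen=None):
    """Accumulate observations; docs map field name -> list of raw strings."""
    seen = {} if seen is None else seen
    for doc in docs:
        for field, values in doc.items():
            seen.setdefault(field, FieldObservations()).add(values)
    return seen


def advise(seen):
    """Report fields that could be narrower than a multiValued string."""
    advice = []
    for field, obs in sorted(seen.items()):
        # Assumption: anything with mixed guesses stays a plain string.
        guessed = next(iter(obs.types)) if len(obs.types) == 1 else "string"
        if guessed != "string" or not obs.multi_valued:
            card = "multiValued" if obs.multi_valued else "single-valued"
            advice.append(
                f"{field}: matches the multiValued string fallback, "
                f"but could be a {card} {guessed}"
            )
    return advice


docs = [
    {"created": ["2020-09-13"], "tags": ["a", "b"]},
    {"created": ["2020-09-14T07:09:19"], "tags": ["c"]},
]
for line in advise(observe(docs)):
    print(line)
```

Because observe only accumulates, several batches can be passed through it 
(reusing the same seen dictionary) before advise is called, which loosely 
mirrors deferring the definitions until commit.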

> Deprecate Schemaless Mode (Discussion)
> --------------------------------------
>
>                 Key: SOLR-14701
>                 URL: https://issues.apache.org/jira/browse/SOLR-14701
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Marcus Eagan
>            Assignee: Alexandre Rafalovitch
>            Priority: Major
>         Attachments: image-2020-08-04-01-35-03-075.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I know this won't be the most popular ticket out there, but I am growing more 
> and more sympathetic to the idea that we should rip many of the freedoms out 
> that cause users more harm than not. One of the freedoms I saw time and time 
> again to cause issues was schemaless mode. It doesn't work as named or 
> documented, so I think it should be deprecated. 
> If you use it in production reliably and in a way that cannot be accomplished 
> another way, I am happy to hear from more knowledgeable folks as to why 
> deprecation is a bad idea. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
