[jira] [Commented] (SOLR-11741) Offline training mode for schema guessing

David Smiley (Jira) Sun, 15 Aug 2021 22:57:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17399533#comment-17399533
 ]


David Smiley commented on SOLR-11741:
-------------------------------------

Cassandra: [I sort of asked 
this|https://issues.apache.org/jira/browse/SOLR-15277?focusedCommentId=17314616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17314616]
and Tim thought maybe.  WDYT [~tpot]?  I just played with the Schema Designer 
on main to get a bit more familiar with it.  I uploaded a CSV file.  It did 
seem to guess the schema, but I didn't look thoroughly.  Tim's comments on that 
issue said it didn't guess, so maybe I'm wrong?

If Schema Designer does guess based on the data then I think there is much less 
value in the issue being discussed here, but it still has some.  If it were to 
be committed, I could imagine the Schema Designer being adapted to use it.  But 
I wouldn't want two competing systems to maintain.

Whatever happens, it's a shame to see some promising work by a contributor get 
forgotten after a few years. [~abhidemon], feel free to explicitly ask if you 
need more attention/feedback.

> Offline training mode for schema guessing
> -----------------------------------------
>
>                 Key: SOLR-11741
>                 URL: https://issues.apache.org/jira/browse/SOLR-11741
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: RuleForMostAccomodatingField.png, SOLR-11741-temp.patch, 
> SOLR-11741.patch, SOLR-11741.patch, SOLR-11741.patch, screenshot-1.png, 
> screenshot-3.png
>
>
> Our data-driven schema guessing doesn't work in many situations. For 
> example, if the first document has a field with value "0", it is guessed as 
> Long and subsequent fields with "0.0" are rejected. Similarly, if the same 
> field had alphanumeric contents for a later document, those documents are 
> rejected. Also, single vs. multi valued field guessing is not ideal.
> Proposing an offline training mode where Solr accepts a bunch of documents 
> and returns a guessed schema (without indexing). This schema can then be used 
> for actual indexing. I think the original idea is from Hoss.
> I think the initial implementation can be based on an UpdateRequestProcessor. 
> We can hash out the API soon, as we go along.
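
To make the failure case in the description concrete, here is a rough sketch 
of what guessing over a whole batch could look like. It is not the attached 
patch and not Solr code: the function names, the three type labels, and the 
output shape are all made up for illustration. The only idea taken from the 
description is that a field should get the most accommodating type seen 
across all sample documents, rather than the type of the first value.

```python
from typing import Any, Dict, Iterable, Mapping

# Hypothetical type labels, ordered from narrowest to widest.
# These are illustrative only and are not Solr field type names.
_RANK = {"long": 0, "double": 1, "string": 2}


def _guess_value_type(value: str) -> str:
    """Return the narrowest illustrative type that can hold this one value."""
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def guess_schema(docs: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Scan every document (nothing is indexed) and widen each field as needed."""
    schema: Dict[str, Dict[str, Any]] = {}
    for doc in docs:
        for name, raw in doc.items():
            values = raw if isinstance(raw, list) else [raw]
            entry = schema.setdefault(name, {"type": None, "multiValued": False})
            # A field is multi-valued if any document carries more than one value.
            if len(values) > 1:
                entry["multiValued"] = True
            for v in values:
                t = _guess_value_type(str(v))
                # Only ever widen: long -> double -> string.
                if entry["type"] is None or _RANK[t] > _RANK[entry["type"]]:
                    entry["type"] = t
    return schema


if __name__ == "__main__":
    sample = [
        {"price": "0", "tags": "a"},
        {"price": "0.0", "tags": ["b", "c"]},
    ]
    print(guess_schema(sample))
```

With the two sample documents, "price" ends up as a double instead of being 
locked to long by the first "0", and "tags" is marked multi-valued because one 
document carries two values.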



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
