[jira] [Commented] (SOLR-11917) A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring docValues

Hoss Man (JIRA) Fri, 26 Jan 2018 14:22:43 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341692#comment-16341692
 ]


Hoss Man commented on SOLR-11917:
---------------------------------

h2. *S2.1*: Easy Multi-Language Querying (SOLR-6492)
h3. *S2.1G*: Goal

Simplified indexing & querying of text in diff languages w/o the query clients 
being _required_ to know about a lot of language specific variant field names. 
At index time we want things to be "easy" for clients sending documents, 
regardless of whether they already know the lang of each field value in 
advance, or if they want solr to do language detection.
h3. *S2.1A*: Suggested Approach
{panel:title=Refresher: Summary of Solr In Action (SIA) code linked to from SOLR-6492}
*What's included & how it works...*
 * custom update process & custom field type
 ** processor is subclass of existing lang detect update processor
 *** super class normally adds a field with languages in doc, or renames fields 
to include language (ie: text => text_de)
 ** field type is subclass of TextField
 *** goes out of its way to override any Analyzer config with a custom one 
(details below)
 *** configured with a list of mappings from langid to other (existing) field 
types
 * Index Time:
 ** update processor delegates to super to detect languages but instead of (in 
addition to?) super class's behavior of adding a language field to doc, or 
renaming the field with suffix, the custom processor "decorates" the values 
with the detected language(s)...
 ** for any field where the field type is our custom type:
 *** "decorate" each of the field values with either:
 **** the langs of the whole doc
 **** the langs of the field (after re-running lang detect on all values in 
just that field)
 **** the langs of the individual field value (after re-running lang detect on 
just that field value)
 ** other processors can then run as normal, and eventually the IndexSchema is 
asked to build up the IndexableFields for this doc, and it delegates to the 
(custom) field type for these "decorated" fields...
 ** field type's custom analyzer looks for these lang "decorations" on each 
field value
 *** for every lang found, go fetch the analyzer from the mapped field type 
it's configured with
 *** create a token stream that delegates to all the other analyzers & merges 
the resulting token streams
 *** all this custom delegation/merging tokenstream stuff is 
(optionally/wisely) wrapped in RemoveDuplicatesTokenFilter since there can be 
lots of dup tokens for similar languages.
 * Query Time:
 ** the query string provided by the user can be "decorated" with a list of 
languages
 ** the normal plumbing of TextField analyzes the query string, delegating to 
the various analyzers
 *** AFAICT: this means MultiTermPhraseQueries are frequently produced?
 * NOTE: as mentioned in Trey's LR talk for 2014, a "perk" of this solution 
(over using diff fields per languages) is that mixing languages in one field 
value can – in theory – still produce useful phrase queries, even if the 
non-correct analyzers butcher the terms in other languages such that a single 
phrase produced by either language analyzer wouldn't match the original string
 ** [https://www.youtube.com/watch?v=MQ6WtBw8T_U]
 ** BUT: it's not really clear if/how useful/important this is. _Does anyone 
have any actual usecases for this???_

*The Fiddly / Awkward / Problematic Bits Of All This Existing Code*
 * language "decoration" is super hackish
 ** index time:
 *** the update processor prepends them as a string
 *** not a lot of easy improvements currently possible given the current 
SolrInputDocument / UpdateProcessor / DocumentBuilder structure / code paths
 **** fixing this "THE RIGHT WAY" would probably require some pretty big 
changes to all this code so SolrInputField could support arbitrary metadata 
(instead of just "boost" like it does today) and passing the SolrInputFields 
all the way to the FieldType's createFields method
 **** the hackish way to do this might be to follow in the footsteps of atomic 
update with "field value may be a map containing magic keys", but...
 ***** this would probably break Atomic Updates (unexpected keys in the Maps it 
thinks it owns)
 ***** this was already a super heinous API hack and hacks this heinous should 
not be rewarded by being copied
 ***** Even if we did this, i'm not certain the FieldType's createFields() 
would get the full Map w/o a bunch of other changes in the middle – if we're 
going to have to change existing DocumentBuilder/IndexSchema code to make this 
work, let's not be heinous about it.
 ** query time:
 *** user must prefix the terms _inside_ the query strings – _after_ the 
field name
 *** example: {{q=my_multi_lang_field:"en,es|Hello there compadre"}}
 *** fixing this in a sane way should be really straightforward...
 **** all of the "public Query getFoo(...)" methods a FieldType must implement 
take in the QParser originating the query
 **** we can ask the QParser for the local/req params
 **** so syntax like "q= \{!field f=body langs='en,es'}Hello there compadre" 
would be easy to support
 * the tokenstream merging slurps in the entire Reader as a String on first 
use, then pre-analyzes using every analyzer and builds up an in memory 
LinkedList<Token>
 ** why is this needed? why can't we just cache _one_ "Token" per Analyzer? 
ie...
 *** each call to incrementToken() calls incrementToken() on any delegate 
analyzer where we don't have a cached token (and which says it's not done with 
the input)
 *** then return (and null out) whichever cached token has the lowest position
 ** also: since this is super custom code and we know the way our analyzer is 
getting used is from our custom FieldType, why mess with a Reader -> String at 
all since we know for certain the Reader is a StringReader
 *** ie: bypass the normal "this.getAnalyzer()" and just give the "Analyzer" 
the original String?
{panel}
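The "cache _one_ Token per Analyzer" idea from the refresher above can be sketched as a small k-way merge. This is only an illustration in plain Java: {{Tok}} and {{MergingStream}} are hypothetical stand-ins (not Lucene's TokenStream API), each delegate is modeled as an Iterator that already yields tokens in position order, and the RemoveDuplicatesTokenFilter step is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token: its text plus an absolute position. */
final class Tok {
    final String term;
    final int position;

    Tok(String term, int position) {
        this.term = term;
        this.position = position;
    }
}

/** Merges position-ordered token sources while buffering at most one token per source. */
final class MergingStream {
    private final List<Iterator<Tok>> delegates;
    private final Tok[] cached; // one slot per delegate; null means "needs a refill"

    MergingStream(List<Iterator<Tok>> delegates) {
        this.delegates = delegates;
        this.cached = new Tok[delegates.size()];
    }

    /** Returns the next token in position order, or null once every delegate is exhausted. */
    Tok next() {
        int best = -1;
        for (int i = 0; i < cached.length; i++) {
            // Only advance delegates whose slot is empty and that still have input.
            if (cached[i] == null && delegates.get(i).hasNext()) {
                cached[i] = delegates.get(i).next();
            }
            if (cached[i] != null && (best < 0 || cached[i].position < cached[best].position)) {
                best = i;
            }
        }
        if (best < 0) {
            return null;
        }
        Tok out = cached[best];
        cached[best] = null; // return (and null out) the lowest-position token
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Tok>> sources = new ArrayList<>();
        sources.add(Arrays.asList(new Tok("hello", 0), new Tok("there", 1)).iterator());
        sources.add(Arrays.asList(new Tok("hello", 0), new Tok("compadr", 2)).iterator());
        MergingStream merged = new MergingStream(sources);
        for (Tok t = merged.next(); t != null; t = merged.next()) {
            System.out.println(t.position + ":" + t.term);
        }
    }
}
```

Memory use in this sketch is bounded by the number of delegates rather than by the length of the input, which is the point of the suggestion.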
h4. *S2.1.STRAW1*: Straw Man Proposal #1 – aka "Complex For Users"

Existing SIA Code + query time local params
 * keep most of the existing SOLR-6492 code as is
 ** all indexing code and update processor sub class stay the same
 *** including the hackish way we have to prefix-decorate the langs on field 
values at index time
 *** hopefully fix the analyzer to be more efficient
 ** at query time:
 *** in the FieldType: override the "public Query getFoo(...)" methods to look 
at the local/req params for a langs
 *** use those langs when using our custom (wrapper) analyzer
 *** *NOTE:* This cleaner query time API still has a hitch – see *S2.1.HITCH* 
below
 * This approach seems more "complex" to explain to users than the strawman #2 
(*S2.1.STRAW2*) below
 ** particularly given the dependency on the new update processor (or users 
adding magic field value decoration at index time)
 ** and especially if/when they gain more experience with solr and want to 
understand more what's happening under the covers and how to tweak/customize 
behavior.
 ** see full pro/con list below
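
To make the query time half of this strawman concrete, here is a rough sketch of how a field type might turn a langs param into the list of mapped field type names, falling back to the default type. {{LangMappings}} is a hypothetical helper written in plain Java, not an existing Solr class, and silently ignoring unknown langs is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps lang codes to the names of other (existing) field types. */
final class LangMappings {
    private final Map<String, String> typeByLang = new LinkedHashMap<>();
    private final String defaultType;

    /** fieldMappings is a comma separated list of lang:fieldTypeName pairs. */
    LangMappings(String fieldMappings, String defaultType) {
        this.defaultType = defaultType;
        for (String pair : fieldMappings.split(",")) {
            String[] kv = pair.trim().split(":", 2);
            if (kv.length == 2) {
                typeByLang.put(kv[0].trim(), kv[1].trim());
            }
        }
    }

    /** langsParam is a value such as "en,fr"; null, blank or unknown langs fall back to the default. */
    List<String> resolve(String langsParam) {
        List<String> types = new ArrayList<>();
        if (langsParam != null) {
            for (String lang : langsParam.split(",")) {
                String type = typeByLang.get(lang.trim());
                if (type != null && !types.contains(type)) {
                    types.add(type);
                }
            }
        }
        if (types.isEmpty()) {
            types.add(defaultType);
        }
        return types;
    }

    public static void main(String[] args) {
        LangMappings mappings = new LangMappings("en:text_english, fr:text_french", "text_general");
        System.out.println(mappings.resolve("en,fr"));
        System.out.println(mappings.resolve(null));
    }
}
```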

h4. *S2.1E.STRAW1*: Hypothetical Example Usage of this *S2.1.STRAW1* Strawman...
{code:xml}
<field name="title" type="langaware" />
<field name="body" type="langaware" />

<fieldType name="langaware" class="solr.MultiLangAwareTextField"
           defaultFieldType="text_general"
           fieldMappings="en:text_english,
                          la:text_latin,
                          fr:text_french"/>
<fieldType name="text_general" ... />
<fieldType name="text_english" ... />
<fieldType name="text_french" ... />
<fieldType name="text_latin" ... />
{code}
{code:xml}
<!-- doc sent by client using new custom update processor -->
<doc>
  <field name="title">Solr In Action</field>
  <field name="body">Ipsum Lorem ... thousands of pages of text</field>
</doc>

<!-- doc sent by client that knows what lang these fields are -->
<doc>
  <field name="title">en|Solr In Action</field>
  <field name="body">la|Ipsum Lorem ... thousands of pages of text</field>
</doc>

{code}
{noformat}
# Uses the lang specific analysis the user asked for
/query?q={!lang=la}body:Lorem&fq={!field f=title lang=en}Action

# Falls back to the text_general analysis since no lang is known
/query?q=body:Lorem&fq={!field f=title}Action
{noformat}
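For illustration, the prefix "decoration" used in the example docs above could be split off a field value roughly like this. {{DecoratedValue}} is a hypothetical class, and the rule that a prefix must be a comma separated list of 2-3 letter codes followed by "|" is an assumption of this sketch (the existing code may be looser). It also shows why the hack is fragile: any ordinary value that happens to start with something like "en|" would be misread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for a field value and the langs that were prepended to it. */
final class DecoratedValue {
    // Assumed prefix shape: "en|" or "en,es|" in front of the real text.
    private static final Pattern PREFIX =
        Pattern.compile("([A-Za-z]{2,3}(?:,[A-Za-z]{2,3})*)\\|(.*)", Pattern.DOTALL);

    final List<String> langs;
    final String text;

    private DecoratedValue(List<String> langs, String text) {
        this.langs = langs;
        this.text = text;
    }

    /** Splits "en,es|Hello" into langs [en, es] and text "Hello"; undecorated values get no langs. */
    static DecoratedValue parse(String raw) {
        Matcher m = PREFIX.matcher(raw);
        if (!m.matches()) {
            return new DecoratedValue(Collections.<String>emptyList(), raw);
        }
        return new DecoratedValue(Arrays.asList(m.group(1).split(",")), m.group(2));
    }

    public static void main(String[] args) {
        DecoratedValue v = DecoratedValue.parse("en|Solr In Action");
        System.out.println(v.langs + " -> " + v.text);
    }
}
```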
h4. *S2.1.STRAW2*: Straw Man Proposal #2 – aka "Simple for Users"

Override only the Query parsing bits of TextField (or super duper text field)
 * continue using diff fields per language (either dynamic or explicitly) in 
schema
 * continue using the existing clone / lang detect update processors to 
copy/rename fields (ie: title => title + title_es)
 * let the analyzer for types like "text_es" do its regular analysis and 
indexing into the underlying fields like "title_es"
 * let types like "text_multilang" for fields like "title" be a new 
QueryLangAwareProxyTextField that extends TextField
 ** still supports a direct analyzer configuration for its "default" behavior 
(ie: something simple that is as lang agnostic as possible, aka: text_general)
 ** at index time, just does its regular indexing with its configured analyzer
 ** at query time:
 *** if the QParser's params don't indicate a language, do a normal query 
against the specified field
 *** if the QParser does specify some languages:
 **** build up a list of lang specific field names using the current field name + 
the languages (ie "title" + "_" + "es")
 ***** fetch the FieldTypes for each of those field names from the IndexSchema
 ***** delegate to the equivalent "public Query getFoo" for each of those 
FieldTypes, wrap the results in a DisjunctionMaxQuery
 *** *NOTE:* This use of QParser params still has the same hitch as strawman #1 
(*S2.1.STRAW1*) – see *S2.1.HITCH* below
 * This approach seems simpler to explain to new users than strawman #1 
(*S2.1.STRAW1*)
 ** particularly given that it can be useful (in a clean way) even w/o any 
(new) langid update processors when users already know the language of the 
fields for each doc, but just want simplified querying.
 ** see full pro/con list below
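
The query time proxying described in this strawman could look something like the following sketch. {{LangFieldProxy}} is hypothetical, the schema lookup is reduced to a Predicate, and the query is rendered as a display string instead of real Lucene Query objects. Skipping langs that have no matching field in the schema is an assumption of the sketch, not something decided above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of proxying a query on "title" to "title_es", "title_en", ... */
final class LangFieldProxy {

    /** Returns the field names a query against baseField should be rewritten to. */
    static List<String> targetFields(String baseField, List<String> langs,
                                     Predicate<String> schemaHasField) {
        List<String> targets = new ArrayList<>();
        for (String lang : langs) {
            String candidate = baseField + "_" + lang;
            if (schemaHasField.test(candidate)) {
                targets.add(candidate);
            }
        }
        if (targets.isEmpty()) {
            targets.add(baseField); // no usable langs: normal query against the field itself
        }
        return targets;
    }

    /** Renders the per-field clauses as one "any of these" string, purely for display. */
    static String describe(List<String> fields, String term) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(field).append(':').append(term);
        }
        return fields.size() > 1 ? "(" + sb + ")" : sb.toString();
    }

    public static void main(String[] args) {
        Predicate<String> schema = f -> f.endsWith("_en") || f.endsWith("_es");
        List<String> fields = targetFields("title", Arrays.asList("es", "en"), schema);
        System.out.println(describe(fields, "hola"));
    }
}
```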

h4. *S2.1E.STRAW2*: Hypothetical Examples of this *S2.1.STRAW2* Strawman #2...
{code:xml}
<field name="title" type="langaware" />
<field name="body" type="langaware" />

<fieldType name="langaware" class="solr.QueryLangAwareProxyTextField">
  <!-- no special mappings needed, just simple lang agnostic default analyzers 
-->
  <analyzer type="index" ... />
  <analyzer type="query" ... />
</fieldType>

<dynamicField name="*_en" type="text_english" ... />
<dynamicField name="*_fr" type="text_french" ... />
<dynamicField name="*_la" type="text_latin" ... />

<fieldType name="text_english" ... />
<fieldType name="text_french" ... />
<fieldType name="text_latin" ... />
{code}
{code:xml}
<!-- sample doc sent by client using langid update processor -->
<!-- title copied to title_en, body copied to body_la -->
<doc>
  <field name="title">Solr In Action</field>
  <field name="body">Ipsum Lorem ... thousands of pages of text</field>
</doc>

<!-- sample doc sent by client that knows what lang these fields are -->
<!-- CloneFieldUpdateProcessor or something simple like it can copy these to 
"title" & "body" -->
<doc>  
  <field name="title_en">Solr In Action</field>
  <field name="body_en">Ipsum Lorem ... thousands of pages of text</field>
</doc>

{code}
{noformat}
# rewrites the queries against the lang specific versions using the langs the 
# user asked for
/query?q={!lang=la}body:Lorem&fq={!field f=title lang=en}Action

# Falls back to the default analysis (configured on 'langaware' type) since no 
# 'lang' is specified
/query?q=body:Lorem&fq={!field f=title}Action

# user can still choose to sort on, or filter against, the existence of data in 
# specific language fields
/query?q=body_la:Lorem&sort=title_la asc
{noformat}
h4. *S2.1.HITCH*: One Hitch @ Query Time To Both Strawmen

Currently, SolrQueryParserBase/QueryBuilder sometimes uses the "Analyzer" 
(IndexSchema's per field wrapper) directly w/o delegating to 
ft.getFieldQuery(...).

*Best solution I can think of:*
 * SolrQueryParserBase should override createFieldQuery(Analyzer,...) in a way 
that it can delegate to the FieldType
 ** must happen in such a way that the FieldType can make a callback to the low 
level QueryBuilder.createFieldQuery – otherwise we'll have to copy/paste a lot 
of existing code.
 ** NOTE: QueryBuilder.createFieldQuery is currently protected.
 * This callback should involve a QParser (like the existing "public Query 
getFoo" methods on FieldType) to access the flags/params to capture some of the 
QueryBuilder state / variables passed to createFieldQuery
 ** in our special case, we ignore the specified Analyzer and pick one at query 
time
 * GENERAL IMPROVEMENT IDEA:
 ** maybe QParser should extend SolrQueryParserBase/QueryBuilder and 
automatically call some QueryBuilder setter methods based on common local 
params (like "df", "f", "q.op", etc...)
 ** some existing QParsers (like LuceneQParser and ExtendedDismaxQParser) could 
then be refactored to do their query parsing directly (instead of the QParser 
instantiating a custom subclass of SolrQueryParser)
 ** this would potentially simplify a variety of existing QParser subclasses
 ** could also simplify some FieldType.getFoo methods that currently call "new 
FooQuery" – they could instead delegate back to QParser.newFooQuery
 ** if we did this, then the callback mechanism needed for these strawmen ideas 
would be (mostly?) straightforward:
 *** QParser would override QueryBuilder.createFieldQuery(Analyzer,...) to 
delegate to the FieldType's getFieldQuery, passing in a nested/sub-QParser with 
the various method call specific options included as state/params
 *** QParser would also expose a new public method that the FieldType could 
call back to that would ultimately call super.createFieldQuery(Analyzer,...)
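
The shape of that callback can be sketched without any Solr classes. Everything below is a hypothetical stand-in ({{BuilderSketch}} for QueryBuilder, {{ParserSketch}} for the parser, {{FieldTypeSketch}} for FieldType), and queries are plain strings, so this only shows the control flow: the parser overrides the protected method to delegate to the field type, and exposes a public method the field type can call back with whatever field name + analyzer pair it picks.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for the low-level builder; the method is protected, as noted above. */
class BuilderSketch {
    protected String createFieldQuery(String analyzer, String field, String text) {
        return field + ":" + analyzer + "(" + text + ")";
    }
}

/** Stand-in for a FieldType that is handed the parser so it can read params and call back. */
interface FieldTypeSketch {
    String getFieldQuery(ParserSketch parser, String field, String text);
}

/** Stand-in for a parser that routes createFieldQuery through the FieldType. */
final class ParserSketch extends BuilderSketch {
    private final Map<String, FieldTypeSketch> typeByField;
    private final Map<String, String> params;

    ParserSketch(Map<String, FieldTypeSketch> typeByField, Map<String, String> params) {
        this.typeByField = typeByField;
        this.params = params;
    }

    String param(String name) {
        return params.get(name);
    }

    @Override
    protected String createFieldQuery(String analyzer, String field, String text) {
        FieldTypeSketch ft = typeByField.get(field);
        if (ft == null) {
            return super.createFieldQuery(analyzer, field, text); // stock behavior
        }
        return ft.getFieldQuery(this, field, text); // the passed-in analyzer is ignored here
    }

    /** Public callback: reaches the low-level builder with any field + analyzer pair. */
    public String lowLevelFieldQuery(String analyzer, String field, String text) {
        return super.createFieldQuery(analyzer, field, text);
    }

    String parse(String field, String text) {
        return createFieldQuery("default", field, text);
    }
}

/** A lang aware type that picks both the analyzer and the field name at query time. */
final class LangAwareType implements FieldTypeSketch {
    @Override
    public String getFieldQuery(ParserSketch parser, String field, String text) {
        String lang = parser.param("lang");
        if (lang == null) {
            return parser.lowLevelFieldQuery("text_general", field, text);
        }
        return parser.lowLevelFieldQuery("text_" + lang, field + "_" + lang, text);
    }

    public static void main(String[] args) {
        Map<String, FieldTypeSketch> types = new HashMap<>();
        types.put("title", new LangAwareType());
        Map<String, String> params = new HashMap<>();
        params.put("lang", "es");
        System.out.println(new ParserSketch(types, params).parse("title", "hola"));
    }
}
```

Because the callback takes the field name as an argument, the resulting query can target a different field than the one the user typed.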

*Hypothetical (Broken) Alternative:*
 * we could consider eliminating the analyzers "cache" that IndexSchema uses 
(only helpful for non-dynamic fields) and change getQueryAnalyzer to take in a 
QParser that can capture some query/request state so that the FieldType can 
customize the Analyzer behavior
 * then our special field type can delegate to a completely diff FieldType
 * The Problem With This Alternative:
 ** the field *name* that QueryBuilder then uses in the underlying Query 
objects would still be "wrong"
 ** This would _not_ be a problem with the callback approach discussed above, 
because our new FieldType could call back to the QueryBuilder methods w/any 
field name + analyzer pair it wanted.

h4. *S2.1.PROCON*: Pros/Cons of the two Strawmen
 * The *S2.1.STRAW2* approach seems simpler to understand/explain to users
 ** no special/magic field types they have to declare and reference from 
another field
 *** ie: they declare/manage title, title_en, title_es, title_de fields – the 
only thing special is that querying the "title" field can proxy to the others 
as well _when the query requests it_
 *** this means this approach is also automatically compatible with people who 
want to explicitly index multiple fields for each language:
 **** ie: they already have/know an "english title" and a "translated spanish 
title" in the source docs, and don't need any index side (langdetect/copyfield) 
help – our new field type just helps make the query side simple/easy to use.
 ** also plays nicely if we decide to do the SortableTextField described above 
(*S1.1*) and want to extend it here:
 *** the user still has distinct fields for "title", "title_es", etc... and can 
choose to sort on any of them
 **** even in the trivial case, where they only have one original field value 
per doc (which lang detect also copied to title_XX), they probably always want 
to sort on the general "title" field – "keep simple stuff simple"
 ** can be implemented completely independently/orthogonally from all the ideas 
discussed here
 ** the downsides of *S2.1.STRAW2* are:
 *** doesn't give us the "mixed languages in a single field value" phrase query 
benefit (which seemed out of scope? do we have usecases like this we care 
about?)
 *** doesn't "save space" like single field approach when multi languages 
produce same tokens
 **** Although i'm not convinced that's fundamentally true – or even 
beneficial: since any space savings from diff languages producing the same 
underlying "term" text may be offset by potential false positives in phrase 
matches (since we're assuming that even if multiple (guessed) languages may be 
specified at query time, the query string is expected to be in a single 
language and searching across those multiple languages should be done 
independently)

 * the *S2.1.STRAW1* approach seems like it would be harder to explain to 
novice users
 ** ie: the special configuration of referring to (otherwise seemingly unused) 
fieldtypes from special fieldtypes
 *** we've had this in the past with some things like ExternalFileField & 
CurrencyField and it's always confusing
 ** the special prefix decoration of langs at indexing time also means this 
approach either *requires* that users learn about & use the new update processor 
(ie: the features are locked together), or requires some explanation of how 
clients must decorate the field values
 ** which also means that this approach would also not play very nicely with 
people who have pre-translated field values at index time
 *** we could potentially offer a "prepend lang code update processor" to make 
it easy to massage their data for them
 **** ex: {{title_es:"Hola Juan", title_en:"Hello John"}} ==> {{title:["es|Hola 
Juan", "en|Hello John"]}}
 ** if we extend the SortableTextField (*S1.1*) idea...
 *** only the trivial usecases (each logical field is only in one language) 
play nicely with sorting
 *** if a user starts with multiple different "translated" fields – and has to 
consolidate them as multiple field values in a single field (with our 
hypothetical "prepend lang code update processor") then they don't really have 
any way to "sort on the spanish title" with this approach
 **** unless of course they *also* redundantly index every lang variant as its 
own field – but then most of the benefits of this approach are out the window 
(ie: there are no fieldname/configuration/disk savings as compared to the other 
strawman)
 ** The key upside i can think of for *S2.1.STRAW1*:
 *** If we first focus on *S2.2* (see below), then the schema syntax could 
potentially be simplified to remove the "lang -> some other fieldType name" 
mapping and instead use lots of nested analyzers named after each language
 *** this might still be a bit confusing however if people want diff 
index/query(/multiTerm) analyzers for each language ... would have to use some 
sort of rigid naming convention?
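
As a rough illustration of the hypothetical "prepend lang code update processor" mentioned in the *S2.1.STRAW1* cons above, the transformation on a single doc might look like this. {{PrependLangCode}} is made up for this sketch, docs are modeled as plain Maps instead of SolrInputDocument, and the "fieldname_lang" suffix convention is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: folds title_es / title_en style fields into one decorated field. */
final class PrependLangCode {

    /** Moves every "baseField_xx" value into baseField as "xx|value"; other fields pass through. */
    static Map<String, List<String>> consolidate(Map<String, String> doc, String baseField) {
        String prefix = baseField + "_";
        Map<String, List<String>> out = new LinkedHashMap<>();
        List<String> decorated = new ArrayList<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String name = e.getKey();
            if (name.startsWith(prefix) && name.length() > prefix.length()) {
                String lang = name.substring(prefix.length());
                decorated.add(lang + "|" + e.getValue());
            } else {
                out.computeIfAbsent(name, k -> new ArrayList<>()).add(e.getValue());
            }
        }
        if (!decorated.isEmpty()) {
            out.computeIfAbsent(baseField, k -> new ArrayList<>()).addAll(decorated);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title_es", "Hola Juan");
        doc.put("title_en", "Hello John");
        System.out.println(consolidate(doc, "title"));
    }
}
```

Note that the per-language source fields are gone after consolidation, which is exactly why "sort on the spanish title" stops being possible with this approach.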

{panel}
*NOTE:* If either strawman is implemented, we should strongly consider 
including an additional option/subclass of this new "*LangAwareTextField" to 
automatically use the langid plugin code at query time to try and "guess" the 
lang if it isn't specified in a 'lang' local/request param
 * at least for language-detect (latest version), there are special models 
built just for short inputs
 * we could potentially make the code use the guessed lang at query time only 
if above some configured confidence:
 ** or: if explicit 'lang' param, use only that lang – but if the language is 
guessed, query using both the field/analyzer for that specific lang as well as 
the 'default' field/analyzer
{panel}
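One possible reading of the note above, combining the confidence threshold with the "guessed lang + default" fallback, is sketched below. {{QueryLangChooser}} and the "default" marker are hypothetical, and the guess and its confidence are just passed in as arguments (no real langid plugin is called).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of choosing which field/analyzer variants to query. */
final class QueryLangChooser {
    /** Marker for the lang agnostic default field/analyzer. */
    static final String DEFAULT = "default";

    static List<String> choose(String explicitLang, String guessedLang,
                               double confidence, double threshold) {
        if (explicitLang != null && !explicitLang.isEmpty()) {
            // An explicit 'lang' param wins: use only that lang.
            return Collections.singletonList(explicitLang);
        }
        if (guessedLang != null && confidence >= threshold) {
            // A confident guess is still only a guess: query it alongside the default.
            return Arrays.asList(guessedLang, DEFAULT);
        }
        return Collections.singletonList(DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println(choose(null, "es", 0.9, 0.8));
        System.out.println(choose("en", "es", 0.9, 0.8));
        System.out.println(choose(null, "es", 0.3, 0.8));
    }
}
```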
 

 

> A Potential Roadmap for robust multi-analyzer TextFields w/various options 
> for configuring docValues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11917
>                 URL: https://issues.apache.org/jira/browse/SOLR-11917
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter 
> field types" in Solr. In particular to think about:
>  # How to simplify some of the "special things" people have to know about 
> Solr behavior when creating their schemas
>  # How to reduce the number of situations where users have to copy/clone one 
> "logical field" into multiple "schema fields" in order to meet diff use cases
> The main result of this thought exercise is a handful of usecases/goals that 
> people seem to have - many of which are already tracked in existing jiras - 
> along with a high level design/roadmap of potential solutions for these goals 
> that can be implemented incrementally to leverage some common changes (and 
> what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader 
> community discussion, and as a central linkage point for the related jiras. 
> (details to follow in a very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on 
> implementing every aspect of these ideas :)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

