[
https://issues.apache.org/jira/browse/SOLR-11391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183317#comment-16183317
]
Hoss Man commented on SOLR-11391:
---------------------------------
My review below on the latest patch. In general I'm very confused about
why/how the "join method" (used in the facet domain) seems to be choosen based
on the "facet method" (for the field faceting) ... i don't really understand
why those 2 things should be connected in any way, and none of the comments so
far in this issue seems to suggest that should (or needs to) be any connection
between them -- other then yonik's suggestion that they should use similar
vocabulary (enum, dv, etc...)
I suspect this is a logic mistake in the design, and that the intent was to add
a "method" option on the "join domain" just like a "method" param was added to
the "join parser" ?
In any case, here's all of my feedback as is...
* JoinQParserPlugin
** boolean useGraphCollector
*** seems like an enum would be better given local param isn't just true/false
*** especially if we anticipate that there may be more "methods" down the road
*** if nothing else: it would make the code easier to read & understand
(especially when the helper method gets used by the facet code, see comments
below)
** AFAICT: "method" is now the first param that JoinQParserPlugin supports that
ScoreJoinQParserPlugin does *not* support
*** which means that the parse() and parseJoin() methods need to be refactored
to give a clear error if someone tries to combine the "score" param with a
"method" param. (before this parser delegates to ScoreJoinQParserPlugin
* JoinQuery
** I'm confused about the {{isPointField()}} points code path...
*** "if points, then use GraphPointsCollector" ... even if
{{false==useGraphCollector}} ?
*** if that's inentional, then shouldn't param validation (in
JoinQParserPlugin) reject any explicit {{method=enum}} for points fields?
** executeJoinQuery
*** needs jdocs
*** modifies the {{List<Query> domainFilters}} passed by caller
**** will be problematic if caller is reusing the same {{domainFilters}} for
multiple JoinQueries
**** jdocs (all the way up) should either be very clear about this, or
JoinQuery should make a defensive copy
**** off the top of my head: i have no idea if any of the existing (caller)
code (in FacetProcessor) may be sharing the {{qlist}} across multiple
facets/code-paths
*** {{//TODO: is it okay to cache this? I guess we were previously caching as
well?}}
**** AFAICT from the existing code: No, it was not cached (code uses
{{getDocSetNC}})
**** in current patch, copy/past comment says "don't cache the resulting
docSet" but then calls {{getDocSet()}} ... which does cache. So either comment
needs fixed, or method needs changed.
* SolrIndexSearcher.getCardinalityForDVField
** needs jdocs
** is this really the "cardinality" ???
*** it looks like it's actually the sum of the cardinalities across all
segments (not the same when terms exist in multiple segments)
** perhaps (in general) a better approach then looking at an absolute
cardinality/docs threshold of 1000 / 100,000 would be to look at the aproximate
ratio of field cardinality to docCount?
*** that could be computed per segment, and then aproximated over all segments
**** ie: {{( sum of (seg_cardinality/seg_doccount) for all segments ) /
num_segments}}
*** (obviously this wouldn't make sense as a generic SolrIndexSearcher helper
method, just something that occured to me in the context of reviewing this
method)
* FacetProcessor.handleJoinField
** jdocs need updated to explain return value
** since semantics have changed such that method assumes it won't be called if
{{null==freq.domain.joinField}} method should start with an assert to that
effect (and have jdocs noting it)
* FacetProcessor.chooseCollector
** needs jdocs
** shouldn't this name be something more specific to the fact that it's
choosing the join domain collector? AFAICT it has nothing to do with any other
type of facet processing collector
*** in general, all this code seems like it belongs in the {{JoinField}} class?
**** Note the existing comments in {{createDomainQuery}} -- the same rationale
seems applicable here.
** in general, i'm confused as to the _intent_ of this code ... why are we
choosing a ("method" / "boolean useGraphCollector") for the JoinQuery that will
power the domain based on the "method" used by the FacetField itself? ... those
seem completely orthoginal
*** is this all a mistake? Was the intent to add an optional "method" option
to the "join" domain specified by the request?
*** if this code is intentional (ie: if we really are intentionally choosing
the join collector based on the term FacetMethod) then:
*** this should be extensively explained in the comments
*** the code/comments should more completely handle the various FacetMethod
enum options available
**** Example: code currently returns false immediately for FacetMethod.ENUM,
but not for FacetMethod.STREAM which is (IIUC) functionally equivilent???
** again: using an enum (in JoinQParser) to indicate the type of join collector
(might) make this method a lot easier to read and a lot less brittle
*** (depending on what the goal actually is ... if nothing else it would help
clarify the distinction between when the FacetMethod is relevant vs when the
"JoinMethod" is relevant )
** what are the cases where this {{catch (IOException e)}} block may trigger?
*** Why is that a simple "return false" instead of rethrowing???
** why is the hueristic logic (for choosing hte collector based on cardinality
and/or numDocs) in this method, instead of in a JoinQParserPlugin helper method?
*** wouldn't the same hueristic logic be useful for when a user does a
{{"\{!join\}"}} query w/o a "method" param?
* FacetProcessor.handleBlockJoin
** since semantics have changed such that method assumes it won't be called if
{{!(freq.domain.toChildren || freq.domain.toParent)}} method should start with
an assert to that effect (and have jdocs noting it)
** {{acceptDocs}} initialization could be heavily simplified with terinary op...
*** {{DocSet acceptDocs = (null == qlist) ? fcontext.searcher.getLiveDocs() :
fcontext.searcher.getDocSet(qlist);}}
* GraphEdgeCollector & GraphTermsCollector
** now that these classes are public, they should definitely get some javadocs
* schema12.xml & TestJoin
** rather then adding a new special sys prop just for this test, it seems a lot
more straight forward to add some new field/dynamicField declarations that
explicitly set docValues=true/false
** that would also allow the test methods to be refactored, so that a single
test class run could test both the enum & dv methods against diff fields as
needed (ie: paramaterize the meat of the tests)
*** randomly only testing one each time is far from ideal.
*** the pattern used for points/tries isn't ment to be an example to follow
when dealing with a "choice" in a test -- it was a path of last resort for
retrofiting hundreds of pre-existing tests knowing that one of those "choices"
was going to be immediately deprecated and phased out.
** TestJoin should also have some "test the error code/msg" checks that asking
for a particular method fails with an expected error message for all the
potential error paths.
*** having non-randomized types will make this much easier to add.
* TestCloudJSONFacetJoinDomain
** see comments above regarding my confusion about conflating the TermFacet
"method" with the "method" used by the "join" domain
*** more on this as it pertains to randomized testing below...
** buildRandomFacets
*** in general i don't really understand these changes...
**** it seems like you are randomizing the *facet* method based entirely on the
properties of the random domain .... but it should be possible to randomize the
method param entirely independently (from specifics of the domain) and across
all possible values of the FacetMethod enum
**** the only possible restrictions the "TermFacet.method" randomization logic
(in this test) should place on what method it chooses should be based on the
"TermFacet.field" (not the "JoinDomain.from" field) if and only if we know that
it's an "error" to request a certain type of FacetMethod for a certain
type/properties of the "facet field"
**** if there are in fact some join (domain) code paths that may be invalid
with the "graphCollector?" value chosen automatically by some "facet method"
options (depend on the various field types/properties) then that should be
considered a bug in the "chooseCollector()" code and fixed -- it should not
cause the client to get an error (that would needs to be worked around in this
test by only picking certain "TermFacet.method" values based on if/when a
JoinDomain uses certain types of fields)
*** regardless of intent: in at least some situations, the "TermFacet.method"
variable should be randomized to "null" to test the default hueristics.
** JoinDomain
*** if we are in fact adding a {{"join: \{ method: foo, ... \}"}} type
"JoinDomain.method" option to the code, then it should be accounted for in the
test model here, and randomized.
*** This should be completley independently from any FacetField.method
randomization
** testBespoke
*** please add some simple sanity checks of the new TermFacet "method" option
*** and, perhaps, some testing of a distinct JoinDomain "method" (if that is
being added)
> JoinQParser for non point fields should use the GraphTermsCollector
> --------------------------------------------------------------------
>
> Key: SOLR-11391
> URL: https://issues.apache.org/jira/browse/SOLR-11391
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Attachments: SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch,
> SOLR-11391.patch, SOLR-11391.patch, SOLR-11391.patch
>
>
> The Join Query Parser uses the GraphPointsCollector for point fields.
> For non point fields if we use the GraphTermsCollector instead of the current
> algorithm I am seeing quite a bit of performance gains.
> I'm going to attach a quick patch which I cooked up , making sure TestJoin
> and TestCloudJSONFacetJoinDomain passed.
> More tests, benchmarking and code cleanup to follow
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]