[ 
https://issues.apache.org/jira/browse/SOLR-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160658#comment-14160658
 ] 

Hoss Man commented on SOLR-6351:
--------------------------------


I haven't had a chance to look at the recent patches, and i probably won't 
today, but i wanted to post some quick replies to a few comments/questions...

----

bq. Also reverted the hasValues logic to replace it with checking if current 
pivot has positive count. Although it does produce some stats entries with 
Infinity minimum/maximum and NaN mean. This is what I was asking about before, 
I think I misunderstood the answer, but it still seems error-prone to have such 
entries...

You may be right ... lemme talk through my thinking on this and see where we 
wind up:

Forgetting about pivots for a moment, just think about the idea of computing 
some arbitrary stat over a set of documents.  ie: you've got N documents that 
match some arbitrary query, and then you want to compute stats on the "price" 
and "popularity" fields .. what should happen if none of those documents has a 
"popularity" field at all?

my thinking was that:
* the behavior when hanging stats off pivots should mirror that of regular stats
* if you ask for a stats block, you should always get that block, so the 
client doesn't have to conditionally check if it's there before looking at the 
values.
* the included stat values matter even if no doc has the stats.field, because 
one of the stats is in fact "missing", and if you ask for stats you should 
be able to look at that missing count. (and it should match up with your doc 
set size if the field is completely missing, etc...)

but looking at an example of this now, i see that for simple field stats (w/o 
pivots), that's not even how it currently works -- consider this URL using the 
example data...

http://localhost:8983/solr/select?rows=0&q=name:utf&stats.field=foo_d&stats.field=popularity&stats=true

* foo_d doesn't exist in the index.
* popularity does exist in the index, but the one doc matching the query 
doesn't have a value in that field.

I thought (and still think) that the "correct" behavior for this query would be 
to get a stats block back for those fields where things like min/max/mean are 
"null", count==0, and missing=1 ... but that's not how it currently works.
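
To make that "correct" behavior concrete, here is a rough sketch in Python. It is purely illustrative and is not Solr code: `compute_stats` and the shape of the returned dict are hypothetical, and documents are modeled as plain dicts. The point is only that the block is always returned, with null-ish min/max/mean rather than Infinity/NaN, plus count and missing.

```python
from typing import Any, Dict, List, Optional


def compute_stats(docs: List[Dict[str, Any]], field: str) -> Dict[str, Optional[float]]:
    """Hypothetical helper: always return a stats block for `field`,
    even when no document in `docs` has a value for it."""
    values = [d[field] for d in docs if d.get(field) is not None]
    count = len(values)
    return {
        # None stands in for "null" when there is nothing to aggregate
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "mean": (sum(values) / count) if values else None,
        "count": count,
        # every doc in the set without a value counts as missing
        "missing": len(docs) - count,
    }


# one matching doc that has no "popularity" value (and no "foo_d" at all)
docs = [{"name": "utf"}]
popularity_stats = compute_stats(docs, "popularity")
foo_stats = compute_stats(docs, "foo_d")
```

Under these assumptions both fields get a block with count 0 and missing 1, so a client never has to check whether the block exists before reading it.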

so i guess the question is: is the current general stats behavior a "bug" that 
should be fixed, or is this the "correct" way to deal with stats when none of 
the documents have a value (and thus: the behavior of your "hasValue" logic was 
correct) ?

i'm leaning towards "bug" ... but i'd like to think about it more and hear how 
others feel..

----

bq. One more question around this, which applies for SOLR-6353 and SOLR-4212 as 
well. Should we have a syntax to apply stats/queries/ranges only at specific 
levels in the pivot hierarchy? It would reduce amount of computation and size 
of response for cases where you only need it at a specific level (usually last 
level I guess).

That's a great question ... honestly it's not something i ever really thought 
about.  

One quick thing i will point out: the size of the response shouldn't really be 
a huge factor in our decisions here, because with SOLR-6349 (which i'll 
hopefully have a patch for in the next day or so) the response will only need 
to include the stats people actually care about, and ask for, so the typical 
result size should be much smaller.

But you've got a good point about amount of computation done/returned at levels 
that people may not care about ... in my head, it seemed to make sense that the 
stats (and ranges, etc...) should be computed at every level just like the 
pivot count size -- but at the intermediate levels that count is a "free" 
computation based on the size of the subset, and i suspect you are correct: 
many people may only care about having these new stats/ranges/queries on the 
leaves in the common case.

I'm not really following your suggested syntax though ... you seem to be saying 
that in the "stats" local param, commas would be used to delimit "levels" of 
the pivot (corresponding to the commas in the list of pivot fields) but then 
i'm not really clear what you mean about using "\*" (if that means all levels, 
how do you know what tag name to use?)

in the original examples i proposed, i was thinking that a comma separated list 
could refer to multiple tag names, similar to how the "exclusions" work -- ie...

{noformat}
facet.pivot={!stats=prices,ratings}category,manufacturer
facet.pivot={!stats=prices,pop}reseller
stats.field={!key=avg_list_price tag=prices mean=true}list_price
stats.field={!tag=ratings min=true max=true}user_rating
stats.field={!tag=ratings min=true max=true}editors_rating
stats.field={!tag=prices min=true max=true}sale_price
stats.field={!tag=pop}weekly_tweets
stats.field={!tag=pop}weekly_page_views
{noformat}

...would result in the "category,manufacturer" pivot having stats on 
"avg_list_price, sale_price, user_rating, & editors_rating" while the 
"reseller" pivot would have stats on "avg_list_price, sale_price, 
weekly_tweets, & weekly_page_views"
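
The tag resolution described above can be sketched in Python. This is an illustration only, not the PivotFacet implementation: `parse_param` and `stats_for_pivots` are hypothetical names, and the local-param parsing is deliberately naive (no quoting or escaping, unlike Solr's real parser).

```python
import re
from typing import Dict, List, Tuple

# naive "{!k=v k2=v2}body" matcher; assumes no quoting or escaping
LOCAL_PARAMS = re.compile(r"^\{!(?P<params>[^}]*)\}(?P<body>.*)$")


def parse_param(value: str) -> Tuple[Dict[str, str], str]:
    """Split a param value into (local params, body)."""
    m = LOCAL_PARAMS.match(value)
    if not m:
        return {}, value
    params = dict(p.split("=", 1) for p in m.group("params").split())
    return params, m.group("body")


def stats_for_pivots(pivots: List[str], stats_fields: List[str]) -> Dict[str, List[str]]:
    """Map each pivot body to the stats keys whose tag it references."""
    tagged: Dict[str, List[str]] = {}  # tag name -> stats keys, in request order
    for sf in stats_fields:
        params, field = parse_param(sf)
        if "tag" in params:
            tagged.setdefault(params["tag"], []).append(params.get("key", field))
    result: Dict[str, List[str]] = {}
    for pivot in pivots:
        params, body = parse_param(pivot)
        keys: List[str] = []
        # the pivot's "stats" local param is a comma separated list of tag names
        for tag in filter(None, params.get("stats", "").split(",")):
            keys.extend(k for k in tagged.get(tag, []) if k not in keys)
        result[body] = keys
    return result


pivots = [
    "{!stats=prices,ratings}category,manufacturer",
    "{!stats=prices,pop}reseller",
]
stats_fields = [
    "{!key=avg_list_price tag=prices mean=true}list_price",
    "{!tag=ratings min=true max=true}user_rating",
    "{!tag=ratings min=true max=true}editors_rating",
    "{!tag=prices min=true max=true}sale_price",
    "{!tag=pop}weekly_tweets",
    "{!tag=pop}weekly_page_views",
]
result = stats_for_pivots(pivots, stats_fields)
```

With the params from the example, `result` maps each pivot to the same stats listed in the paragraph above.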

Thinking about it now though, if we support multiple tag names on stats.field, 
the same thing could be supported like this...

{noformat}
facet.pivot={!stats=cm_s}category,manufacturer
facet.pivot={!stats=r_s}reseller
stats.field={!key=avg_list_price tag=cm_s,r_s mean=true}list_price
stats.field={!tag=cm_s min=true max=true}user_rating
stats.field={!tag=cm_s min=true max=true}editors_rating
stats.field={!tag=cm_s,r_s min=true max=true}sale_price
stats.field={!tag=r_s}weekly_tweets
stats.field={!tag=r_s}weekly_page_views
{noformat}
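
A small Python sketch of how that variant would resolve, assuming multiple tag names on stats.field were supported (they are only a "what if" in this comment). The `(key, tags)` pairs are hand-transcribed from the params above, and `stats_for_pivot` is a hypothetical name: a stats.field attaches to a pivot when their tag sets overlap.

```python
from typing import List, Set, Tuple

# (stats key, set of tags), hand-transcribed from the stats.field params
stats_fields: List[Tuple[str, Set[str]]] = [
    ("avg_list_price", {"cm_s", "r_s"}),
    ("user_rating", {"cm_s"}),
    ("editors_rating", {"cm_s"}),
    ("sale_price", {"cm_s", "r_s"}),
    ("weekly_tweets", {"r_s"}),
    ("weekly_page_views", {"r_s"}),
]

# pivot body -> tags named in its "stats" local param
pivots = {
    "category,manufacturer": {"cm_s"},
    "reseller": {"r_s"},
}


def stats_for_pivot(pivot_tags: Set[str], fields: List[Tuple[str, Set[str]]]) -> List[str]:
    """Keep every stats key that shares at least one tag with the pivot."""
    return [key for key, tags in fields if tags & pivot_tags]


resolved = {body: stats_for_pivot(tags, stats_fields) for body, tags in pivots.items()}
```

Each pivot ends up with the same set of stats as in the earlier "prices,ratings" / "prices,pop" version, which is what frees up the comma in the pivot's stats local param for some other meaning.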

So ... if we did that, then we could start using "position" info in a comma 
separated list of tag names to refer to where in the pivot "depth" those 
stats/ranges/queries should be computed ... the question i have is "should we" 
? .. in the context of a facet.pivot param, will it be obvious to folks that 
there is a mapping between the commas in these local params and the commas in 
the body of the facet.pivot param, or will it confuse people who are used to 
seeing comma as just a way of delimiting multiple values in tag/ex params?

my opinion: no freaking clue at the moment ... need to let it soak in my brain.


> Let Stats Hang off of Pivots (via 'tag')
> ----------------------------------------
>
>                 Key: SOLR-6351
>                 URL: https://issues.apache.org/jira/browse/SOLR-6351
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6351.patch, SOLR-6351.patch, SOLR-6351.patch, 
> SOLR-6351.patch, SOLR-6351.patch, SOLR-6351.patch
>
>
> The goal here is basically to flip the notion of "stats.facet" on its head, so 
> that instead of asking the stats component to also do some faceting 
> (something that's never worked well with the variety of field types and has 
> never worked in distributed mode) we instead ask the PivotFacet code to 
> compute some stats X for each leaf in a pivot.  We'll do this with the 
> existing {{stats.field}} params, but we'll leverage the {{tag}} local param 
> of the {{stats.field}} instances to be able to associate which stats we want 
> hanging off of which {{facet.pivot}}
> Example...
> {noformat}
> facet.pivot={!stats=s1}category,manufacturer
> stats.field={!key=avg_price tag=s1 mean=true}price
> stats.field={!tag=s1 min=true max=true}user_rating
> {noformat}
> ...with the request above, in addition to computing the min/max user_rating 
> and mean price (labeled "avg_price") over the entire result set, the 
> PivotFacet component will also include those stats for every node of the tree 
> it builds up when generating a pivot of the fields "category,manufacturer"



