Timothy Potter created SOLR-10096:
-------------------------------------

             Summary: Parallel SQL not returning the correct results for count 
distinct aggregation
                 Key: SOLR-10096
                 URL: https://issues.apache.org/jira/browse/SOLR-10096
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Parallel SQL
    Affects Versions: 6.3
            Reporter: Timothy Potter


While trying to compare the results of this perf exp. blogged here:
https://www.periscopedata.com/blog/use-subqueries-to-count-distinct-50x-faster.html

I tried this with 6.3 and it only returns the dashboard_id's:

{code}
select dashboard_id, count(distinct user_id) as ct from time_on_site_logs group 
by dashboard_id
{code}

Results: (where is ct?)
{code}
{"result-set":{"docs":[
{"dashboard_id":0},
{"dashboard_id":2},
{"dashboard_id":5},
{"dashboard_id":6},
{"dashboard_id":8},
{"dashboard_id":10},
{"dashboard_id":12},
{"dashboard_id":13},
{"dashboard_id":14},
{"dashboard_id":15},
...
{code}

So I dropped the alias for {{count(distinct user_id)}} and got the wrong 
results (i.e. it's not applying the distinct for user_id's):
{code}
{"result-set":{"docs":[
{"count(*)":8288,"dashboard_id":0},
{"count(*)":7392,"dashboard_id":2},
{"count(*)":23800,"dashboard_id":5},
{"count(*)":25032,"dashboard_id":6},
{"count(*)":8960,"dashboard_id":8},
{"count(*)":7840,"dashboard_id":10},
{"count(*)":17192,"dashboard_id":12},
{code}

So I'm guessing this isn't a supported syntax yet?

Also, probably a different issue, but I then tried this query:
{code}
select distinct dashboard_id, user_id from time_on_site_logs
{code}
and it took ~70 secs (m3.xlarge), which is very surprising because this 
streaming expression (which basically does the initial query above):
{code}
select(rollup(facet(time_on_site_logs, q="*:*", buckets="dashboard_id,user_id", 
bucketSizeLimit=2000, bucketSorts="dashboard_id asc",count(*)), 
over="dashboard_id", count(*)), dashboard_id,count(*) as unique_users)
{code}
Returns in ~1.5 secs ... that's a huge difference in performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to