[jira] [Commented] (SOLR-10096) Parallel SQL not returning the correct results for count distinct aggregation

Joel Bernstein (JIRA) Fri, 03 Feb 2017 13:01:08 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-10096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852119#comment-15852119
 ]


Joel Bernstein commented on SOLR-10096:
---------------------------------------

About the performance issue you were seeing. The SQL query is likely in 
map_reduce mode which is the default or the SQL interface. If you switch to 
aggregationMode=facet you'll likely see comparable performance to expression 
you reference.

You can improve your map_reduce throughput by adding workers so it parallelize 
the select distinct.

> Parallel SQL not returning the correct results for count distinct aggregation
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-10096
>                 URL: https://issues.apache.org/jira/browse/SOLR-10096
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Parallel SQL
>    Affects Versions: 6.3
>            Reporter: Timothy Potter
>
> While trying to compare the results of this perf exp. blogged here:
> https://www.periscopedata.com/blog/use-subqueries-to-count-distinct-50x-faster.html
> I tried this with 6.3 and it only returns the dashboard_id's:
> {code}
> select dashboard_id, count(distinct user_id) as ct from time_on_site_logs 
> group by dashboard_id
> {code}
> Results: (where is ct?)
> {code}
> {"result-set":{"docs":[
> {"dashboard_id":0},
> {"dashboard_id":2},
> {"dashboard_id":5},
> {"dashboard_id":6},
> {"dashboard_id":8},
> {"dashboard_id":10},
> {"dashboard_id":12},
> {"dashboard_id":13},
> {"dashboard_id":14},
> {"dashboard_id":15},
> ...
> {code}
> So I dropped the alias for {{count(distinct user_id)}} and got the wrong 
> results (i.e. it's not applying the distinct for user_id's):
> {code}
> {"result-set":{"docs":[
> {"count(*)":8288,"dashboard_id":0},
> {"count(*)":7392,"dashboard_id":2},
> {"count(*)":23800,"dashboard_id":5},
> {"count(*)":25032,"dashboard_id":6},
> {"count(*)":8960,"dashboard_id":8},
> {"count(*)":7840,"dashboard_id":10},
> {"count(*)":17192,"dashboard_id":12},
> {code}
> So I'm guessing this isn't a supported syntax yet?
> Also, probably a different issue, but I then tried this query:
> {code}
> select distinct dashboard_id, user_id from time_on_site_logs
> {code}
> and it took ~70 secs (m3.xlarge), which is very surprising because this 
> streaming expression (which basically does the initial query above):
> {code}
> select(rollup(facet(time_on_site_logs, q="*:*", 
> buckets="dashboard_id,user_id", bucketSizeLimit=2000, 
> bucketSorts="dashboard_id asc",count(*)), over="dashboard_id", count(*)), 
> dashboard_id,count(*) as unique_users)
> {code}
> Returns in ~1.5 secs ... that's a huge difference in performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-10096) Parallel SQL not returning the correct results for count distinct aggregation

Reply via email to