[
https://issues.apache.org/jira/browse/SOLR-10096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852119#comment-15852119
]
Joel Bernstein commented on SOLR-10096:
---------------------------------------
About the performance issue you were seeing. The SQL query is likely in
map_reduce mode which is the default or the SQL interface. If you switch to
aggregationMode=facet you'll likely see comparable performance to expression
you reference.
You can improve your map_reduce throughput by adding workers so it parallelize
the select distinct.
> Parallel SQL not returning the correct results for count distinct aggregation
> -----------------------------------------------------------------------------
>
> Key: SOLR-10096
> URL: https://issues.apache.org/jira/browse/SOLR-10096
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Parallel SQL
> Affects Versions: 6.3
> Reporter: Timothy Potter
>
> While trying to compare the results of this perf exp. blogged here:
> https://www.periscopedata.com/blog/use-subqueries-to-count-distinct-50x-faster.html
> I tried this with 6.3 and it only returns the dashboard_id's:
> {code}
> select dashboard_id, count(distinct user_id) as ct from time_on_site_logs
> group by dashboard_id
> {code}
> Results: (where is ct?)
> {code}
> {"result-set":{"docs":[
> {"dashboard_id":0},
> {"dashboard_id":2},
> {"dashboard_id":5},
> {"dashboard_id":6},
> {"dashboard_id":8},
> {"dashboard_id":10},
> {"dashboard_id":12},
> {"dashboard_id":13},
> {"dashboard_id":14},
> {"dashboard_id":15},
> ...
> {code}
> So I dropped the alias for {{count(distinct user_id)}} and got the wrong
> results (i.e. it's not applying the distinct for user_id's):
> {code}
> {"result-set":{"docs":[
> {"count(*)":8288,"dashboard_id":0},
> {"count(*)":7392,"dashboard_id":2},
> {"count(*)":23800,"dashboard_id":5},
> {"count(*)":25032,"dashboard_id":6},
> {"count(*)":8960,"dashboard_id":8},
> {"count(*)":7840,"dashboard_id":10},
> {"count(*)":17192,"dashboard_id":12},
> {code}
> So I'm guessing this isn't a supported syntax yet?
> Also, probably a different issue, but I then tried this query:
> {code}
> select distinct dashboard_id, user_id from time_on_site_logs
> {code}
> and it took ~70 secs (m3.xlarge), which is very surprising because this
> streaming expression (which basically does the initial query above):
> {code}
> select(rollup(facet(time_on_site_logs, q="*:*",
> buckets="dashboard_id,user_id", bucketSizeLimit=2000,
> bucketSorts="dashboard_id asc",count(*)), over="dashboard_id", count(*)),
> dashboard_id,count(*) as unique_users)
> {code}
> Returns in ~1.5 secs ... that's a huge difference in performance.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]