[arangodb-google] Re: Faceted Search Performance

Jan Thu, 14 Sep 2017 08:17:47 -0700

Hi,

One thing that should improve the query performance is getting rid of the 
"INTO" clause, because "INTO" copies all documents found per group into a 
new variable "g":


 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

The query without a group variable, which only counts the members of each 
group via "WITH COUNT INTO", would look like this:

 FOR a in Asset 
  COLLECT value = a.attribute1 WITH COUNT INTO length
 RETURN { value, length }

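It should be faster than the one that collects the documents into "g", but 
I am not sure by how much. This probably depends on the actual data. Note 
that the count is returned under the key "length" here rather than "count".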
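
Applied to the larger query quoted below, each facet subquery could be 
changed in the same way. A sketch for the first one, untested and assuming 
"docs" is defined as in your query; the result key is kept as "count" so 
that the later "SORT a.count" still works:

 LET attribute1 = (
  FOR a IN docs
   COLLECT value = a.attribute1 WITH COUNT INTO length
  RETURN { value, count: length }
 )

The other three facets would follow the same pattern.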

Can you give it a try?
Thanks
Jan




On Thursday, 14 September 2017 15:53:20 UTC+2, Roman Kuzmik wrote:
>
> We are evaluating ArangoDB performance in the area of facet calculations.
> There are a number of other products capable of doing the same, either via a 
> special API or a query language:
>
>    - MarkLogic Facets <https://docs.marklogic.com/jsearch.facet>
>    - ElasticSearch Aggregations 
>    <https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html>
>    - Solr Faceting 
>    <https://cwiki.apache.org/confluence/display/solr/Faceting>
>    - etc
>
> We understand that there is no special API in ArangoDB to calculate facets 
> explicitly.
> But in reality it is not needed: thanks to the comprehensive AQL, it can be 
> easily achieved via a simple query, like:
>
>  FOR a in Asset 
>   COLLECT attr = a.attribute1 INTO g
>  RETURN { value: attr, count: length(g) }
>
> This query calculates a facet on *attribute1* and yields the frequencies in 
> the form of:
>
> [
>   {
>     "value": "test-attr1-1",
>     "count": 2000000
>   },
>   {
>     "value": "test-attr1-2",
>     "count": 2000000
>   },
>   {
>     "value": "test-attr1-3",
>     "count": 3000000
>   }
> ]
>
>
> It says that across my entire collection *attribute1* takes three 
> distinct values (test-attr1-1, test-attr1-2 and test-attr1-3), with the 
> related counts provided.
> Essentially, we run a DISTINCT query and aggregate the counts.
>
> It looks simple and clean, with only one, but really big, issue: 
> *performance*.
>
> The query above runs for *31 seconds* on a test 
> collection with only *8M* documents.
> We have experimented with different index types and storage engines (with 
> and without RocksDB), and investigated the explain plans, to no avail.
> Test documents we use in this test are very concise with only three short 
> attributes.
>
> We would appreciate any input at this point.
> Either we are doing something wrong, or ArangoDB is simply not designed to 
> perform well in this particular area.
>
> By the way, the ultimate goal would be to run something like the following 
> in under a second:
>
> LET docs = (FOR a IN Asset
>   FILTER a.name like 'test-asset-%'
>   SORT a.name
>  RETURN a)
>
> LET attribute1 = (
>  FOR a in docs
>   COLLECT attr = a.attribute1 INTO g
>  RETURN { value: attr, count: length(g[*])}
> )
>
> LET attribute2 = (
>  FOR a in docs
>   COLLECT attr = a.attribute2 INTO g
>  RETURN { value: attr, count: length(g[*])}
> )
>
> LET attribute3 = (
>  FOR a in docs
>   COLLECT attr = a.attribute3 INTO g
>  RETURN { value: attr, count: length(g[*])}
> )
>
> LET attribute4 = (
>  FOR a in docs
>   COLLECT attr = a.attribute4 INTO g
>  RETURN { value: attr, count: length(g[*])}
> )
>
> RETURN {
>   counts: (RETURN {
>     total: LENGTH(docs),
>     offset: 2,
>     to: 4,
>     facets: {
>       attribute1: {
>         from: 0,
>         to: 5,
>         total: LENGTH(attribute1)
>       },
>       attribute2: {
>         from: 5,
>         to: 10,
>         total: LENGTH(attribute2)
>       },
>       attribute3: {
>         from: 0,
>         to: 1000,
>         total: LENGTH(attribute3)
>       },
>       attribute4: {
>         from: 0,
>         to: 1000,
>         total: LENGTH(attribute4)
>       }
>     }
>   }),
>   items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
>   facets: {
>     attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
>     attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
>     attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
>     attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 return a)
>   }
> }
>
> Thanks!
>
>
>
>
