Re: [arangodb-google] Simple query, pretty slow

'Andreas Jung' via ArangoDB Mon, 17 Jun 2019 07:30:16 -0700

Profile:

Query String:
 for doc in import
    filter doc._type == 'Image'
    
    return {path: doc._path, key: doc._key}


Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  7   IndexNode            21   20617      84.34365     - FOR doc IN 
import   /* hash index scan, projections: `_key`, `_path` */
  5   CalculationNode      21   20617       0.05436       - LET #3 = { 
"path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* 
collections used: doc : import */
  6   ReturnNode           21   20617       0.00017       - RETURN #3

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        
Ranges
  7   hash   import       false    false         0.05 %   [ `_type` ]  
 (doc.`_type` == "Image")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   remove-unnecessary-calculations-2
  8   reduce-extraction-to-projection

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time 
[s]
           0            0           0        20617          0        
84.40501

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00020
 optimizing ast             0.00001
 loading collections        0.00001
 instantiating plan         0.00005
 optimizing plan            0.00021
 executing                 84.40415
 finalizing                 0.00033


On Monday, June 17, 2019 at 4:25:51 PM UTC+2, Andreas Jung wrote:
>
> A compound index on _type+_path or _type +_path + _key does not improve 
> things.
> The query time is still in the range of 120 to 150 seconds.
>
> Andreas
>
> Am Freitag, 14. Juni 2019 15:01:30 UTC+2 schrieb Wilfried Gösgens:
>>
>> May I get back to my sugestion once more?
>>
>> Could you instead of the index on `_type` create a combined index ot 
>> `_type`, `_path` and `_key` ? 
>> This should copy these fields into the index, so ArangoDB doesn't have to 
>> fetch the (big) documents.
>> I gues fetching and decompressing them is huge. 
>>
>> Another suggestion would be to put the payload (You've got base64 encoded 
>> binary data, right?) into a separate collection, parted of the structural 
>> information.
>>
>> Cheers, 
>> Willi
>>
>> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>>
>>> All _path value are unique, we have about 20 different values for _type.
>>> I am not sure if I can break down the dataset into something smaller.
>>> The data is in general sensitive and not easy to share or anonymize.
>>>
>>>
>>> Am Freitag, 14. Juni 2019 13:03:59 UTC+2 schrieb Wilfried Gösgens:
>>>>
>>>>
>>>> Hi, 
>>>> Can you share a set of sample documents? How well is the distribution 
>>>> on `_type` ? Which samples are there? 
>>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>>
>>>>> Recreating the indexes after import does not make a difference.
>>>>>
>>>>> Returning doc._path  for 20.000 items takes 50 ms, returning doc._path 
>>>>> takes minutes
>>>>>
>>>>> The _path index is deduplicated, the _type index is not 
>>>>>
>>>>> The only difference in the execution plans is "index only" when 
>>>>> "RETURN doc._type". Since both _type and _path
>>>>> are fully indexed I would assume that the query is executed in both 
>>>>> times based on index data.
>>>>>
>>>>> So ArangoDB will load all 100.000 objects for picking up the value of 
>>>>> _path? The overall data is meanwhile 55 GB
>>>>> (about one third of the data is binary data (files and images base64 
>>>>> encoded).  
>>>>>
>>>>> This is all no big problem for me since we perform such queries once 
>>>>> before a migration run and it does matter taking
>>>>> a migration running for some hours a minutes more or less but I want 
>>>>> to understand what is going on here (in particular
>>>>> this is unexpected behavior).
>>>>>
>>>>>
>>>>> Query String:
>>>>>  for doc in import 
>>>>>  filter doc._type == 'Image'
>>>>>  return doc._type
>>>>>
>>>>> Execution plan:
>>>>>  Id   NodeType          Est.   Comment
>>>>>   1   SingletonNode        1   * ROOT
>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>>>> scan, index only, projections: `_type` */
>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* 
>>>>> attribute expression */   /* collections used: doc : import */
>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>
>>>>> Indexes used:
>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields      
>>>>>   Ranges
>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>>>>>  (doc.`_type` == "Image")
>>>>>
>>>>> Optimization rules applied:
>>>>>  Id   RuleName
>>>>>   1   move-calculations-up
>>>>>   2   move-filters-up
>>>>>   3   move-calculations-up-2
>>>>>   4   move-filters-up-2
>>>>>   5   use-indexes
>>>>>   6   remove-filter-covered-by-index
>>>>>   7   remove-unnecessary-calculations-2
>>>>>   8   reduce-extraction-to-projection
>>>>>
>>>>>
>>>>>
>>>>> Query String:
>>>>>  for doc in import 
>>>>>  filter doc._type == 'Image'
>>>>>  return doc._path
>>>>>
>>>>> Execution plan:
>>>>>  Id   NodeType          Est.   Comment
>>>>>   1   SingletonNode        1   * ROOT
>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>>>> scan, projections: `_path` */
>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* 
>>>>> attribute expression */   /* collections used: doc : import */
>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>
>>>>> Indexes used:
>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields      
>>>>>   Ranges
>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>>>>>  (doc.`_type` == "Image")
>>>>>
>>>>> Optimization rules applied:
>>>>>  Id   RuleName
>>>>>   1   move-calculations-up
>>>>>   2   move-filters-up
>>>>>   3   move-calculations-up-2
>>>>>   4   move-filters-up-2
>>>>>   5   use-indexes
>>>>>   6   remove-filter-covered-by-index
>>>>>   7   remove-unnecessary-calculations-2
>>>>>   8   reduce-extraction-to-projection
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> Using RocksDB (default installation).
>>>>>>
>>>>>> I create a new collection for every import of the data including the 
>>>>>> indexes.
>>>>>>
>>>>>> Unfortunately I don't have the key names in my hands. They are coming
>>>>>> from a JSON dump of a CMS.
>>>>>>
>>>>>> Am Freitag, 14. Juni 2019 09:50:41 UTC+2 schrieb Wilfried Gösgens:
>>>>>>>
>>>>>>> Hi, 
>>>>>>> afair you're using rocksdb?
>>>>>>>
>>>>>>> can you try to re-create that index to be on `_type`, `_path`, 
>>>>>>> `_key` for better using of projections?
>>>>>>>
>>>>>>> Please note that you shouldn't use fieldnames starting with `_` 
>>>>>>> since they're defined as system specific fields in arangodb.
>>>>>>>
>>>>>>> Cheers, 
>>>>>>> Willi
>>>>>>>
>>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>>
>>>>>>>> _key is a UUID4
>>>>>>>> _path is standard filesystem path not longer than 100 chars each
>>>>>>>>
>>>>>>>> That can not be the problem.
>>>>>>>>
>>>>>>>> Am Freitag, 14. Juni 2019 09:36:17 UTC+2 schrieb James 
>>>>>>>> Courtier-Dutton:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> What is the average size of the returned data? It could just be 
>>>>>>>>> the time it takes to serialise the data being returned
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> this query 
>>>>>>>>>>
>>>>>>>>>>  for doc in import 
>>>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>>>
>>>>>>>>>> takes about 45 seconds on decent hardware with an import 
>>>>>>>>>> collection of about 100.000 items with about 21.000 of _type = 
>>>>>>>>>> 'Image'.
>>>>>>>>>> There is an index of _type. Using PyArango as client...I really 
>>>>>>>>>> wander why this query is running so slow?!
>>>>>>>>>>
>>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>>
>>>>>>>>>> Profile
>>>>>>>>>>
>>>>>>>>>> Query String:
>>>>>>>>>>  for doc in import 
>>>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>>>
>>>>>>>>>> Execution plan:
>>>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc 
>>>>>>>>>> IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET 
>>>>>>>>>> #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple 
>>>>>>>>>> expression 
>>>>>>>>>> */   /* collections used: doc : import */
>>>>>>>>>>   6   ReturnNode           21   20617       0.00016       - 
>>>>>>>>>> RETURN #3
>>>>>>>>>>
>>>>>>>>>> Indexes used:
>>>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields  
>>>>>>>>>>       Ranges
>>>>>>>>>>   7   hash   import       false    false         0.05 %   [ 
>>>>>>>>>> `_type` ]   (doc.`_type` == "Image")
>>>>>>>>>>
>>>>>>>>>> Optimization rules applied:
>>>>>>>>>>  Id   RuleName
>>>>>>>>>>   1   move-calculations-up
>>>>>>>>>>   2   move-filters-up
>>>>>>>>>>   3   move-calculations-up-2
>>>>>>>>>>   4   move-filters-up-2
>>>>>>>>>>   5   use-indexes
>>>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>>>
>>>>>>>>>> Query Statistics:
>>>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered  
>>>>>>>>>>  Exec Time [s]
>>>>>>>>>>            0            0           0        20617          0    
>>>>>>>>>>     32.78928
>>>>>>>>>>
>>>>>>>>>> Query Profile:
>>>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>>>  initializing               0.00001
>>>>>>>>>>  parsing                    0.00010
>>>>>>>>>>  optimizing ast             0.00001
>>>>>>>>>>  loading collections        0.00002
>>>>>>>>>>  instantiating plan         0.00005
>>>>>>>>>>  optimizing plan            0.00032
>>>>>>>>>>  executing                 32.78841
>>>>>>>>>>  finalizing                 0.00032
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>> Google Groups "ArangoDB" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>>> send an email to [email protected].
>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>>  
>>>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>
>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/arangodb/06b82589-ab93-406d-8f30-4b4c4f1187ac%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [arangodb-google] Simple query, pretty slow

Reply via email to