A compound index on _type+_path or _type +_path + _key does not improve 
things.
The query time is still in the range of 120 to 150 seconds.

Andreas

Am Freitag, 14. Juni 2019 15:01:30 UTC+2 schrieb Wilfried Gösgens:
>
> May I get back to my sugestion once more?
>
> Could you instead of the index on `_type` create a combined index ot 
> `_type`, `_path` and `_key` ? 
> This should copy these fields into the index, so ArangoDB doesn't have to 
> fetch the (big) documents.
> I gues fetching and decompressing them is huge. 
>
> Another suggestion would be to put the payload (You've got base64 encoded 
> binary data, right?) into a separate collection, parted of the structural 
> information.
>
> Cheers, 
> Willi
>
> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>
>> All _path value are unique, we have about 20 different values for _type.
>> I am not sure if I can break down the dataset into something smaller.
>> The data is in general sensitive and not easy to share or anonymize.
>>
>>
>> Am Freitag, 14. Juni 2019 13:03:59 UTC+2 schrieb Wilfried Gösgens:
>>>
>>>
>>> Hi, 
>>> Can you share a set of sample documents? How well is the distribution on 
>>> `_type` ? Which samples are there? 
>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>
>>>> Recreating the indexes after import does not make a difference.
>>>>
>>>> Returning doc._path  for 20.000 items takes 50 ms, returning doc._path 
>>>> takes minutes
>>>>
>>>> The _path index is deduplicated, the _type index is not 
>>>>
>>>> The only difference in the execution plans is "index only" when "RETURN 
>>>> doc._type". Since both _type and _path
>>>> are fully indexed I would assume that the query is executed in both 
>>>> times based on index data.
>>>>
>>>> So ArangoDB will load all 100.000 objects for picking up the value of 
>>>> _path? The overall data is meanwhile 55 GB
>>>> (about one third of the data is binary data (files and images base64 
>>>> encoded).  
>>>>
>>>> This is all no big problem for me since we perform such queries once 
>>>> before a migration run and it does matter taking
>>>> a migration running for some hours a minutes more or less but I want to 
>>>> understand what is going on here (in particular
>>>> this is unexpected behavior).
>>>>
>>>>
>>>> Query String:
>>>>  for doc in import 
>>>>  filter doc._type == 'Image'
>>>>  return doc._type
>>>>
>>>> Execution plan:
>>>>  Id   NodeType          Est.   Comment
>>>>   1   SingletonNode        1   * ROOT
>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>>> scan, index only, projections: `_type` */
>>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* 
>>>> attribute expression */   /* collections used: doc : import */
>>>>   6   ReturnNode        2214       - RETURN #3
>>>>
>>>> Indexes used:
>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        
>>>> Ranges
>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>>>>  (doc.`_type` == "Image")
>>>>
>>>> Optimization rules applied:
>>>>  Id   RuleName
>>>>   1   move-calculations-up
>>>>   2   move-filters-up
>>>>   3   move-calculations-up-2
>>>>   4   move-filters-up-2
>>>>   5   use-indexes
>>>>   6   remove-filter-covered-by-index
>>>>   7   remove-unnecessary-calculations-2
>>>>   8   reduce-extraction-to-projection
>>>>
>>>>
>>>>
>>>> Query String:
>>>>  for doc in import 
>>>>  filter doc._type == 'Image'
>>>>  return doc._path
>>>>
>>>> Execution plan:
>>>>  Id   NodeType          Est.   Comment
>>>>   1   SingletonNode        1   * ROOT
>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>>> scan, projections: `_path` */
>>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* 
>>>> attribute expression */   /* collections used: doc : import */
>>>>   6   ReturnNode        2214       - RETURN #3
>>>>
>>>> Indexes used:
>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        
>>>> Ranges
>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>>>>  (doc.`_type` == "Image")
>>>>
>>>> Optimization rules applied:
>>>>  Id   RuleName
>>>>   1   move-calculations-up
>>>>   2   move-filters-up
>>>>   3   move-calculations-up-2
>>>>   4   move-filters-up-2
>>>>   5   use-indexes
>>>>   6   remove-filter-covered-by-index
>>>>   7   remove-unnecessary-calculations-2
>>>>   8   reduce-extraction-to-projection
>>>>
>>>>
>>>>
>>>>
>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>
>>>>> Using RocksDB (default installation).
>>>>>
>>>>> I create a new collection for every import of the data including the 
>>>>> indexes.
>>>>>
>>>>> Unfortunately I don't have the key names in my hands. They are coming
>>>>> from a JSON dump of a CMS.
>>>>>
>>>>> Am Freitag, 14. Juni 2019 09:50:41 UTC+2 schrieb Wilfried Gösgens:
>>>>>>
>>>>>> Hi, 
>>>>>> afair you're using rocksdb?
>>>>>>
>>>>>> can you try to re-create that index to be on `_type`, `_path`, `_key` 
>>>>>> for better using of projections?
>>>>>>
>>>>>> Please note that you shouldn't use fieldnames starting with `_` since 
>>>>>> they're defined as system specific fields in arangodb.
>>>>>>
>>>>>> Cheers, 
>>>>>> Willi
>>>>>>
>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>
>>>>>>> _key is a UUID4
>>>>>>> _path is standard filesystem path not longer than 100 chars each
>>>>>>>
>>>>>>> That can not be the problem.
>>>>>>>
>>>>>>> Am Freitag, 14. Juni 2019 09:36:17 UTC+2 schrieb James 
>>>>>>> Courtier-Dutton:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> What is the average size of the returned data? It could just be the 
>>>>>>>> time it takes to serialise the data being returned
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> this query 
>>>>>>>>>
>>>>>>>>>  for doc in import 
>>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>>
>>>>>>>>> takes about 45 seconds on decent hardware with an import 
>>>>>>>>> collection of about 100.000 items with about 21.000 of _type = 
>>>>>>>>> 'Image'.
>>>>>>>>> There is an index of _type. Using PyArango as client...I really 
>>>>>>>>> wander why this query is running so slow?!
>>>>>>>>>
>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>
>>>>>>>>> Profile
>>>>>>>>>
>>>>>>>>> Query String:
>>>>>>>>>  for doc in import 
>>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>>
>>>>>>>>> Execution plan:
>>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc 
>>>>>>>>> IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 
>>>>>>>>> = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression 
>>>>>>>>> */  
>>>>>>>>>  /* collections used: doc : import */
>>>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN 
>>>>>>>>> #3
>>>>>>>>>
>>>>>>>>> Indexes used:
>>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields  
>>>>>>>>>       Ranges
>>>>>>>>>   7   hash   import       false    false         0.05 %   [ 
>>>>>>>>> `_type` ]   (doc.`_type` == "Image")
>>>>>>>>>
>>>>>>>>> Optimization rules applied:
>>>>>>>>>  Id   RuleName
>>>>>>>>>   1   move-calculations-up
>>>>>>>>>   2   move-filters-up
>>>>>>>>>   3   move-calculations-up-2
>>>>>>>>>   4   move-filters-up-2
>>>>>>>>>   5   use-indexes
>>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>>
>>>>>>>>> Query Statistics:
>>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered  
>>>>>>>>>  Exec Time [s]
>>>>>>>>>            0            0           0        20617          0      
>>>>>>>>>   32.78928
>>>>>>>>>
>>>>>>>>> Query Profile:
>>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>>  initializing               0.00001
>>>>>>>>>  parsing                    0.00010
>>>>>>>>>  optimizing ast             0.00001
>>>>>>>>>  loading collections        0.00002
>>>>>>>>>  instantiating plan         0.00005
>>>>>>>>>  optimizing plan            0.00032
>>>>>>>>>  executing                 32.78841
>>>>>>>>>  finalizing                 0.00032
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "ArangoDB" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/arangodb/24c7cc65-765e-402e-a779-f04544b444c0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to