May I get back to my sugestion once more?

Could you instead of the index on `_type` create a combined index ot 
`_type`, `_path` and `_key` ? 
This should copy these fields into the index, so ArangoDB doesn't have to 
fetch the (big) documents.
I gues fetching and decompressing them is huge. 

Another suggestion would be to put the payload (You've got base64 encoded 
binary data, right?) into a separate collection, parted of the structural 
information.

Cheers, 
Willi

On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>
> All _path value are unique, we have about 20 different values for _type.
> I am not sure if I can break down the dataset into something smaller.
> The data is in general sensitive and not easy to share or anonymize.
>
>
> Am Freitag, 14. Juni 2019 13:03:59 UTC+2 schrieb Wilfried Gösgens:
>>
>>
>> Hi, 
>> Can you share a set of sample documents? How well is the distribution on 
>> `_type` ? Which samples are there? 
>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>
>>> Recreating the indexes after import does not make a difference.
>>>
>>> Returning doc._path  for 20.000 items takes 50 ms, returning doc._path 
>>> takes minutes
>>>
>>> The _path index is deduplicated, the _type index is not 
>>>
>>> The only difference in the execution plans is "index only" when "RETURN 
>>> doc._type". Since both _type and _path
>>> are fully indexed I would assume that the query is executed in both 
>>> times based on index data.
>>>
>>> So ArangoDB will load all 100.000 objects for picking up the value of 
>>> _path? The overall data is meanwhile 55 GB
>>> (about one third of the data is binary data (files and images base64 
>>> encoded).  
>>>
>>> This is all no big problem for me since we perform such queries once 
>>> before a migration run and it does matter taking
>>> a migration running for some hours a minutes more or less but I want to 
>>> understand what is going on here (in particular
>>> this is unexpected behavior).
>>>
>>>
>>> Query String:
>>>  for doc in import 
>>>  filter doc._type == 'Image'
>>>  return doc._type
>>>
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>> scan, index only, projections: `_type` */
>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute 
>>> expression */   /* collections used: doc : import */
>>>   6   ReturnNode        2214       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        
>>> Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>>>  (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>>
>>>
>>> Query String:
>>>  for doc in import 
>>>  filter doc._type == 'Image'
>>>  return doc._path
>>>
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>> scan, projections: `_path` */
>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute 
>>> expression */   /* collections used: doc : import */
>>>   6   ReturnNode        2214       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        
>>> Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>>>  (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>>
>>>
>>>
>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>
>>>> Using RocksDB (default installation).
>>>>
>>>> I create a new collection for every import of the data including the 
>>>> indexes.
>>>>
>>>> Unfortunately I don't have the key names in my hands. They are coming
>>>> from a JSON dump of a CMS.
>>>>
>>>> Am Freitag, 14. Juni 2019 09:50:41 UTC+2 schrieb Wilfried Gösgens:
>>>>>
>>>>> Hi, 
>>>>> afair you're using rocksdb?
>>>>>
>>>>> can you try to re-create that index to be on `_type`, `_path`, `_key` 
>>>>> for better using of projections?
>>>>>
>>>>> Please note that you shouldn't use fieldnames starting with `_` since 
>>>>> they're defined as system specific fields in arangodb.
>>>>>
>>>>> Cheers, 
>>>>> Willi
>>>>>
>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> _key is a UUID4
>>>>>> _path is standard filesystem path not longer than 100 chars each
>>>>>>
>>>>>> That can not be the problem.
>>>>>>
>>>>>> Am Freitag, 14. Juni 2019 09:36:17 UTC+2 schrieb James 
>>>>>> Courtier-Dutton:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What is the average size of the returned data? It could just be the 
>>>>>>> time it takes to serialise the data being returned
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> this query 
>>>>>>>>
>>>>>>>>  for doc in import 
>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>
>>>>>>>> takes about 45 seconds on decent hardware with an import collection 
>>>>>>>> of about 100.000 items with about 21.000 of _type = 'Image'.
>>>>>>>> There is an index of _type. Using PyArango as client...I really 
>>>>>>>> wander why this query is running so slow?!
>>>>>>>>
>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>
>>>>>>>> Profile
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>>  for doc in import 
>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc 
>>>>>>>> IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 
>>>>>>>> = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression 
>>>>>>>> */  
>>>>>>>>  /* collections used: doc : import */
>>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN 
>>>>>>>> #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields    
>>>>>>>>     Ranges
>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` 
>>>>>>>> ]   (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>>  Id   RuleName
>>>>>>>>   1   move-calculations-up
>>>>>>>>   2   move-filters-up
>>>>>>>>   3   move-calculations-up-2
>>>>>>>>   4   move-filters-up-2
>>>>>>>>   5   use-indexes
>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>
>>>>>>>> Query Statistics:
>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered  
>>>>>>>>  Exec Time [s]
>>>>>>>>            0            0           0        20617          0      
>>>>>>>>   32.78928
>>>>>>>>
>>>>>>>> Query Profile:
>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>  initializing               0.00001
>>>>>>>>  parsing                    0.00010
>>>>>>>>  optimizing ast             0.00001
>>>>>>>>  loading collections        0.00002
>>>>>>>>  instantiating plan         0.00005
>>>>>>>>  optimizing plan            0.00032
>>>>>>>>  executing                 32.78841
>>>>>>>>  finalizing                 0.00032
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "ArangoDB" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/arangodb/65ebdfca-dfb1-4803-9ef8-c646072ad0d5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to