Hi, 

can you get some figures of Avg/Min/Max document sizes?

On Monday, June 17, 2019 at 4:29:56 PM UTC+2, Andreas Jung wrote:
>
> Profile:
>
> Query String:
>  for doc in import
>     filter doc._type == 'Image'
>     
>     return {path: doc._path, key: doc._key}
>
> Execution plan:
>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>   1   SingletonNode         1       1       0.00000   * ROOT
>   7   IndexNode            21   20617      84.34365     - FOR doc IN 
> import   /* hash index scan, projections: `_key`, `_path` */
>   5   CalculationNode      21   20617       0.05436       - LET #3 = { 
> "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* 
> collections used: doc : import */
>   6   ReturnNode           21   20617       0.00017       - RETURN #3
>
> Indexes used:
>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        
> Ranges
>   7   hash   import       false    false         0.05 %   [ `_type` ]  
>  (doc.`_type` == "Image")
>
> Optimization rules applied:
>  Id   RuleName
>   1   move-calculations-up
>   2   move-filters-up
>   3   move-calculations-up-2
>   4   move-filters-up-2
>   5   use-indexes
>   6   remove-filter-covered-by-index
>   7   remove-unnecessary-calculations-2
>   8   reduce-extraction-to-projection
>
> Query Statistics:
>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time 
> [s]
>            0            0           0        20617          0        
> 84.40501
>
> Query Profile:
>  Query Stage           Duration [s]
>  initializing               0.00000
>  parsing                    0.00020
>  optimizing ast             0.00001
>  loading collections        0.00001
>  instantiating plan         0.00005
>  optimizing plan            0.00021
>  executing                 84.40415
>  finalizing                 0.00033
>
>
> On Monday, June 17, 2019 at 4:25:51 PM UTC+2, Andreas Jung wrote:
>>
>> A compound index on _type+_path or _type +_path + _key does not improve 
>> things.
>> The query time is still in the range of 120 to 150 seconds.
>>
>> Andreas
>>
>> Am Freitag, 14. Juni 2019 15:01:30 UTC+2 schrieb Wilfried Gösgens:
>>>
>>> May I get back to my sugestion once more?
>>>
>>> Could you instead of the index on `_type` create a combined index ot 
>>> `_type`, `_path` and `_key` ? 
>>> This should copy these fields into the index, so ArangoDB doesn't have 
>>> to fetch the (big) documents.
>>> I gues fetching and decompressing them is huge. 
>>>
>>> Another suggestion would be to put the payload (You've got base64 
>>> encoded binary data, right?) into a separate collection, parted of the 
>>> structural information.
>>>
>>> Cheers, 
>>> Willi
>>>
>>> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>>>
>>>> All _path value are unique, we have about 20 different values for _type.
>>>> I am not sure if I can break down the dataset into something smaller.
>>>> The data is in general sensitive and not easy to share or anonymize.
>>>>
>>>>
>>>> Am Freitag, 14. Juni 2019 13:03:59 UTC+2 schrieb Wilfried Gösgens:
>>>>>
>>>>>
>>>>> Hi, 
>>>>> Can you share a set of sample documents? How well is the distribution 
>>>>> on `_type` ? Which samples are there? 
>>>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> Recreating the indexes after import does not make a difference.
>>>>>>
>>>>>> Returning doc._path  for 20.000 items takes 50 ms, returning 
>>>>>> doc._path takes minutes
>>>>>>
>>>>>> The _path index is deduplicated, the _type index is not 
>>>>>>
>>>>>> The only difference in the execution plans is "index only" when 
>>>>>> "RETURN doc._type". Since both _type and _path
>>>>>> are fully indexed I would assume that the query is executed in both 
>>>>>> times based on index data.
>>>>>>
>>>>>> So ArangoDB will load all 100.000 objects for picking up the value of 
>>>>>> _path? The overall data is meanwhile 55 GB
>>>>>> (about one third of the data is binary data (files and images base64 
>>>>>> encoded).  
>>>>>>
>>>>>> This is all no big problem for me since we perform such queries once 
>>>>>> before a migration run and it does matter taking
>>>>>> a migration running for some hours a minutes more or less but I want 
>>>>>> to understand what is going on here (in particular
>>>>>> this is unexpected behavior).
>>>>>>
>>>>>>
>>>>>> Query String:
>>>>>>  for doc in import 
>>>>>>  filter doc._type == 'Image'
>>>>>>  return doc._type
>>>>>>
>>>>>> Execution plan:
>>>>>>  Id   NodeType          Est.   Comment
>>>>>>   1   SingletonNode        1   * ROOT
>>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>>>>> scan, index only, projections: `_type` */
>>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* 
>>>>>> attribute expression */   /* collections used: doc : import */
>>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>>
>>>>>> Indexes used:
>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields      
>>>>>>   Ranges
>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` 
>>>>>> ]   (doc.`_type` == "Image")
>>>>>>
>>>>>> Optimization rules applied:
>>>>>>  Id   RuleName
>>>>>>   1   move-calculations-up
>>>>>>   2   move-filters-up
>>>>>>   3   move-calculations-up-2
>>>>>>   4   move-filters-up-2
>>>>>>   5   use-indexes
>>>>>>   6   remove-filter-covered-by-index
>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>   8   reduce-extraction-to-projection
>>>>>>
>>>>>>
>>>>>>
>>>>>> Query String:
>>>>>>  for doc in import 
>>>>>>  filter doc._type == 'Image'
>>>>>>  return doc._path
>>>>>>
>>>>>> Execution plan:
>>>>>>  Id   NodeType          Est.   Comment
>>>>>>   1   SingletonNode        1   * ROOT
>>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index 
>>>>>> scan, projections: `_path` */
>>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* 
>>>>>> attribute expression */   /* collections used: doc : import */
>>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>>
>>>>>> Indexes used:
>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields      
>>>>>>   Ranges
>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` 
>>>>>> ]   (doc.`_type` == "Image")
>>>>>>
>>>>>> Optimization rules applied:
>>>>>>  Id   RuleName
>>>>>>   1   move-calculations-up
>>>>>>   2   move-filters-up
>>>>>>   3   move-calculations-up-2
>>>>>>   4   move-filters-up-2
>>>>>>   5   use-indexes
>>>>>>   6   remove-filter-covered-by-index
>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>   8   reduce-extraction-to-projection
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>>>
>>>>>>> Using RocksDB (default installation).
>>>>>>>
>>>>>>> I create a new collection for every import of the data including the 
>>>>>>> indexes.
>>>>>>>
>>>>>>> Unfortunately I don't have the key names in my hands. They are coming
>>>>>>> from a JSON dump of a CMS.
>>>>>>>
>>>>>>> Am Freitag, 14. Juni 2019 09:50:41 UTC+2 schrieb Wilfried Gösgens:
>>>>>>>>
>>>>>>>> Hi, 
>>>>>>>> afair you're using rocksdb?
>>>>>>>>
>>>>>>>> can you try to re-create that index to be on `_type`, `_path`, 
>>>>>>>> `_key` for better using of projections?
>>>>>>>>
>>>>>>>> Please note that you shouldn't use fieldnames starting with `_` 
>>>>>>>> since they're defined as system specific fields in arangodb.
>>>>>>>>
>>>>>>>> Cheers, 
>>>>>>>> Willi
>>>>>>>>
>>>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>
>>>>>>>>> _key is a UUID4
>>>>>>>>> _path is standard filesystem path not longer than 100 chars each
>>>>>>>>>
>>>>>>>>> That can not be the problem.
>>>>>>>>>
>>>>>>>>> Am Freitag, 14. Juni 2019 09:36:17 UTC+2 schrieb James 
>>>>>>>>> Courtier-Dutton:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> What is the average size of the returned data? It could just be 
>>>>>>>>>> the time it takes to serialise the data being returned
>>>>>>>>>>
>>>>>>>>>> James
>>>>>>>>>>
>>>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> this query 
>>>>>>>>>>>
>>>>>>>>>>>  for doc in import 
>>>>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>>>>
>>>>>>>>>>> takes about 45 seconds on decent hardware with an import 
>>>>>>>>>>> collection of about 100.000 items with about 21.000 of _type = 
>>>>>>>>>>> 'Image'.
>>>>>>>>>>> There is an index of _type. Using PyArango as client...I really 
>>>>>>>>>>> wander why this query is running so slow?!
>>>>>>>>>>>
>>>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>>>
>>>>>>>>>>> Profile
>>>>>>>>>>>
>>>>>>>>>>> Query String:
>>>>>>>>>>>  for doc in import 
>>>>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>>>>
>>>>>>>>>>> Execution plan:
>>>>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR 
>>>>>>>>>>> doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET 
>>>>>>>>>>> #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple 
>>>>>>>>>>> expression 
>>>>>>>>>>> */   /* collections used: doc : import */
>>>>>>>>>>>   6   ReturnNode           21   20617       0.00016       - 
>>>>>>>>>>> RETURN #3
>>>>>>>>>>>
>>>>>>>>>>> Indexes used:
>>>>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity  
>>>>>>>>>>>  Fields        Ranges
>>>>>>>>>>>   7   hash   import       false    false         0.05 %   [ 
>>>>>>>>>>> `_type` ]   (doc.`_type` == "Image")
>>>>>>>>>>>
>>>>>>>>>>> Optimization rules applied:
>>>>>>>>>>>  Id   RuleName
>>>>>>>>>>>   1   move-calculations-up
>>>>>>>>>>>   2   move-filters-up
>>>>>>>>>>>   3   move-calculations-up-2
>>>>>>>>>>>   4   move-filters-up-2
>>>>>>>>>>>   5   use-indexes
>>>>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>>>>
>>>>>>>>>>> Query Statistics:
>>>>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered  
>>>>>>>>>>>  Exec Time [s]
>>>>>>>>>>>            0            0           0        20617          0    
>>>>>>>>>>>     32.78928
>>>>>>>>>>>
>>>>>>>>>>> Query Profile:
>>>>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>>>>  initializing               0.00001
>>>>>>>>>>>  parsing                    0.00010
>>>>>>>>>>>  optimizing ast             0.00001
>>>>>>>>>>>  loading collections        0.00002
>>>>>>>>>>>  instantiating plan         0.00005
>>>>>>>>>>>  optimizing plan            0.00032
>>>>>>>>>>>  executing                 32.78841
>>>>>>>>>>>  finalizing                 0.00032
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>> Google Groups "ArangoDB" group.
>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>>>  
>>>>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>> .
>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>
>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/arangodb/83ab8ae6-bdab-452c-8985-59b98da0b92a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to