min= 425
max= 103973102
avg= 604706
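
A minimal sketch of how figures like these could be gathered on the
client side. It assumes the sizes are the byte lengths of the JSON-encoded
documents; the helper name size_stats and the sample documents are made up
for illustration, and the numbers above may have been measured differently.

```python
import json
from statistics import mean


def size_stats(documents):
    """Return (min, max, avg) of the JSON-encoded size in bytes per document.

    Assumption: "size" means the UTF-8 byte length of json.dumps(doc).
    """
    sizes = [len(json.dumps(doc).encode("utf-8")) for doc in documents]
    if not sizes:
        raise ValueError("no documents given")
    return min(sizes), max(sizes), round(mean(sizes))


# Hypothetical sample documents, only to show the call.
sample = [
    {"_type": "Document", "_path": "/a", "text": "x" * 10},
    {"_type": "Image", "_path": "/b", "data": "A" * 1000},
]
smallest, largest, average = size_stats(sample)
print(f"min= {smallest}")
print(f"max= {largest}")
print(f"avg= {average}")
```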

The average size is that high because about a quarter of the 100,000
documents contain binary content.
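
As a rough back-of-the-envelope check of what these sizes could mean for
the query in the quoted profile below, here is a small sketch. It rests on
two assumptions that the thread does not confirm: that the matched Image
documents have the collection-wide average size, and that every matched
document is read in full because the plan does not say "index only".

```python
AVG_DOC_BYTES = 604_706        # avg document size reported above
TOTAL_DOCS = 100_000           # collection size mentioned in the thread
MATCHED_DOCS = 20_617          # items returned by the IndexNode in the profile
INDEX_NODE_SECONDS = 84.34365  # IndexNode runtime in the profile


def estimate_bytes(doc_count, avg_bytes=AVG_DOC_BYTES):
    """Estimate the data volume, assuming every document has the average size."""
    return doc_count * avg_bytes


collection_bytes = estimate_bytes(TOTAL_DOCS)
fetched_bytes = estimate_bytes(MATCHED_DOCS)
# Implied read rate if all matched documents were loaded completely.
throughput = fetched_bytes / INDEX_NODE_SECONDS

print(f"whole collection: {collection_bytes / 1e9:.1f} GB")
print(f"matched documents: {fetched_bytes / 1e9:.1f} GB")
print(f"implied read rate: {throughput / 1e6:.0f} MB/s")
```

If the Image documents are larger than the collection average, which seems
likely given the binary content, the real volume would be higher than this
estimate.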


On Monday, June 17, 2019 at 6:58:36 PM UTC+2, Wilfried Gösgens wrote:
>
> Hi, 
>
> can you get some figures for the avg/min/max document sizes?
>
> On Monday, June 17, 2019 at 4:29:56 PM UTC+2, Andreas Jung wrote:
>>
>> Profile:
>>
>> Query String:
>>  for doc in import
>>     filter doc._type == 'Image'
>>     
>>     return {path: doc._path, key: doc._key}
>>
>> Execution plan:
>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>   1   SingletonNode         1       1       0.00000   * ROOT
>>   7   IndexNode            21   20617      84.34365     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>   5   CalculationNode      21   20617       0.05436       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>   6   ReturnNode           21   20617       0.00017       - RETURN #3
>>
>> Indexes used:
>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>
>> Optimization rules applied:
>>  Id   RuleName
>>   1   move-calculations-up
>>   2   move-filters-up
>>   3   move-calculations-up-2
>>   4   move-filters-up-2
>>   5   use-indexes
>>   6   remove-filter-covered-by-index
>>   7   remove-unnecessary-calculations-2
>>   8   reduce-extraction-to-projection
>>
>> Query Statistics:
>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>            0            0           0        20617          0        84.40501
>>
>> Query Profile:
>>  Query Stage           Duration [s]
>>  initializing               0.00000
>>  parsing                    0.00020
>>  optimizing ast             0.00001
>>  loading collections        0.00001
>>  instantiating plan         0.00005
>>  optimizing plan            0.00021
>>  executing                 84.40415
>>  finalizing                 0.00033
>>
>>
>> On Monday, June 17, 2019 at 4:25:51 PM UTC+2, Andreas Jung wrote:
>>>
>>> A compound index on _type + _path or on _type + _path + _key does not
>>> improve things.
>>> The query time is still in the range of 120 to 150 seconds.
>>>
>>> Andreas
>>>
>>> On Friday, June 14, 2019 at 3:01:30 PM UTC+2, Wilfried Gösgens wrote:
>>>>
>>>> May I get back to my suggestion once more?
>>>>
>>>> Could you, instead of the index on `_type`, create a combined index on
>>>> `_type`, `_path` and `_key`?
>>>> This should copy these fields into the index, so ArangoDB doesn't have
>>>> to fetch the (big) documents.
>>>> I guess the cost of fetching and decompressing them is huge.
>>>>
>>>> Another suggestion would be to put the payload (you've got base64
>>>> encoded binary data, right?) into a separate collection, apart from the
>>>> structural information.
>>>>
>>>> Cheers, 
>>>> Willi
>>>>
>>>> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>>>>
>>>>> All _path values are unique; we have about 20 different values for
>>>>> _type.
>>>>> I am not sure if I can break down the dataset into something smaller.
>>>>> The data is in general sensitive and not easy to share or anonymize.
>>>>>
>>>>>
>>>>> On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>>>>>>
>>>>>>
>>>>>> Hi, 
>>>>>> Can you share a set of sample documents? How are the values of
>>>>>> `_type` distributed? Which values occur?
>>>>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>>>>
>>>>>>> Recreating the indexes after import does not make a difference.
>>>>>>>
>>>>>>> Returning doc._type for 20,000 items takes 50 ms; returning
>>>>>>> doc._path takes minutes.
>>>>>>>
>>>>>>> The _path index is deduplicated, the _type index is not 
>>>>>>>
>>>>>>> The only difference in the execution plans is "index only" for
>>>>>>> "RETURN doc._type". Since both _type and _path
>>>>>>> are fully indexed, I would assume that the query is executed from
>>>>>>> index data in both cases.
>>>>>>>
>>>>>>> So ArangoDB will load all 100,000 objects for picking up the value
>>>>>>> of _path? The overall data is meanwhile 55 GB
>>>>>>> (about one third of the data is binary data: files and images, base64
>>>>>>> encoded).
>>>>>>>
>>>>>>> This is no big problem for me, since we perform such queries once
>>>>>>> before a migration run, and it does not matter whether a migration
>>>>>>> that runs for some hours takes a few minutes more or less. But I want
>>>>>>> to understand what is going on here (in particular, this is
>>>>>>> unexpected behavior).
>>>>>>>
>>>>>>>
>>>>>>> Query String:
>>>>>>>  for doc in import 
>>>>>>>  filter doc._type == 'Image'
>>>>>>>  return doc._type
>>>>>>>
>>>>>>> Execution plan:
>>>>>>>  Id   NodeType          Est.   Comment
>>>>>>>   1   SingletonNode        1   * ROOT
>>>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
>>>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
>>>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>>>
>>>>>>> Indexes used:
>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>
>>>>>>> Optimization rules applied:
>>>>>>>  Id   RuleName
>>>>>>>   1   move-calculations-up
>>>>>>>   2   move-filters-up
>>>>>>>   3   move-calculations-up-2
>>>>>>>   4   move-filters-up-2
>>>>>>>   5   use-indexes
>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Query String:
>>>>>>>  for doc in import 
>>>>>>>  filter doc._type == 'Image'
>>>>>>>  return doc._path
>>>>>>>
>>>>>>> Execution plan:
>>>>>>>  Id   NodeType          Est.   Comment
>>>>>>>   1   SingletonNode        1   * ROOT
>>>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, projections: `_path` */
>>>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
>>>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>>>
>>>>>>> Indexes used:
>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>
>>>>>>> Optimization rules applied:
>>>>>>>  Id   RuleName
>>>>>>>   1   move-calculations-up
>>>>>>>   2   move-filters-up
>>>>>>>   3   move-calculations-up-2
>>>>>>>   4   move-filters-up-2
>>>>>>>   5   use-indexes
>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>>>>
>>>>>>>> Using RocksDB (default installation).
>>>>>>>>
>>>>>>>> I create a new collection for every import of the data including 
>>>>>>>> the indexes.
>>>>>>>>
>>>>>>>> Unfortunately, the key names are not under my control. They
>>>>>>>> come from a JSON dump of a CMS.
>>>>>>>>
>>>>>>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>>>>>>
>>>>>>>>> Hi, 
>>>>>>>>> AFAIR you're using RocksDB?
>>>>>>>>>
>>>>>>>>> Can you try to re-create that index on `_type`, `_path`,
>>>>>>>>> `_key` to make better use of projections?
>>>>>>>>>
>>>>>>>>> Please note that you shouldn't use field names starting with `_`,
>>>>>>>>> since they're reserved for system attributes in ArangoDB.
>>>>>>>>>
>>>>>>>>> Cheers, 
>>>>>>>>> Willi
>>>>>>>>>
>>>>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>>
>>>>>>>>>> _key is a UUID4.
>>>>>>>>>> _path is a standard filesystem path, no longer than 100 chars each.
>>>>>>>>>>
>>>>>>>>>> That can not be the problem.
>>>>>>>>>>
>>>>>>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James Courtier-Dutton wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> What is the average size of the returned data? It could just be 
>>>>>>>>>>> the time it takes to serialise the data being returned
>>>>>>>>>>>
>>>>>>>>>>> James
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi there,
>>>>>>>>>>>>
>>>>>>>>>>>> this query 
>>>>>>>>>>>>
>>>>>>>>>>>>  for doc in import 
>>>>>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>>>>>
>>>>>>>>>>>> takes about 45 seconds on decent hardware with an import
>>>>>>>>>>>> collection of about 100,000 items, of which about 21,000 have
>>>>>>>>>>>> _type = 'Image'.
>>>>>>>>>>>> There is an index on _type. I am using PyArango as the client. I
>>>>>>>>>>>> really wonder why this query is running so slowly.
>>>>>>>>>>>>
>>>>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>>>>
>>>>>>>>>>>> Profile
>>>>>>>>>>>>
>>>>>>>>>>>> Query String:
>>>>>>>>>>>>  for doc in import 
>>>>>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>>>>>
>>>>>>>>>>>> Execution plan:
>>>>>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>>>>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN #3
>>>>>>>>>>>>
>>>>>>>>>>>> Indexes used:
>>>>>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>>>>>>
>>>>>>>>>>>> Optimization rules applied:
>>>>>>>>>>>>  Id   RuleName
>>>>>>>>>>>>   1   move-calculations-up
>>>>>>>>>>>>   2   move-filters-up
>>>>>>>>>>>>   3   move-calculations-up-2
>>>>>>>>>>>>   4   move-filters-up-2
>>>>>>>>>>>>   5   use-indexes
>>>>>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>>>>>
>>>>>>>>>>>> Query Statistics:
>>>>>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>>>>>>>>>>>            0            0           0        20617          0        32.78928
>>>>>>>>>>>>
>>>>>>>>>>>> Query Profile:
>>>>>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>>>>>  initializing               0.00001
>>>>>>>>>>>>  parsing                    0.00010
>>>>>>>>>>>>  optimizing ast             0.00001
>>>>>>>>>>>>  loading collections        0.00002
>>>>>>>>>>>>  instantiating plan         0.00005
>>>>>>>>>>>>  optimizing plan            0.00032
>>>>>>>>>>>>  executing                 32.78841
>>>>>>>>>>>>  finalizing                 0.00032
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>> Google Groups "ArangoDB" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>>>>  
>>>>>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>> .
>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>
>>>>>>>>>>>

