< 1,000 bytes: 41,000 items
< 10,000 bytes: 21,000 items
< 100,000 bytes: 22,000 items
< 1 MB: 9,000 items
< 10 MB: 1,000 items
> 10 MB: 20 items
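
For anyone who wants to produce the same kind of breakdown for their own collection, here is a minimal Python sketch. It assumes the document sizes are already available as a plain list of byte counts; how you obtain them is up to you. The names `size_histogram`, `summary`, `BOUNDS` and `LABELS` are made up for this example, and "1 MB" is taken as 1,000,000 bytes, which is an assumption.

```python
from bisect import bisect_right

# Exclusive upper bounds of the buckets from the table above, in bytes.
# Assumption: "1 MB" means 1,000,000 bytes.
BOUNDS = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
LABELS = ["< 1,000 bytes", "< 10,000 bytes", "< 100,000 bytes",
          "< 1 MB", "< 10 MB", ">= 10 MB"]


def size_histogram(sizes):
    """Count how many document sizes (in bytes) fall into each bucket."""
    counts = [0] * len(LABELS)
    for size in sizes:
        # bisect_right gives the index of the first bound greater than size,
        # so a size exactly equal to a bound lands in the next bucket.
        counts[bisect_right(BOUNDS, size)] += 1
    return dict(zip(LABELS, counts))


def summary(sizes):
    """Return (min, max, avg) of the sizes; avg is rounded down."""
    return min(sizes), max(sizes), sum(sizes) // len(sizes)
```

Calling `size_histogram(sizes)` returns a dict keyed by the bucket labels, and `summary(sizes)` gives the min/max/avg figures asked for further down in the thread.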

On Monday, June 17, 2019 at 7:14:49 PM UTC+2, Andreas Jung wrote:
>
> min = 425 bytes
> max = 103973102 bytes
> avg = 604706 bytes
>
> The avg size is that high because about 1/4 of the 100,000 documents 
> contain binary content.
>
>
> On Monday, June 17, 2019 at 6:58:36 PM UTC+2, Wilfried Gösgens wrote:
>>
>> Hi, 
>>
>> Can you get some figures for the avg/min/max document sizes?
>>
>> On Monday, June 17, 2019 at 4:29:56 PM UTC+2, Andreas Jung wrote:
>>>
>>> Profile:
>>>
>>> Query String:
>>>  for doc in import
>>>     filter doc._type == 'Image'
>>>     
>>>     return {path: doc._path, key: doc._key}
>>>
>>> Execution plan:
>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>   7   IndexNode            21   20617      84.34365     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>   5   CalculationNode      21   20617       0.05436       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>   6   ReturnNode           21   20617       0.00017       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>> Query Statistics:
>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>>            0            0           0        20617          0        84.40501
>>>
>>> Query Profile:
>>>  Query Stage           Duration [s]
>>>  initializing               0.00000
>>>  parsing                    0.00020
>>>  optimizing ast             0.00001
>>>  loading collections        0.00001
>>>  instantiating plan         0.00005
>>>  optimizing plan            0.00021
>>>  executing                 84.40415
>>>  finalizing                 0.00033
>>>
>>>
>>> On Monday, June 17, 2019 at 4:25:51 PM UTC+2, Andreas Jung wrote:
>>>>
>>>> A compound index on _type + _path or on _type + _path + _key does not 
>>>> improve things.
>>>> The query time is still in the range of 120 to 150 seconds.
>>>>
>>>> Andreas
>>>>
>>>> On Friday, June 14, 2019 at 3:01:30 PM UTC+2, Wilfried Gösgens wrote:
>>>>>
>>>>> May I get back to my suggestion once more?
>>>>>
>>>>> Could you, instead of the index on `_type`, create a combined index on 
>>>>> `_type`, `_path` and `_key`? 
>>>>> This should copy these fields into the index, so ArangoDB doesn't have 
>>>>> to fetch the (big) documents.
>>>>> I guess fetching and decompressing them is expensive. 
>>>>>
>>>>> Another suggestion would be to put the payload (you've got base64 
>>>>> encoded binary data, right?) into a separate collection, apart from the 
>>>>> structural information.
>>>>>
>>>>> Cheers, 
>>>>> Willi
>>>>>
>>>>> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> All _path values are unique; we have about 20 different values for 
>>>>>> _type.
>>>>>> I am not sure if I can break down the dataset into something smaller.
>>>>>> The data is in general sensitive and not easy to share or anonymize.
>>>>>>
>>>>>>
>>>>>> On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi, 
>>>>>>> Can you share a set of sample documents? How well distributed are 
>>>>>>> the values of `_type`? Which values are there? 
>>>>>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>>>>>
>>>>>>>> Recreating the indexes after import does not make a difference.
>>>>>>>>
>>>>>>>> Returning doc._type for 20,000 items takes 50 ms, while returning 
>>>>>>>> doc._path takes minutes.
>>>>>>>>
>>>>>>>> The _path index is deduplicated, the _type index is not 
>>>>>>>>
>>>>>>>> The only difference in the execution plans is "index only" for 
>>>>>>>> "RETURN doc._type". Since both _type and _path are fully indexed, 
>>>>>>>> I would assume that the query is executed from index data in both 
>>>>>>>> cases.
>>>>>>>>
>>>>>>>> So ArangoDB will load all 100,000 objects just to pick up the value 
>>>>>>>> of _path? The overall data is meanwhile 55 GB (about one third of 
>>>>>>>> it is binary data: files and images, base64 encoded).
>>>>>>>>
>>>>>>>> This is no big problem for me, since we perform such queries once 
>>>>>>>> before a migration run, and it does not matter whether a migration 
>>>>>>>> that runs for some hours takes a few minutes more or less. But I 
>>>>>>>> want to understand what is going on here, in particular because 
>>>>>>>> this is unexpected behavior.
>>>>>>>>
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>>  for doc in import 
>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>  return doc._type
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id   NodeType          Est.   Comment
>>>>>>>>   1   SingletonNode        1   * ROOT
>>>>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
>>>>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
>>>>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>>  Id   RuleName
>>>>>>>>   1   move-calculations-up
>>>>>>>>   2   move-filters-up
>>>>>>>>   3   move-calculations-up-2
>>>>>>>>   4   move-filters-up-2
>>>>>>>>   5   use-indexes
>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>>  for doc in import 
>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>  return doc._path
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id   NodeType          Est.   Comment
>>>>>>>>   1   SingletonNode        1   * ROOT
>>>>>>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, projections: `_path` */
>>>>>>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
>>>>>>>>   6   ReturnNode        2214       - RETURN #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>>  Id   RuleName
>>>>>>>>   1   move-calculations-up
>>>>>>>>   2   move-filters-up
>>>>>>>>   3   move-calculations-up-2
>>>>>>>>   4   move-filters-up-2
>>>>>>>>   5   use-indexes
>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>
>>>>>>>>> Using RocksDB (default installation).
>>>>>>>>>
>>>>>>>>> I create a new collection for every import of the data including 
>>>>>>>>> the indexes.
>>>>>>>>>
>>>>>>>>> Unfortunately I have no control over the key names. They are 
>>>>>>>>> coming from a JSON dump of a CMS.
>>>>>>>>>
>>>>>>>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, 
>>>>>>>>>> AFAIR you're using RocksDB?
>>>>>>>>>>
>>>>>>>>>> Can you try to re-create that index on `_type`, `_path`, `_key` 
>>>>>>>>>> to make better use of projections?
>>>>>>>>>>
>>>>>>>>>> Please note that you shouldn't use field names starting with `_`, 
>>>>>>>>>> since they're defined as system-specific fields in ArangoDB.
>>>>>>>>>>
>>>>>>>>>> Cheers, 
>>>>>>>>>> Willi
>>>>>>>>>>
>>>>>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>>>
>>>>>>>>>>> _key is a UUID4.
>>>>>>>>>>> _path is a standard filesystem path, no longer than 100 characters.
>>>>>>>>>>>
>>>>>>>>>>> That cannot be the problem.
>>>>>>>>>>>
>>>>>>>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James Courtier-Dutton wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> What is the average size of the returned data? It could just be 
>>>>>>>>>>>> the time it takes to serialise the data being returned
>>>>>>>>>>>>
>>>>>>>>>>>> James
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi there,
>>>>>>>>>>>>>
>>>>>>>>>>>>> this query 
>>>>>>>>>>>>>
>>>>>>>>>>>>>  for doc in import 
>>>>>>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>>>>>>
>>>>>>>>>>>>> takes about 45 seconds on decent hardware with an import 
>>>>>>>>>>>>> collection of about 100,000 items, about 21,000 of them with 
>>>>>>>>>>>>> _type = 'Image'.
>>>>>>>>>>>>> There is an index on _type. Using PyArango as client... I 
>>>>>>>>>>>>> really wonder why this query is running so slowly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>>>>>
>>>>>>>>>>>>> Profile
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query String:
>>>>>>>>>>>>>  for doc in import 
>>>>>>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>>>>>>
>>>>>>>>>>>>> Execution plan:
>>>>>>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>>>>>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN #3
>>>>>>>>>>>>>
>>>>>>>>>>>>> Indexes used:
>>>>>>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>>>>>>>
>>>>>>>>>>>>> Optimization rules applied:
>>>>>>>>>>>>>  Id   RuleName
>>>>>>>>>>>>>   1   move-calculations-up
>>>>>>>>>>>>>   2   move-filters-up
>>>>>>>>>>>>>   3   move-calculations-up-2
>>>>>>>>>>>>>   4   move-filters-up-2
>>>>>>>>>>>>>   5   use-indexes
>>>>>>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query Statistics:
>>>>>>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>>>>>>>>>>>>            0            0           0        20617          0        32.78928
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query Profile:
>>>>>>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>>>>>>  initializing               0.00001
>>>>>>>>>>>>>  parsing                    0.00010
>>>>>>>>>>>>>  optimizing ast             0.00001
>>>>>>>>>>>>>  loading collections        0.00002
>>>>>>>>>>>>>  instantiating plan         0.00005
>>>>>>>>>>>>>  optimizing plan            0.00032
>>>>>>>>>>>>>  executing                 32.78841
>>>>>>>>>>>>>  finalizing                 0.00032
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>> Google Groups "ArangoDB" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>
>>>>>>>>>>>>
