min= 425
max= 103973102
avg= 604706
The average size is that high because about a quarter of the 100,000
documents contain binary content.
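
For reference, a minimal sketch of how figures like these could be gathered.
It measures the JSON-serialized size of each document in bytes, which is an
assumption and not necessarily how the numbers above were produced. The
helper name `doc_size_stats` and the sample documents are made up for
illustration.

```python
import json
from statistics import mean


def doc_size_stats(docs):
    """Return (min, max, avg) of the JSON-serialized size in bytes per document."""
    sizes = [len(json.dumps(doc).encode("utf-8")) for doc in docs]
    if not sizes:
        raise ValueError("no documents given")
    return min(sizes), max(sizes), round(mean(sizes))


# Hypothetical sample: one document with a base64 payload, one without.
sample = [
    {"_type": "Image", "_path": "/a.png", "data": "QUJD" * 10},
    {"_type": "Document", "_path": "/b"},
]

lo, hi, avg = doc_size_stats(sample)
print("min=", lo)
print("max=", hi)
print("avg=", avg)
```

A few documents with a large binary payload are enough to pull the average
far away from the minimum, which is the effect described above.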
On Monday, June 17, 2019 at 6:58:36 PM UTC+2, Wilfried Gösgens wrote:
>
> Hi,
>
> can you get some figures for the avg/min/max document sizes?
>
> On Monday, June 17, 2019 at 4:29:56 PM UTC+2, Andreas Jung wrote:
>>
>> Profile:
>>
>> Query String:
>> for doc in import
>> filter doc._type == 'Image'
>>
>> return {path: doc._path, key: doc._key}
>>
>> Execution plan:
>>  Id  NodeType         Calls  Items  Runtime [s]  Comment
>>   1  SingletonNode        1      1      0.00000  * ROOT
>>   7  IndexNode           21  20617     84.34365    - FOR doc IN import /* hash index scan, projections: `_key`, `_path` */
>>   5  CalculationNode     21  20617      0.05436    - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` } /* simple expression */ /* collections used: doc : import */
>>   6  ReturnNode          21  20617      0.00017    - RETURN #3
>>
>> Indexes used:
>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>
>> Optimization rules applied:
>> Id RuleName
>> 1 move-calculations-up
>> 2 move-filters-up
>> 3 move-calculations-up-2
>> 4 move-filters-up-2
>> 5 use-indexes
>> 6 remove-filter-covered-by-index
>> 7 remove-unnecessary-calculations-2
>> 8 reduce-extraction-to-projection
>>
>> Query Statistics:
>>  Writes Exec  Writes Ign  Scan Full  Scan Index  Filtered  Exec Time [s]
>>            0           0          0       20617         0       84.40501
>>
>> Query Profile:
>> Query Stage Duration [s]
>> initializing 0.00000
>> parsing 0.00020
>> optimizing ast 0.00001
>> loading collections 0.00001
>> instantiating plan 0.00005
>> optimizing plan 0.00021
>> executing 84.40415
>> finalizing 0.00033
>>
>>
>> On Monday, June 17, 2019 at 4:25:51 PM UTC+2, Andreas Jung wrote:
>>>
>>> A compound index on _type + _path or on _type + _path + _key does not
>>> improve things.
>>> The query time is still in the range of 120 to 150 seconds.
>>>
>>> Andreas
>>>
>>> On Friday, June 14, 2019 at 3:01:30 PM UTC+2, Wilfried Gösgens wrote:
>>>>
>>>> May I get back to my suggestion once more?
>>>>
>>>> Could you, instead of the index on `_type`, create a combined index on
>>>> `_type`, `_path` and `_key`?
>>>> This should copy these fields into the index, so ArangoDB doesn't have
>>>> to fetch the (big) documents.
>>>> I guess the cost of fetching and decompressing them is huge.
>>>>
>>>> Another suggestion would be to put the payload (you've got base64
>>>> encoded binary data, right?) into a separate collection, apart from the
>>>> structural information.
>>>>
>>>> Cheers,
>>>> Willi
>>>>
>>>> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>>>>
>>>>> All _path values are unique; we have about 20 different values for
>>>>> _type.
>>>>> I am not sure if I can break down the dataset into something smaller.
>>>>> The data is in general sensitive and not easy to share or anonymize.
>>>>>
>>>>>
>>>>> On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>> Can you share a set of sample documents? What does the distribution
>>>>>> of `_type` look like? Which values are there?
>>>>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>>>>
>>>>>>> Recreating the indexes after import does not make a difference.
>>>>>>>
>>>>>>> Returning doc._type for 20,000 items takes 50 ms; returning
>>>>>>> doc._path takes minutes.
>>>>>>>
>>>>>>> The _path index is deduplicated, the _type index is not
>>>>>>>
>>>>>>> The only difference in the execution plans is "index only" for
>>>>>>> "RETURN doc._type". Since both _type and _path
>>>>>>> are fully indexed, I would assume that the query is executed from
>>>>>>> index data in both cases.
>>>>>>>
>>>>>>> So ArangoDB will load all 100,000 objects just to pick up the value
>>>>>>> of _path? The overall data is now 55 GB
>>>>>>> (about one third of it is binary data: files and images, base64
>>>>>>> encoded).
>>>>>>>
>>>>>>> None of this is a big problem for me, since we run such queries once
>>>>>>> before a migration run, and it does not matter whether a migration
>>>>>>> that runs for some hours takes a few minutes more or less. But I want
>>>>>>> to understand what is going on here (in particular, this is
>>>>>>> unexpected behavior).
>>>>>>>
>>>>>>>
>>>>>>> Query String:
>>>>>>> for doc in import
>>>>>>> filter doc._type == 'Image'
>>>>>>> return doc._type
>>>>>>>
>>>>>>> Execution plan:
>>>>>>>  Id  NodeType         Est.  Comment
>>>>>>>   1  SingletonNode       1  * ROOT
>>>>>>>   7  IndexNode        2214    - FOR doc IN import /* hash index scan, index only, projections: `_type` */
>>>>>>>   5  CalculationNode  2214    - LET #3 = doc.`_type` /* attribute expression */ /* collections used: doc : import */
>>>>>>>   6  ReturnNode       2214    - RETURN #3
>>>>>>>
>>>>>>> Indexes used:
>>>>>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>>>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>>>>>
>>>>>>> Optimization rules applied:
>>>>>>> Id RuleName
>>>>>>> 1 move-calculations-up
>>>>>>> 2 move-filters-up
>>>>>>> 3 move-calculations-up-2
>>>>>>> 4 move-filters-up-2
>>>>>>> 5 use-indexes
>>>>>>> 6 remove-filter-covered-by-index
>>>>>>> 7 remove-unnecessary-calculations-2
>>>>>>> 8 reduce-extraction-to-projection
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Query String:
>>>>>>> for doc in import
>>>>>>> filter doc._type == 'Image'
>>>>>>> return doc._path
>>>>>>>
>>>>>>> Execution plan:
>>>>>>>  Id  NodeType         Est.  Comment
>>>>>>>   1  SingletonNode       1  * ROOT
>>>>>>>   7  IndexNode        2214    - FOR doc IN import /* hash index scan, projections: `_path` */
>>>>>>>   5  CalculationNode  2214    - LET #3 = doc.`_path` /* attribute expression */ /* collections used: doc : import */
>>>>>>>   6  ReturnNode       2214    - RETURN #3
>>>>>>>
>>>>>>> Indexes used:
>>>>>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>>>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>>>>>
>>>>>>> Optimization rules applied:
>>>>>>> Id RuleName
>>>>>>> 1 move-calculations-up
>>>>>>> 2 move-filters-up
>>>>>>> 3 move-calculations-up-2
>>>>>>> 4 move-filters-up-2
>>>>>>> 5 use-indexes
>>>>>>> 6 remove-filter-covered-by-index
>>>>>>> 7 remove-unnecessary-calculations-2
>>>>>>> 8 reduce-extraction-to-projection
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>>>>
>>>>>>>> Using RocksDB (default installation).
>>>>>>>>
>>>>>>>> I create a new collection for every import of the data including
>>>>>>>> the indexes.
>>>>>>>>
>>>>>>>> Unfortunately the key names are not under my control. They come
>>>>>>>> from a JSON dump of a CMS.
>>>>>>>>
>>>>>>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> AFAIR you're using RocksDB?
>>>>>>>>>
>>>>>>>>> Can you try to re-create that index on `_type`, `_path`,
>>>>>>>>> `_key` to make better use of projections?
>>>>>>>>>
>>>>>>>>> Please note that you shouldn't use field names starting with `_`
>>>>>>>>> since they're defined as system-specific fields in ArangoDB.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Willi
>>>>>>>>>
>>>>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>>
>>>>>>>>>> _key is a UUID4.
>>>>>>>>>> _path is a standard filesystem path, no longer than 100 characters.
>>>>>>>>>>
>>>>>>>>>> That cannot be the problem.
>>>>>>>>>>
>>>>>>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James
>>>>>>>>>> Courtier-Dutton wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> What is the average size of the returned data? It could just be
>>>>>>>>>>> the time it takes to serialise the data being returned.
>>>>>>>>>>>
>>>>>>>>>>> James
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi there,
>>>>>>>>>>>>
>>>>>>>>>>>> this query
>>>>>>>>>>>>
>>>>>>>>>>>> for doc in import
>>>>>>>>>>>> filter doc._type == 'Image'
>>>>>>>>>>>> return {path: doc._path, key: doc._key}
>>>>>>>>>>>>
>>>>>>>>>>>> takes about 45 seconds on decent hardware with an import
>>>>>>>>>>>> collection of about 100,000 items, about 21,000 of them with
>>>>>>>>>>>> _type = 'Image'.
>>>>>>>>>>>> There is an index on _type. Using PyArango as client... I really
>>>>>>>>>>>> wonder why this query is running so slowly?!
>>>>>>>>>>>>
>>>>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>>>>
>>>>>>>>>>>> Profile
>>>>>>>>>>>>
>>>>>>>>>>>> Query String:
>>>>>>>>>>>> for doc in import
>>>>>>>>>>>> filter doc._type == 'Image'
>>>>>>>>>>>> return {path: doc._path, key: doc._key}
>>>>>>>>>>>>
>>>>>>>>>>>> Execution plan:
>>>>>>>>>>>>  Id  NodeType         Calls  Items  Runtime [s]  Comment
>>>>>>>>>>>>   1  SingletonNode        1      1      0.00000  * ROOT
>>>>>>>>>>>>   7  IndexNode           21  20617     32.73956    - FOR doc IN import /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>>>>   5  CalculationNode     21  20617      0.04354    - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` } /* simple expression */ /* collections used: doc : import */
>>>>>>>>>>>>   6  ReturnNode          21  20617      0.00016    - RETURN #3
>>>>>>>>>>>>
>>>>>>>>>>>> Indexes used:
>>>>>>>>>>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>>>>>>>>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>>>>>>>>>>
>>>>>>>>>>>> Optimization rules applied:
>>>>>>>>>>>> Id RuleName
>>>>>>>>>>>> 1 move-calculations-up
>>>>>>>>>>>> 2 move-filters-up
>>>>>>>>>>>> 3 move-calculations-up-2
>>>>>>>>>>>> 4 move-filters-up-2
>>>>>>>>>>>> 5 use-indexes
>>>>>>>>>>>> 6 remove-filter-covered-by-index
>>>>>>>>>>>> 7 remove-unnecessary-calculations-2
>>>>>>>>>>>> 8 reduce-extraction-to-projection
>>>>>>>>>>>>
>>>>>>>>>>>> Query Statistics:
>>>>>>>>>>>>  Writes Exec  Writes Ign  Scan Full  Scan Index  Filtered  Exec Time [s]
>>>>>>>>>>>>            0           0          0       20617         0       32.78928
>>>>>>>>>>>>
>>>>>>>>>>>> Query Profile:
>>>>>>>>>>>> Query Stage Duration [s]
>>>>>>>>>>>> initializing 0.00001
>>>>>>>>>>>> parsing 0.00010
>>>>>>>>>>>> optimizing ast 0.00001
>>>>>>>>>>>> loading collections 0.00002
>>>>>>>>>>>> instantiating plan 0.00005
>>>>>>>>>>>> optimizing plan 0.00032
>>>>>>>>>>>> executing 32.78841
>>>>>>>>>>>> finalizing 0.00032
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>> Google Groups "ArangoDB" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>
>>>>>>>>>>>