< 1000 bytes: 41.000 items
< 10.000 bytes: 21.000 items
< 100.000 bytes: 22.00 items
< 1 MB: 9000 items
< 10 MB: 1000 items
> 10 MB: 20 items
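
For reference, a breakdown like the one above could be computed client-side with a small sketch along these lines. It assumes the byte size of every document is already available as a list of integers; the names size_histogram and size_summary, and the reading of 1 MB as 1,000,000 bytes, are illustrative assumptions and not something stated in this thread.

```python
from bisect import bisect_right

# Exclusive upper bounds of the size buckets, in bytes.
# Assumption: 1 MB is taken as 1,000,000 bytes; the thread does not say.
BOUNDS = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
LABELS = ["< 1000 bytes", "< 10.000 bytes", "< 100.000 bytes",
          "< 1 MB", "< 10 MB", "> 10 MB"]


def size_histogram(sizes):
    """Count how many document sizes fall into each bucket."""
    counts = dict.fromkeys(LABELS, 0)
    for size in sizes:
        # bisect_right returns the index of the first bound greater than size,
        # which is the index of the bucket the size belongs to. A size of
        # exactly 10 MB therefore lands in the last bucket.
        counts[LABELS[bisect_right(BOUNDS, size)]] += 1
    return counts


def size_summary(sizes):
    """Return (min, max, avg), with the average rounded down to whole bytes."""
    return min(sizes), max(sizes), sum(sizes) // len(sizes)


if __name__ == "__main__":
    # Made-up sample sizes for illustration only, not data from the thread.
    sample = [425, 1_000, 50_000, 2_000_000, 20_000_000]
    print(size_histogram(sample))
    print(size_summary(sample))
```

A handful of very large documents is enough to pull the average far above the typical size, which is why the bucket counts are more informative here than the avg figure alone.
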
On Monday, June 17, 2019 at 7:14:49 PM UTC+2, Andreas Jung wrote:
>
> min= 425
> max= 103973102
> avg= 604706
>
> The avg size is that high because about 1/4 of the 100.000 documents
> contain binary content.
>
>
> On Monday, June 17, 2019 at 6:58:36 PM UTC+2, Wilfried Gösgens wrote:
>>
>> Hi,
>>
>> Can you get some figures for the avg/min/max document sizes?
>>
>> On Monday, June 17, 2019 at 4:29:56 PM UTC+2, Andreas Jung wrote:
>>>
>>> Profile:
>>>
>>> Query String:
>>> for doc in import
>>> filter doc._type == 'Image'
>>>
>>> return {path: doc._path, key: doc._key}
>>>
>>> Execution plan:
>>>  Id  NodeType         Calls  Items  Runtime [s]  Comment
>>>   1  SingletonNode        1      1      0.00000  * ROOT
>>>   7  IndexNode           21  20617     84.34365    - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>   5  CalculationNode     21  20617      0.05436      - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>   6  ReturnNode          21  20617      0.00017      - RETURN #3
>>>
>>> Indexes used:
>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>> Id RuleName
>>> 1 move-calculations-up
>>> 2 move-filters-up
>>> 3 move-calculations-up-2
>>> 4 move-filters-up-2
>>> 5 use-indexes
>>> 6 remove-filter-covered-by-index
>>> 7 remove-unnecessary-calculations-2
>>> 8 reduce-extraction-to-projection
>>>
>>> Query Statistics:
>>>  Writes Exec  Writes Ign  Scan Full  Scan Index  Filtered  Exec Time [s]
>>>            0           0          0       20617         0       84.40501
>>>
>>> Query Profile:
>>> Query Stage Duration [s]
>>> initializing 0.00000
>>> parsing 0.00020
>>> optimizing ast 0.00001
>>> loading collections 0.00001
>>> instantiating plan 0.00005
>>> optimizing plan 0.00021
>>> executing 84.40415
>>> finalizing 0.00033
>>>
>>>
>>> On Monday, June 17, 2019 at 4:25:51 PM UTC+2, Andreas Jung wrote:
>>>>
>>>> A compound index on _type + _path or on _type + _path + _key does not
>>>> improve things.
>>>> The query time is still in the range of 120 to 150 seconds.
>>>>
>>>> Andreas
>>>>
>>>> On Friday, June 14, 2019 at 3:01:30 PM UTC+2, Wilfried Gösgens wrote:
>>>>>
>>>>> May I get back to my suggestion once more?
>>>>>
>>>>> Could you, instead of the index on `_type`, create a combined index on
>>>>> `_type`, `_path` and `_key`?
>>>>> This should copy these fields into the index, so ArangoDB doesn't have
>>>>> to fetch the (big) documents.
>>>>> I guess the cost of fetching and decompressing them is huge.
>>>>>
>>>>> Another suggestion would be to put the payload (you've got base64-encoded
>>>>> binary data, right?) into a separate collection, apart from the
>>>>> structural information.
>>>>>
>>>>> Cheers,
>>>>> Willi
>>>>>
>>>>> On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> All _path values are unique; we have about 20 different values for
>>>>>> _type.
>>>>>> I am not sure if I can break down the dataset into something smaller.
>>>>>> The data is in general sensitive and not easy to share or anonymize.
>>>>>>
>>>>>>
>>>>>> On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>> Can you share a set of sample documents? How are the values of
>>>>>>> `_type` distributed? Which values are there?
>>>>>>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>>>>>>
>>>>>>>> Recreating the indexes after import does not make a difference.
>>>>>>>>
>>>>>>>> Returning doc._type for 20.000 items takes 50 ms, returning
>>>>>>>> doc._path takes minutes.
>>>>>>>>
>>>>>>>> The _path index is deduplicated, the _type index is not
>>>>>>>>
>>>>>>>> The only difference between the execution plans is "index only" for
>>>>>>>> "RETURN doc._type". Since both _type and _path
>>>>>>>> are fully indexed, I would assume that the query is executed based
>>>>>>>> on index data in both cases.
>>>>>>>>
>>>>>>>> So ArangoDB will load all 100.000 objects just to pick up the value
>>>>>>>> of _path? The overall data is meanwhile 55 GB
>>>>>>>> (about one third of the data is binary data: files and images,
>>>>>>>> base64 encoded).
>>>>>>>>
>>>>>>>> None of this is a big problem for me, since we perform such queries
>>>>>>>> once before a migration run, and it does not matter whether
>>>>>>>> a migration that runs for some hours takes a few minutes more or
>>>>>>>> less. But I want to understand what is going on here (in particular,
>>>>>>>> this is unexpected behavior).
>>>>>>>>
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>> for doc in import
>>>>>>>> filter doc._type == 'Image'
>>>>>>>> return doc._type
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id  NodeType         Est.  Comment
>>>>>>>>   1  SingletonNode       1  * ROOT
>>>>>>>>   7  IndexNode        2214    - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
>>>>>>>>   5  CalculationNode  2214      - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
>>>>>>>>   6  ReturnNode       2214      - RETURN #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>>>>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>> Id RuleName
>>>>>>>> 1 move-calculations-up
>>>>>>>> 2 move-filters-up
>>>>>>>> 3 move-calculations-up-2
>>>>>>>> 4 move-filters-up-2
>>>>>>>> 5 use-indexes
>>>>>>>> 6 remove-filter-covered-by-index
>>>>>>>> 7 remove-unnecessary-calculations-2
>>>>>>>> 8 reduce-extraction-to-projection
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>> for doc in import
>>>>>>>> filter doc._type == 'Image'
>>>>>>>> return doc._path
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id  NodeType         Est.  Comment
>>>>>>>>   1  SingletonNode       1  * ROOT
>>>>>>>>   7  IndexNode        2214    - FOR doc IN import   /* hash index scan, projections: `_path` */
>>>>>>>>   5  CalculationNode  2214      - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
>>>>>>>>   6  ReturnNode       2214      - RETURN #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>>>>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>> Id RuleName
>>>>>>>> 1 move-calculations-up
>>>>>>>> 2 move-filters-up
>>>>>>>> 3 move-calculations-up-2
>>>>>>>> 4 move-filters-up-2
>>>>>>>> 5 use-indexes
>>>>>>>> 6 remove-filter-covered-by-index
>>>>>>>> 7 remove-unnecessary-calculations-2
>>>>>>>> 8 reduce-extraction-to-projection
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>
>>>>>>>>> Using RocksDB (default installation).
>>>>>>>>>
>>>>>>>>> I create a new collection for every import of the data including
>>>>>>>>> the indexes.
>>>>>>>>>
>>>>>>>>> Unfortunately, the key names are not under my control. They come
>>>>>>>>> from a JSON dump of a CMS.
>>>>>>>>>
>>>>>>>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> AFAIR you're using RocksDB?
>>>>>>>>>>
>>>>>>>>>> Can you try to re-create that index on `_type`, `_path`,
>>>>>>>>>> `_key` to make better use of projections?
>>>>>>>>>>
>>>>>>>>>> Please note that you shouldn't use field names starting with `_`
>>>>>>>>>> since they're defined as system-specific fields in ArangoDB.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Willi
>>>>>>>>>>
>>>>>>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>>>>>>
>>>>>>>>>>> _key is a UUID4
>>>>>>>>>>> _path is a standard filesystem path, no longer than 100 chars each
>>>>>>>>>>>
>>>>>>>>>>> That cannot be the problem.
>>>>>>>>>>>
>>>>>>>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James
>>>>>>>>>>> Courtier-Dutton wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> What is the average size of the returned data? It could just be
>>>>>>>>>>>> the time it takes to serialise the data being returned
>>>>>>>>>>>>
>>>>>>>>>>>> James
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi there,
>>>>>>>>>>>>>
>>>>>>>>>>>>> this query
>>>>>>>>>>>>>
>>>>>>>>>>>>> for doc in import
>>>>>>>>>>>>> filter doc._type == 'Image'
>>>>>>>>>>>>> return {path: doc._path, key: doc._key}
>>>>>>>>>>>>>
>>>>>>>>>>>>> takes about 45 seconds on decent hardware with an import
>>>>>>>>>>>>> collection of about 100.000 items, about 21.000 of them with
>>>>>>>>>>>>> _type = 'Image'. There is an index on _type. I am using PyArango
>>>>>>>>>>>>> as the client. I really wonder why this query is running so slowly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>>>>>>
>>>>>>>>>>>>> Profile
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query String:
>>>>>>>>>>>>> for doc in import
>>>>>>>>>>>>> filter doc._type == 'Image'
>>>>>>>>>>>>> return {path: doc._path, key: doc._key}
>>>>>>>>>>>>>
>>>>>>>>>>>>> Execution plan:
>>>>>>>>>>>>>  Id  NodeType         Calls  Items  Runtime [s]  Comment
>>>>>>>>>>>>>   1  SingletonNode        1      1      0.00000  * ROOT
>>>>>>>>>>>>>   7  IndexNode           21  20617     32.73956    - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>>>>>>   5  CalculationNode     21  20617      0.04354      - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>>>>>>>>>>>   6  ReturnNode          21  20617      0.00016      - RETURN #3
>>>>>>>>>>>>>
>>>>>>>>>>>>> Indexes used:
>>>>>>>>>>>>>  By  Type  Collection  Unique  Sparse  Selectivity  Fields       Ranges
>>>>>>>>>>>>>   7  hash  import      false   false        0.05 %  [ `_type` ]  (doc.`_type` == "Image")
>>>>>>>>>>>>>
>>>>>>>>>>>>> Optimization rules applied:
>>>>>>>>>>>>> Id RuleName
>>>>>>>>>>>>> 1 move-calculations-up
>>>>>>>>>>>>> 2 move-filters-up
>>>>>>>>>>>>> 3 move-calculations-up-2
>>>>>>>>>>>>> 4 move-filters-up-2
>>>>>>>>>>>>> 5 use-indexes
>>>>>>>>>>>>> 6 remove-filter-covered-by-index
>>>>>>>>>>>>> 7 remove-unnecessary-calculations-2
>>>>>>>>>>>>> 8 reduce-extraction-to-projection
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query Statistics:
>>>>>>>>>>>>>  Writes Exec  Writes Ign  Scan Full  Scan Index  Filtered  Exec Time [s]
>>>>>>>>>>>>>            0           0          0       20617         0       32.78928
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query Profile:
>>>>>>>>>>>>> Query Stage Duration [s]
>>>>>>>>>>>>> initializing 0.00001
>>>>>>>>>>>>> parsing 0.00010
>>>>>>>>>>>>> optimizing ast 0.00001
>>>>>>>>>>>>> loading collections 0.00002
>>>>>>>>>>>>> instantiating plan 0.00005
>>>>>>>>>>>>> optimizing plan 0.00032
>>>>>>>>>>>>> executing 32.78841
>>>>>>>>>>>>> finalizing 0.00032
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>> Google Groups "ArangoDB" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>> .
>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>
>>>>>>>>>>>>