May I get back to my sugestion once more? Could you instead of the index on `_type` create a combined index ot `_type`, `_path` and `_key` ? This should copy these fields into the index, so ArangoDB doesn't have to fetch the (big) documents. I gues fetching and decompressing them is huge.
Another suggestion would be to put the payload (You've got base64 encoded binary data, right?) into a separate collection, parted of the structural information. Cheers, Willi On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote: > > All _path value are unique, we have about 20 different values for _type. > I am not sure if I can break down the dataset into something smaller. > The data is in general sensitive and not easy to share or anonymize. > > > Am Freitag, 14. Juni 2019 13:03:59 UTC+2 schrieb Wilfried Gösgens: >> >> >> Hi, >> Can you share a set of sample documents? How well is the distribution on >> `_type` ? Which samples are there? >> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote: >>> >>> Recreating the indexes after import does not make a difference. >>> >>> Returning doc._path for 20.000 items takes 50 ms, returning doc._path >>> takes minutes >>> >>> The _path index is deduplicated, the _type index is not >>> >>> The only difference in the execution plans is "index only" when "RETURN >>> doc._type". Since both _type and _path >>> are fully indexed I would assume that the query is executed in both >>> times based on index data. >>> >>> So ArangoDB will load all 100.000 objects for picking up the value of >>> _path? The overall data is meanwhile 55 GB >>> (about one third of the data is binary data (files and images base64 >>> encoded). >>> >>> This is all no big problem for me since we perform such queries once >>> before a migration run and it does matter taking >>> a migration running for some hours a minutes more or less but I want to >>> understand what is going on here (in particular >>> this is unexpected behavior). >>> >>> >>> Query String: >>> for doc in import >>> filter doc._type == 'Image' >>> return doc._type >>> >>> Execution plan: >>> Id NodeType Est. Comment >>> 1 SingletonNode 1 * ROOT >>> 7 IndexNode 2214 - FOR doc IN import /* hash index >>> scan, index only, projections: `_type` */ >>> 5 CalculationNode 2214 - LET #3 = doc.`_type` /* attribute >>> expression */ /* collections used: doc : import */ >>> 6 ReturnNode 2214 - RETURN #3 >>> >>> Indexes used: >>> By Type Collection Unique Sparse Selectivity Fields >>> Ranges >>> 7 hash import false false 0.05 % [ `_type` ] >>> (doc.`_type` == "Image") >>> >>> Optimization rules applied: >>> Id RuleName >>> 1 move-calculations-up >>> 2 move-filters-up >>> 3 move-calculations-up-2 >>> 4 move-filters-up-2 >>> 5 use-indexes >>> 6 remove-filter-covered-by-index >>> 7 remove-unnecessary-calculations-2 >>> 8 reduce-extraction-to-projection >>> >>> >>> >>> Query String: >>> for doc in import >>> filter doc._type == 'Image' >>> return doc._path >>> >>> Execution plan: >>> Id NodeType Est. Comment >>> 1 SingletonNode 1 * ROOT >>> 7 IndexNode 2214 - FOR doc IN import /* hash index >>> scan, projections: `_path` */ >>> 5 CalculationNode 2214 - LET #3 = doc.`_path` /* attribute >>> expression */ /* collections used: doc : import */ >>> 6 ReturnNode 2214 - RETURN #3 >>> >>> Indexes used: >>> By Type Collection Unique Sparse Selectivity Fields >>> Ranges >>> 7 hash import false false 0.05 % [ `_type` ] >>> (doc.`_type` == "Image") >>> >>> Optimization rules applied: >>> Id RuleName >>> 1 move-calculations-up >>> 2 move-filters-up >>> 3 move-calculations-up-2 >>> 4 move-filters-up-2 >>> 5 use-indexes >>> 6 remove-filter-covered-by-index >>> 7 remove-unnecessary-calculations-2 >>> 8 reduce-extraction-to-projection >>> >>> >>> >>> >>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote: >>>> >>>> Using RocksDB (default installation). >>>> >>>> I create a new collection for every import of the data including the >>>> indexes. >>>> >>>> Unfortunately I don't have the key names in my hands. They are coming >>>> from a JSON dump of a CMS. >>>> >>>> Am Freitag, 14. Juni 2019 09:50:41 UTC+2 schrieb Wilfried Gösgens: >>>>> >>>>> Hi, >>>>> afair you're using rocksdb? >>>>> >>>>> can you try to re-create that index to be on `_type`, `_path`, `_key` >>>>> for better using of projections? >>>>> >>>>> Please note that you shouldn't use fieldnames starting with `_` since >>>>> they're defined as system specific fields in arangodb. >>>>> >>>>> Cheers, >>>>> Willi >>>>> >>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote: >>>>>> >>>>>> _key is a UUID4 >>>>>> _path is standard filesystem path not longer than 100 chars each >>>>>> >>>>>> That can not be the problem. >>>>>> >>>>>> Am Freitag, 14. Juni 2019 09:36:17 UTC+2 schrieb James >>>>>> Courtier-Dutton: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> What is the average size of the returned data? It could just be the >>>>>>> time it takes to serialise the data being returned >>>>>>> >>>>>>> James >>>>>>> >>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Hi there, >>>>>>>> >>>>>>>> this query >>>>>>>> >>>>>>>> for doc in import >>>>>>>> filter doc._type == 'Image' >>>>>>>> return {path: doc._path, key: doc._key} >>>>>>>> >>>>>>>> takes about 45 seconds on decent hardware with an import collection >>>>>>>> of about 100.000 items with about 21.000 of _type = 'Image'. >>>>>>>> There is an index of _type. Using PyArango as client...I really >>>>>>>> wander why this query is running so slow?! >>>>>>>> >>>>>>>> Running ArangoDB 3.4.3 >>>>>>>> >>>>>>>> Profile >>>>>>>> >>>>>>>> Query String: >>>>>>>> for doc in import >>>>>>>> filter doc._type == 'Image' >>>>>>>> return {path: doc._path, key: doc._key} >>>>>>>> >>>>>>>> Execution plan: >>>>>>>> Id NodeType Calls Items Runtime [s] Comment >>>>>>>> 1 SingletonNode 1 1 0.00000 * ROOT >>>>>>>> 7 IndexNode 21 20617 32.73956 - FOR doc >>>>>>>> IN import /* hash index scan, projections: `_key`, `_path` */ >>>>>>>> 5 CalculationNode 21 20617 0.04354 - LET #3 >>>>>>>> = { "path" : doc.`_path`, "key" : doc.`_key` } /* simple expression >>>>>>>> */ >>>>>>>> /* collections used: doc : import */ >>>>>>>> 6 ReturnNode 21 20617 0.00016 - RETURN >>>>>>>> #3 >>>>>>>> >>>>>>>> Indexes used: >>>>>>>> By Type Collection Unique Sparse Selectivity Fields >>>>>>>> Ranges >>>>>>>> 7 hash import false false 0.05 % [ `_type` >>>>>>>> ] (doc.`_type` == "Image") >>>>>>>> >>>>>>>> Optimization rules applied: >>>>>>>> Id RuleName >>>>>>>> 1 move-calculations-up >>>>>>>> 2 move-filters-up >>>>>>>> 3 move-calculations-up-2 >>>>>>>> 4 move-filters-up-2 >>>>>>>> 5 use-indexes >>>>>>>> 6 remove-filter-covered-by-index >>>>>>>> 7 remove-unnecessary-calculations-2 >>>>>>>> 8 reduce-extraction-to-projection >>>>>>>> >>>>>>>> Query Statistics: >>>>>>>> Writes Exec Writes Ign Scan Full Scan Index Filtered >>>>>>>> Exec Time [s] >>>>>>>> 0 0 0 20617 0 >>>>>>>> 32.78928 >>>>>>>> >>>>>>>> Query Profile: >>>>>>>> Query Stage Duration [s] >>>>>>>> initializing 0.00001 >>>>>>>> parsing 0.00010 >>>>>>>> optimizing ast 0.00001 >>>>>>>> loading collections 0.00002 >>>>>>>> instantiating plan 0.00005 >>>>>>>> optimizing plan 0.00032 >>>>>>>> executing 32.78841 >>>>>>>> finalizing 0.00032 >>>>>>>> >>>>>>>> -- >>>>>>>> You received this message because you are subscribed to the Google >>>>>>>> Groups "ArangoDB" group. >>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>> send an email to [email protected]. >>>>>>>> To view this discussion on the web visit >>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com >>>>>>>> >>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>>>> . >>>>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>>>> >>>>>>> -- You received this message because you are subscribed to the Google Groups "ArangoDB" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/65ebdfca-dfb1-4803-9ef8-c646072ad0d5%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
