All _path values are unique, and we have about 20 different values for _type. I am not sure if I can break down the dataset into something smaller. The data is generally sensitive and not easy to share or anonymize.
On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>
> Hi,
> Can you share a set of sample documents? How well distributed are the
> values of `_type`? Which samples are there?
>
> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>
>> Recreating the indexes after import does not make a difference.
>>
>> Returning doc._type for 20,000 items takes 50 ms, returning doc._path
>> takes minutes.
>>
>> The _path index is deduplicated, the _type index is not.
>>
>> The only difference in the execution plans is "index only" when "RETURN
>> doc._type". Since both _type and _path are fully indexed, I would assume
>> that the query is executed based on index data in both cases.
>>
>> So ArangoDB will load all 100,000 objects to pick up the value of
>> _path? The overall data is meanwhile 55 GB (about one third of it is
>> binary data: base64-encoded files and images).
>>
>> This is no big problem for me, since we perform such queries once
>> before a migration run, and it does not matter whether a migration that
>> runs for some hours takes a few minutes more or less. But I want to
>> understand what is going on here (in particular, this is unexpected
>> behavior).
>>
>>
>> Query String:
>>  for doc in import
>>  filter doc._type == 'Image'
>>  return doc._type
>>
>> Execution plan:
>>  Id   NodeType          Est.   Comment
>>   1   SingletonNode        1   * ROOT
>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
>>   6   ReturnNode        2214       - RETURN #3
>>
>> Indexes used:
>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>
>> Optimization rules applied:
>>  Id   RuleName
>>   1   move-calculations-up
>>   2   move-filters-up
>>   3   move-calculations-up-2
>>   4   move-filters-up-2
>>   5   use-indexes
>>   6   remove-filter-covered-by-index
>>   7   remove-unnecessary-calculations-2
>>   8   reduce-extraction-to-projection
>>
>>
>> Query String:
>>  for doc in import
>>  filter doc._type == 'Image'
>>  return doc._path
>>
>> Execution plan:
>>  Id   NodeType          Est.   Comment
>>   1   SingletonNode        1   * ROOT
>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, projections: `_path` */
>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
>>   6   ReturnNode        2214       - RETURN #3
>>
>> Indexes used:
>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>
>> Optimization rules applied:
>>  Id   RuleName
>>   1   move-calculations-up
>>   2   move-filters-up
>>   3   move-calculations-up-2
>>   4   move-filters-up-2
>>   5   use-indexes
>>   6   remove-filter-covered-by-index
>>   7   remove-unnecessary-calculations-2
>>   8   reduce-extraction-to-projection
>>
>>
>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>
>>> Using RocksDB (default installation).
>>>
>>> I create a new collection, including the indexes, for every import of
>>> the data.
>>>
>>> Unfortunately, I have no control over the key names. They come from a
>>> JSON dump of a CMS.
>>>
>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>
>>>> Hi,
>>>> AFAIR you're using RocksDB?
>>>>
>>>> Can you try to re-create that index on `_type`, `_path`, `_key` to
>>>> make better use of projections?
>>>>
>>>> Please note that you shouldn't use field names starting with `_`,
>>>> since they're defined as system-specific fields in ArangoDB.
>>>>
>>>> Cheers,
>>>> Willi
>>>>
>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>
>>>>> _key is a UUID4
>>>>> _path is a standard filesystem path, not longer than 100 chars each
>>>>>
>>>>> That cannot be the problem.
>>>>>
>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James Courtier-Dutton wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What is the average size of the returned data? It could just be the
>>>>>> time it takes to serialise the data being returned.
>>>>>>
>>>>>> James
>>>>>>
>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> this query
>>>>>>>
>>>>>>> for doc in import
>>>>>>> filter doc._type == 'Image'
>>>>>>> return {path: doc._path, key: doc._key}
>>>>>>>
>>>>>>> takes about 45 seconds on decent hardware with an import collection
>>>>>>> of about 100,000 items, about 21,000 of them with _type = 'Image'.
>>>>>>> There is an index on _type. Using PyArango as client... I really
>>>>>>> wonder why this query is running so slow?!
>>>>>>>
>>>>>>> Running ArangoDB 3.4.3
>>>>>>>
>>>>>>> Profile
>>>>>>>
>>>>>>> Query String:
>>>>>>>  for doc in import
>>>>>>>  filter doc._type == 'Image'
>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>
>>>>>>> Execution plan:
>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN #3
>>>>>>>
>>>>>>> Indexes used:
>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>
>>>>>>> Optimization rules applied:
>>>>>>>  Id   RuleName
>>>>>>>   1   move-calculations-up
>>>>>>>   2   move-filters-up
>>>>>>>   3   move-calculations-up-2
>>>>>>>   4   move-filters-up-2
>>>>>>>   5   use-indexes
>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>
>>>>>>> Query Statistics:
>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>>>>>>            0            0           0        20617          0        32.78928
>>>>>>>
>>>>>>> Query Profile:
>>>>>>>  Query Stage           Duration [s]
>>>>>>>  initializing               0.00001
>>>>>>>  parsing                    0.00010
>>>>>>>  optimizing ast             0.00001
>>>>>>>  loading collections        0.00002
>>>>>>>  instantiating plan         0.00005
>>>>>>>  optimizing plan            0.00032
>>>>>>>  executing                 32.78841
>>>>>>>  finalizing                 0.00032
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "ArangoDB" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
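For reference, Willi's suggestion in the thread (re-create the index on `_type`, `_path`, `_key` so the query can be answered from index data) could be sketched roughly as below. This is only an illustration under assumptions: the server URL, the `_system` database, and the helper name `covering_index_request` are hypothetical, authentication is left out, and the request is only built, not sent. Whether the optimizer then reports "index only" for the `_path` query is something to confirm with an explain run, not something this sketch guarantees.

```python
import json
import urllib.parse
import urllib.request

# Assumption for illustration: a local server on the default port and the
# `_system` database. Adjust to your own deployment.
BASE_URL = "http://localhost:8529/_db/_system"


def covering_index_request(collection, fields):
    """Build (but do not send) a POST /_api/index request for a hash index.

    Hypothetical helper: it only prepares the HTTP request, so it can be
    inspected without a running server. Authentication is omitted.
    """
    body = {
        "type": "hash",          # same index type as shown in the plans above
        "fields": list(fields),  # filter attribute first, returned ones after
        "unique": False,
        "sparse": False,
    }
    url = f"{BASE_URL}/_api/index?collection={urllib.parse.quote(collection)}"
    return urllib.request.Request(
        url=url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The query from the thread, unchanged. With all three attributes in one
# index, the hope is that no document lookup is needed per match.
QUERY = """
FOR doc IN import
  FILTER doc._type == 'Image'
  RETURN {path: doc._path, key: doc._key}
"""

request = covering_index_request("import", ["_type", "_path", "_key"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

After creating such an index, re-running the explain output for the query and checking the IndexNode comment for "index only" would show whether the projections are now served from the index alone.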
