On Mon, Oct 31, 2016 at 1:36 PM, <pieter...@gmail.com> wrote:

> On Monday, October 31, 2016 at 4:43:40 PM UTC+1, Sean Beckett wrote:
>
> > So every user always has all 100 other dimensions? And those dimensions
> > are 100% independent of each other? See
> > https://docs.influxdata.com/influxdb/v1.0//concepts/glossary/#series-cardinality
> > for more on dependent vs. independent tags.
>
> The tag values are almost completely independent of each other. There are
> three independent tags, one with 8 possible values, one with 6 (for now;
> in the future the number of values for this one might actually increase),
> and one boolean tag. 8*6*2=96. There is a dependency between the user-id
> and the 8-values tag: some users have only 3 different values for this
> tag, some 5, and some all eight. Similarly, some users only have a single
> value for the boolean tag, but some have two values. So a better estimate
> might be 5*6*1.5=45.
>
> Unfortunately, I did not realize that the number of measurements also
> factors into the cardinality of the database. We have 7 different
> measurements, all with the same tags but different values. I guess the
> cardinality is actually 7*45=315 before taking the user-id into account.
> This makes the issue a factor of 3 worse.
>
> Also, any extension (new tag, new measurement, increase of tag values)
> could potentially kill our project. Not a good place to be.
>
> > It's highly dependent on the string length of your tag keys and values
> > and the shape of the metadata.
>
> I would not have expected tag key length to be a factor, but I guess this
> makes sense, as InfluxDB is schema-less so tags can be added later at
> will.
>
> > E.g. 100 measurements of 1 series each will be
> > different from 100 measurements of 1 series each.
>
> I think you made a typo here somewhere because I read the same phrase
> twice.
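[Editor's note: the back-of-the-envelope cardinality arithmetic above can be written out as a small sketch. The tag counts are the ones quoted in this thread and are specific to the poster's schema, not general InfluxDB figures.]

```python
# Series-cardinality estimates using the numbers from this thread.
# Series cardinality is the product of the distinct-value counts of the
# (independent) tags, multiplied by the number of measurements.
worst_case = 8 * 6 * 2   # three independent tags with 8, 6, and 2 values
typical = 5 * 6 * 1.5    # some users never use some tag values
total = 7 * typical      # each of the 7 measurements repeats every series

print(worst_case, typical, total)  # 96 45.0 315.0
```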
Should have read "100 measurements of 1 series each will be different from
1 measurement of 100 series."

> > That makes it basically impossible to calculate, but if you really do
> > need 15 million series, that's going to require in the neighborhood of
> > 128-256GB of RAM.
>
> I understand it is difficult to estimate, but roughly 9-18KB per series
> just for the index sounds like a lot. But then again, I am no expert in
> time series databases, so what do I know. I will stick with your rough
> estimate for my feasibility study.

The RAM needs aren't just for the index; it's more that the index is going
to eat up 100+GB, so you'll need headroom for queries and writes to
complete. The inverted index stores each series more than once, and since
it stores a subset of series combinations, it grows as a very slow
exponential, not simply linear. This means that the more series, the more
RAM is needed per series.

> > > Would this high cardinality be less of an issue in a multi-node setup?
>
> > Yes. If you have, for example, 6 data nodes with a replication factor
> > of 2 for redundancy, then each node is only handling 1/3 of the total
> > series count. 5 million series per node is still very significant, but
> > with proper schema and lots of RAM, it is probably feasible.
>
> That is good news. Of course, the "7 measurements" factor would require 7
> times the servers or RAM, which does not sound feasible.

Why have seven measurements? Why not store all metrics in one measurement
with seven different field names? E.g. instead of

alice,tag=foo value=1
bob,tag=foo value=2
charlie,tag=foo value=3
diane,tag=foo value=4
edgar,tag=foo value=5
flora,tag=foo value=6
greg,tag=foo value=7

use

users,tag=foo alice=1,bob=2,charlie=3,diane=4,edgar=5,flora=6,greg=7

That's one series instead of seven.

> > > Are there any plans to mitigate the cardinality issues in such a use
> > > case?
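[Editor's note: the measurement-consolidation suggested above can be sketched by building the line-protocol string by hand. The names (`users`, `tag=foo`, `alice` ... `greg`) come from the example in this thread; a real deployment would normally go through a client library rather than formatting strings directly.]

```python
# Collapse seven per-user measurements into one measurement ("users")
# carrying seven fields: one series instead of seven.
values = {"alice": 1, "bob": 2, "charlie": 3, "diane": 4,
          "edgar": 5, "flora": 6, "greg": 7}

# Line protocol: measurement,tag_set field_set
fields = ",".join(f"{user}={v}" for user, v in values.items())
line = f"users,tag=foo {fields}"
print(line)
# users,tag=foo alice=1,bob=2,charlie=3,diane=4,edgar=5,flora=6,greg=7
```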
> > https://github.com/influxdata/influxdb/issues/7151
>
> That is great news ;-)
>
> > > Would the second approach (storing the data twice) actually help, or
> > > would it require the same amount of memory (or even more) than the
> > > straightforward approach?
>
> > Slightly more than double, would be my guess. The in-RAM index is per
> > InfluxDB instance, not per database or per series. There's no way to
> > break it down. The total series index for all databases must (currently)
> > always live in RAM.
>
> I also deem this good news, as I can forget about the ugly approach and
> focus on the straightforward one :-)
>
> In the end it might boil down to a solution for issue 7151 for our
> use-case to be feasible.

7151 is a near-term goal for us (3-6 months), so we should achieve that
long before your actual cardinality is a concern.

--
Sean Beckett
Director of Support and Professional Services
InfluxDB

--
Remember to include the version number!
---
You received this message because you are subscribed to the Google Groups
"InfluxData" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to influxdb+unsubscr...@googlegroups.com.
To post to this group, send email to influxdb@googlegroups.com.
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit
https://groups.google.com/d/msgid/influxdb/CALGqCvNnizdn%2B8_W_Fsu_J7KP5CgLfmwVZvBi3cxueF-OGqZfg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.