On Mon, Oct 31, 2016 at 1:36 PM, <pieter...@gmail.com> wrote:

> On Monday, October 31, 2016 at 4:43:40 PM UTC+1, Sean Beckett wrote:
>
> > So every user always has all 100 other dimensions? And those dimensions
> > are 100% independent of each other? See
> > https://docs.influxdata.com/influxdb/v1.0//concepts/glossary/#series-cardinality
> > for more on dependent vs. independent tags.
>
> The tag values are almost completely independent of each other. There are
> three independent tags, one with 8 possible values, one with 6 (for now;
> in the future the number of values for this one might actually increase),
> and one boolean tag. 8*6*2=96. There is a dependency between the user-id
> and the 8-values tag: some users have only 3 different values for this
> tag, some 5, and some all eight. Similarly, some users only have a single
> value for the boolean tag, but some have two values. So a better estimate
> might be 5*6*1.5=45.
>
> Unfortunately, I did not realize that the number of measurements also
> factors into the cardinality of the database. We have 7 different
> measurements, all with the same tags but different values. I guess the
> cardinality is actually 7*45=315 before taking the user-id into account.
> This makes the issue a factor of 3 worse.
>
> Also, any extension (new tag, new measurement, increase of tag values)
> could potentially kill our project. Not a good place to be.
>
> > It's highly dependent on the string length of your tag keys and values
> > and the shape of the metadata.
>
> I would not have expected tag key length to be a factor, but I guess this
> makes sense, as InfluxDB is schema-less so tags can be added later at
> will.
>
> > E.g. 100 measurements of 1 series each will be
> > different from 100 measurements of 1 series each.
>
> I think you made a typo here somewhere because I read the same phrase
> twice.
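[Editor's note: the back-of-the-envelope cardinality arithmetic above can be written out as a small sketch. The tag counts are the ones quoted in this thread and are specific to the poster's schema, not general InfluxDB figures.]

```python
# Series-cardinality estimates using the numbers from this thread.
# Series cardinality is the product of the distinct-value counts of the
# (independent) tags, multiplied by the number of measurements.
worst_case = 8 * 6 * 2   # three independent tags with 8, 6, and 2 values
typical = 5 * 6 * 1.5    # some users never use some tag values
total = 7 * typical      # each of the 7 measurements repeats every series

print(worst_case, typical, total)  # 96 45.0 315.0
```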
Should have read "100 measurements of 1 series each will be different from
1 measurement of 100 series."

> > That makes it basically impossible to calculate, but if you really do
> > need 15 million series, that's going to require in the neighborhood of
> > 128-256GB of RAM.
>
> I understand it is difficult to estimate, but roughly 9-18KB per series
> just for the index sounds like a lot. But then again, I am no expert in
> time series databases, so what do I know. I will stick with your rough
> estimate for my feasibility study.

The RAM needs aren't just for the index; it's more that the index is going
to eat up 100+GB, so you'll need headroom for queries and writes to
complete. The inverted index stores each series more than once, and since
it stores a subset of series combinations, it grows as a very slow
exponential, not simply linear. This means that the more series, the more
RAM is needed per series.

> > > Would this high cardinality be less of an issue in a multi-node setup?
>
> > Yes. If you have, for example, 6 data nodes with a replication factor
> > of 2 for redundancy, then each node is only handling 1/3 of the total
> > series count. 5 million series per node is still very significant, but
> > with proper schema and lots of RAM, it is probably feasible.
>
> That is good news. Of course, the "7 measurements" factor would require 7
> times the servers or RAM, which does not sound feasible.

Why have seven measurements? Why not store all metrics in one measurement
with seven different field names? E.g. instead of

alice,tag=foo value=1
bob,tag=foo value=2
charlie,tag=foo value=3
diane,tag=foo value=4
edgar,tag=foo value=5
flora,tag=foo value=6
greg,tag=foo value=7

use

users,tag=foo alice=1,bob=2,charlie=3,diane=4,edgar=5,flora=6,greg=7

That's one series instead of seven.

> > > Are there any plans to mitigate the cardinality issues in such a use
> > > case?
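[Editor's note: the measurement-consolidation suggested above can be sketched by building the line-protocol string by hand. The names (`users`, `tag=foo`, `alice` ... `greg`) come from the example in this thread; a real deployment would normally go through a client library rather than formatting strings directly.]

```python
# Collapse seven per-user measurements into one measurement ("users")
# carrying seven fields: one series instead of seven.
values = {"alice": 1, "bob": 2, "charlie": 3, "diane": 4,
          "edgar": 5, "flora": 6, "greg": 7}

# Line protocol: measurement,tag_set field_set
fields = ",".join(f"{user}={v}" for user, v in values.items())
line = f"users,tag=foo {fields}"
print(line)
# users,tag=foo alice=1,bob=2,charlie=3,diane=4,edgar=5,flora=6,greg=7
```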
> > https://github.com/influxdata/influxdb/issues/7151
>
> That is great news ;-)
>
> > > Would the second approach (storing the data twice) actually help, or
> > > would it require the same amount of memory (or even more) than the
> > > straightforward approach?
>
> > Slightly more than double, would be my guess. The in-RAM index is per
> > InfluxDB instance, not per database or per series. There's no way to
> > break it down. The total series index for all databases must (currently)
> > always live in RAM.
>
> I also deem this good news, as I can forget about the ugly approach and
> focus on the straightforward one :-)
>
> In the end it might boil down to a solution for issue 7151 for our
> use-case to be feasible.

7151 is a near-term goal for us (3-6 months), so we should achieve that
long before your actual cardinality is a concern.

--
Sean Beckett
Director of Support and Professional Services
InfluxDB

--
Remember to include the version number!
---
You received this message because you are subscribed to the Google Groups
"InfluxData" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to influxdb+unsubscr...@googlegroups.com.
To post to this group, send email to influxdb@googlegroups.com.
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit
https://groups.google.com/d/msgid/influxdb/CALGqCvNnizdn%2B8_W_Fsu_J7KP5CgLfmwVZvBi3cxueF-OGqZfg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.