Something like that.
You might choose a finer granularity than a minute if you're really getting
that many ticks per minute, but you probably want a consistent granularity
to make it easy to find what you're looking for.
You'll probably also want the date in the key.
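
For example, here's a minimal sketch of that kind of key scheme; the exact
format (symbol + UTC date + minute) is just an illustration, not anything
Geode requires:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    public final class TickKeys {
        // Produces bucket keys like "MSFT-2016-02-23-08:00".
        private static final DateTimeFormatter MINUTE = DateTimeFormatter
                .ofPattern("yyyy-MM-dd-HH:mm").withZone(ZoneOffset.UTC);

        public static String minuteKey(String symbol, Instant ts) {
            return symbol + "-" + MINUTE.format(ts);
        }
    }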


--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Tue, Feb 23, 2016 at 11:07 AM, Andrew Munn <[email protected]> wrote:

> How does that work when you're appending incoming data in real time?  Say
> you're getting 1,000,000 data points per day on each of 1,000 incoming
> stock symbols.  That is 1 billion data points per day.  Are you using keys
> like this, bucketing the data into one array per minute of the day?
>
>         MSFT-08:00
>         MSFT-08:01
>         ...
>         MSFT-08:59
>         etc.
>
> Each array might have several thousand elements in that case.
>
> Thanks
> Andrew
>
> On Mon, 22 Feb 2016, Michael Stolz wrote:
>
> > You will definitely want to use arrays rather than storing each
> > individual data point, because the overhead of each entry in Geode is
> > nearly 300 bytes.
> >
> > You could choose to partition by day/week/month, but it shouldn't be
> > necessary, because the default partitioning scheme should be random
> > enough to get reasonable distribution if you are using the metadata and
> > starting timestamp of the array as the key.
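> >
> > As a rough sketch of that layout (assuming a current Geode release where
> > the API lives in org.apache.geode, and with Tick standing in for
> > whatever parsed tick class you use), a partitioned region holding one
> > array per key looks something like this:
> >
> >     import org.apache.geode.cache.Cache;
> >     import org.apache.geode.cache.Region;
> >     import org.apache.geode.cache.RegionShortcut;
> >
> >     public final class TickRegion {
> >         private final Region<String, Tick[]> ticks;
> >
> >         TickRegion(Cache cache) {
> >             // PARTITION hashes each key across buckets, so
> >             // metadata+timestamp keys spread out without a custom
> >             // partition resolver.
> >             this.ticks = cache
> >                     .<String, Tick[]>createRegionFactory(RegionShortcut.PARTITION)
> >                     .create("ticks");
> >         }
> >
> >         void storeBucket(String key, Tick[] bucket) {
> >             ticks.put(key, bucket);  // one array entry, not one entry per tick
> >         }
> >     }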
> >
> >
> > --
> > Mike Stolz
> > Principal Engineer, GemFire Product Manager
> > Mobile: 631-835-4771
> >
> > On Fri, Feb 19, 2016 at 1:43 PM, Alan Kash <[email protected]> wrote:
> > Hi,
> >
> > I am also building a dashboard prototype for time-series data.
> >
> > For time-series data, we usually target a single changing metric (stock
> > price, temperature, pressure, etc.) for an entity, while the metadata
> > associated with the event -
> > {StockName/Place, DeviceID, ApplicationID, EventType} - remains constant.
> >
> > For a backend like Cassandra, we denormalize and put everything in a
> > flat key-map with [Metric, Timestamp, DeviceID, Type] as the key. This
> > results in duplication of the associated "Metadata".
> >
> > Do you recommend a similar approach for Geode?
> >
> > Alternatively, we could have an array of Metrics associated with a given
> > Metadata key and store it in a map:
> >
> > Key = [Metadata, Timestamp]
> >
> > TSMAP<Key, Array<Metric>> series = [1,2,3,4,5,6,7,8,9]
> >
> > We could partition this at the application level by day/week/month.
> >
> > Is this approach better?
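> >
> > As a concrete sketch of that composite key (the names are illustrative;
> > the main requirement for a region key is stable equals/hashCode and
> > serializability):
> >
> >     import java.io.Serializable;
> >     import java.util.Objects;
> >
> >     public final class SeriesKey implements Serializable {
> >         final String metadata;        // e.g. "MSFT|NASDAQ|trade"
> >         final long bucketStartMicros; // start timestamp of the bucket
> >
> >         SeriesKey(String metadata, long bucketStartMicros) {
> >             this.metadata = metadata;
> >             this.bucketStartMicros = bucketStartMicros;
> >         }
> >
> >         @Override public boolean equals(Object o) {
> >             if (!(o instanceof SeriesKey)) return false;
> >             SeriesKey k = (SeriesKey) o;
> >             return bucketStartMicros == k.bucketStartMicros
> >                     && metadata.equals(k.metadata);
> >         }
> >
> >         @Override public int hashCode() {
> >             return Objects.hash(metadata, bucketStartMicros);
> >         }
> >     }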
> >
> > There is a metrics spec for TS data modeling for those who are
> > interested: http://metrics20.org
> >
> > Thanks
> >
> >
> >
> > On Fri, Feb 19, 2016 at 1:11 PM, Michael Stolz <[email protected]> wrote:
> > You will likely get the best results in terms of speed of access if you
> > put some structure around the way you store the data in memory.
> >
> > First off, you would probably want to parse the data into the individual
> > fields and create a Java object that represents that structure.
> >
> > Then you would probably want to bundle those Java objects into arrays in
> > such a way that it is easy to get to the array for a particular date and
> > time, using the combination of ticker, date, and time as the key.
> >
> > Those arrays of Java objects are what you would store as entries in
> > Geode. I think this would give you the fastest access to the data.
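> >
> > A minimal sketch of that parsing step (the column meanings after the
> > timestamp are guesses from the sample rows below, so treat the field
> > names as assumptions):
> >
> >     import java.time.LocalDateTime;
> >     import java.time.format.DateTimeFormatter;
> >
> >     public final class Tick {
> >         static final DateTimeFormatter TS =
> >                 DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
> >
> >         final LocalDateTime time;  // see below: int Julian day + long micros
> >         final double price;
> >         final int size;
> >         final int cumSize;         // guessed: running total of size
> >         final double bid, ask;     // guessed from the repeated 1926.75s
> >         final long seq;
> >
> >         Tick(String csvLine) {
> >             String[] f = csvLine.split(",");
> >             this.time = LocalDateTime.parse(f[0], TS);
> >             this.price = Double.parseDouble(f[1]);
> >             this.size = Integer.parseInt(f[2]);
> >             this.cumSize = Integer.parseInt(f[3]);
> >             this.bid = Double.parseDouble(f[4]);
> >             this.ask = Double.parseDouble(f[5]);
> >             this.seq = Long.parseLong(f[6]);
> >         }
> >     }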
> >
> > By the way, it's probably better to use an integer Julian date and a
> > long integer for the time rather than a Java Date. Java Dates in Geode
> > PDX are way bigger than you want when you have millions of them.
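> >
> > A small sketch of that compact representation (JulianFields ships with
> > java.time; micros-of-day is just one reasonable encoding for the time):
> >
> >     import java.time.LocalDateTime;
> >     import java.time.temporal.JulianFields;
> >
> >     public final class CompactTime {
> >         // Julian day number fits easily in an int.
> >         static int julianDay(LocalDateTime t) {
> >             return (int) t.getLong(JulianFields.JULIAN_DAY);
> >         }
> >
> >         // Microseconds since midnight; max 86,399,999,999 fits in a long.
> >         static long microsOfDay(LocalDateTime t) {
> >             return t.toLocalTime().toNanoOfDay() / 1_000;
> >         }
> >     }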
> >
> > Looking at the sample dataset you provided, it appears there is a lot
> > of redundant data in there - repeating 1926.75, for instance.
> > In fact, all but two of the fields are the same in every row. Are the
> > repetitious fields necessary? If they are, you might consider a columnar
> > approach instead of the Java structures I mentioned: make an array for
> > each column and compact the repetitions with a count, as in the sketch
> > below. It would be slower but more compact.
> > The timestamps are all the same too. Strange.
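> >
> > Here is a sketch of one run-length-encoded column - an illustration of
> > the value/count compaction, not a Geode feature:
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     public final class Rle {
> >         public static final class Run {
> >             final double value;
> >             final int count;
> >             Run(double value, int count) { this.value = value; this.count = count; }
> >         }
> >
> >         // Compact repeated values in a column into (value, count) runs.
> >         static List<Run> encode(double[] column) {
> >             List<Run> runs = new ArrayList<>();
> >             for (double v : column) {
> >                 Run last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
> >                 if (last != null && last.value == v) {
> >                     runs.set(runs.size() - 1, new Run(v, last.count + 1));
> >                 } else {
> >                     runs.add(new Run(v, 1));
> >                 }
> >             }
> >             return runs;
> >         }
> >     }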
> >
> >
> >
> > --
> > Mike Stolz
> > Principal Engineer, GemFire Product Manager
> > Mobile: 631-835-4771
> >
> > On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:
> > Hi Andrew,
> >
> > I'll let one of the committers answer your specific data file question.
> > However, you might find some inspiration in this open source demo that
> > some of the Geode team presented at OSCON earlier this year:
> > http://pivotal-open-source-hub.github.io/StockInference-Spark/
> >
> > This was based on a pre-release version of Geode, so you'll want to sub
> > the M1 release in and see if any other tweaks are required at that point.
> >
> > I believe this video and presentation go with the GitHub project:
> > http://www.infoq.com/presentations/r-gemfire-spring-xd
> >
> > On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:
> > What would be the best way to use Geode (or GemFire) to store and
> > utilize financial time series data like a stream of stock trades?  I
> > have ASCII files with timestamps that include microseconds:
> >
> >     2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
> >     2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,
> >
> > I have one file per day and each file can have over 1,000,000 rows.  My
> > thought is to fault in the files and parse the ASCII as needed.  I know
> > I could store the data as binary primitives in a file on disk instead of
> > ASCII for a bit more speed.
> >
> > I don't have a cluster of machines to create an HDFS cluster with.  My
> > machine does have 128GB of RAM though.
> >
> > Thanks!
> >
> >
> >
> >
> > --
> > Greg Chase
> > Global Head, Big Data Communities
> > http://www.pivotal.io/big-data
> >
> > Pivotal Software
> > http://www.pivotal.io/
> >
> > 650-215-0477
> > @GregChase
> > Blog: http://geekmarketing.biz/
