You will likely get the best access speed if you put some structure around the way you store the data in memory.
First off, you would probably want to parse the data into its individual fields and create a Java object that represents that structure. Then bundle those Java objects into arrays in such a way that it is easy to get at the array for a particular ticker and date, using the combination of ticker and date as the key. Those arrays of Java objects are what you would store as entries in Geode. I think this would give you the fastest access to the data. (Rough sketches of what I mean are at the bottom of this message, below the quoted thread.)

By the way, it is probably better to use an integer Julian date and a long integer for the time rather than a Java Date. Java Dates in Geode PDX are far bigger than you want when you have millions of them.

Looking at the sample dataset you provided, it appears there is a lot of redundant data in there, the repeating 1926.75 for instance. In fact, every field but two is the same from row to row. Are the repetitious fields necessary? If they are, then you might consider a columnar approach instead of the Java structures I mentioned: make an array for each column and compact the repetitions with a count. It would be slower but more compact. (That is also sketched below.) The timestamps are all the same too. Strange.

--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:

> Hi Andrew,
> I'll let one of the committers answer your specific data file question.
> However, you might find some inspiration in this open source demo that
> some of the Geode team presented at OSCON earlier this year:
> http://pivotal-open-source-hub.github.io/StockInference-Spark/
>
> This was based on a pre-release version of Geode, so you'll want to sub
> in the M1 release and see if any other tweaks are required at that point.
>
> I believe this video and presentation go with the GitHub project:
> http://www.infoq.com/presentations/r-gemfire-spring-xd
>
> On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:
>
>> What would be the best way to use Geode (or GemFire) to store and
>> utilize financial time series data like a stream of stock trades? I have
>> ASCII files with timestamps that include microseconds:
>>
>> 2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,
>>
>> I have one file per day and each file can have over 1,000,000 rows. My
>> thought is to fault in the files and parse the ASCII as needed. I know I
>> could store the data as binary primitives in a file on disk instead of
>> ASCII for a bit more speed.
>>
>> I don't have a cluster of machines to create an HDFS cluster with. My
>> machine does have 128GB of RAM though.
>>
>> Thanks!
>
> --
> Greg Chase
>
> Global Head, Big Data Communities
> http://www.pivotal.io/big-data
>
> Pivotal Software
> http://www.pivotal.io/
>
> 650-215-0477
> @GregChase
> Blog: http://geekmarketing.biz/
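Here is a rough sketch of the kind of row object I mean, in plain Java. I am guessing at what the fields past the price are (size, cumulative size, bid, ask, sequence number, exchange code, and two flag fields), so rename them to match the real schema. Note the integer day number and long microseconds-since-midnight in place of a Java Date:

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    // One parsed row. Field names after "price" are guesses at the schema.
    public class TradeTick {
        // Integer day number (days since 1970-01-01) instead of a Date
        public int julianDate;
        // Microseconds since midnight instead of a Date/timestamp object
        public long timeOfDayMicros;
        public double price;
        public int size;           // hypothetical name for field 3
        public int cumulativeSize; // hypothetical name for field 4
        public double bid;
        public double ask;
        public long sequence;
        public String exchange;
        public String flag1;
        public String flag2;

        private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");

        public static TradeTick parse(String csvLine) {
            String[] f = csvLine.split(",");
            LocalDateTime ts = LocalDateTime.parse(f[0], FMT);
            TradeTick t = new TradeTick();
            t.julianDate = (int) ts.toLocalDate().toEpochDay();
            t.timeOfDayMicros = ts.toLocalTime().toNanoOfDay() / 1_000L;
            t.price = Double.parseDouble(f[1]);
            t.size = Integer.parseInt(f[2]);
            t.cumulativeSize = Integer.parseInt(f[3]);
            t.bid = Double.parseDouble(f[4]);
            t.ask = Double.parseDouble(f[5]);
            t.sequence = Long.parseLong(f[6]);
            t.exchange = f[7];
            t.flag1 = f[8];
            t.flag2 = f[9];
            return t;
        }
    }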
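Bundling and keying would then look something like this. The key format and the loader class are made up for illustration, and note that on the M1 release the Region class is still under the com.gemstone.gemfire packages rather than org.apache.geode:

    import java.util.List;

    import org.apache.geode.cache.Region;

    public class TickLoader {
        // Key one whole trading day per ticker, e.g. "ESH6|16848"
        // (ticker symbol here is made up).
        static String dayKey(String ticker, int julianDate) {
            return ticker + "|" + julianDate;
        }

        // Parse one day's file and store the whole day as a single entry.
        public static void loadDay(Region<String, TradeTick[]> region,
                                   String ticker, List<String> csvLines) {
            TradeTick[] day = new TradeTick[csvLines.size()];
            for (int i = 0; i < day.length; i++) {
                day[i] = TradeTick.parse(csvLines.get(i));
            }
            if (day.length > 0) {
                region.put(dayKey(ticker, day[0].julianDate), day);
            }
        }
    }

Reading a day back is then a single region.get(dayKey(ticker, day)), and the array comes back already in time order.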
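And if those repetitious columns really do have to be kept, this is the kind of per-column run-length compaction I mean. There is nothing Geode-specific here; a set of these columns per ticker per day would become the region value instead of the TradeTick array. Random access has to walk the runs (or binary-search a cumulative count), which is why it is slower:

    import java.util.ArrayList;
    import java.util.List;

    // Run-length-encoded column: values.get(i) repeats counts.get(i) times.
    public class RleColumn {
        private final List<Double> values = new ArrayList<>();
        private final List<Integer> counts = new ArrayList<>();

        public void append(double v) {
            int last = values.size() - 1;
            if (last >= 0 && values.get(last) == v) {
                counts.set(last, counts.get(last) + 1); // extend current run
            } else {
                values.add(v);                          // start a new run
                counts.add(1);
            }
        }

        // O(number of runs) lookup of the value at a given row index.
        public double get(int row) {
            int remaining = row;
            for (int i = 0; i < values.size(); i++) {
                if (remaining < counts.get(i)) {
                    return values.get(i);
                }
                remaining -= counts.get(i);
            }
            throw new IndexOutOfBoundsException("row " + row);
        }
    }

In your nine sample rows, the price, bid, and ask columns would each collapse to a single (value, count) run.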
