You will likely get the best access speed if you put some structure
around the way you store the data in memory.

First off, you would probably want to parse each row into its individual
fields and create a Java object that represents that structure.
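
For example, something like this (just a sketch; the field names are
guesses from your sample rows, not a known schema):

    // One parsed row. Field names are guesses from the sample data.
    public class Trade {
        int dateYmd;       // e.g. 20160217 (see the note on dates below)
        long microsOfDay;  // e.g. 64800000660 for 18:00:00.000660
        double price;      // 1926.75
        int size;          // 5
        int cumSize;       // 5, 85, 86, ...
        double bid;
        double ask;
        long seqNo;        // 14644971
        char exch;         // 'C'
        // ...plus the two trailing numeric fields
    }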

Then you would probably want to bundle those Java objects into arrays so
that it is easy to get at all the trades for a particular ticker, date,
and time, using the combination of ticker and date/time as the key.

Those arrays of Java objects are what you would store as entries in Geode.
I think this would give you the fastest access to the data.
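
Roughly like this (a sketch only; TradeKey is a made-up key class,
"Trades" and "ESH6" are made-up names, and the Region package name is as
of the M1 release):

    import com.gemstone.gemfire.cache.Region;

    // Key: ticker plus integer date. Geode keys need sensible
    // equals/hashCode and must be serializable.
    public class TradeKey implements java.io.Serializable {
        private final String ticker;
        private final int dateYmd;

        public TradeKey(String ticker, int dateYmd) {
            this.ticker = ticker;
            this.dateYmd = dateYmd;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof TradeKey)) return false;
            TradeKey k = (TradeKey) o;
            return dateYmd == k.dateYmd && ticker.equals(k.ticker);
        }

        @Override public int hashCode() {
            return 31 * ticker.hashCode() + dateYmd;
        }
    }

    // One entry per ticker per day. 'cache' is your Cache/ClientCache
    // and dayOfTrades is the Trade[] parsed from one day's file.
    Region<TradeKey, Trade[]> trades = cache.getRegion("Trades");
    trades.put(new TradeKey("ESH6", 20160217), dayOfTrades);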

By the way, it is probably better to use an integer Julian date and a long
integer for the time rather than a Java Date. Java Dates in Geode PDX are
way bigger than you want when you have millions of them.
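
For example, with java.time (microseconds-since-midnight is just one
reasonable encoding for your timestamps):

    import java.time.LocalDate;
    import java.time.LocalTime;
    import java.time.temporal.JulianFields;

    // "2016-02-17 18:00:00.000660" -> two compact primitives
    LocalDate d = LocalDate.parse("2016-02-17");
    LocalTime t = LocalTime.parse("18:00:00.000660");

    int julianDate = (int) d.getLong(JulianFields.JULIAN_DAY); // 2457436
    long microsOfDay = t.toNanoOfDay() / 1_000L;               // 64800000660

That is 12 bytes per timestamp, far smaller than a serialized Date.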

Looking at the sample dataset you provided, it appears there is a lot of
redundant data in there, repeating 1926.75 for instance.
In fact, all but two of the fields are identical from row to row. Are the
repetitious fields necessary? If they are, then you might consider a
columnar approach instead of the Java structures I mentioned: make an
array for each column and compact the repetitions with a count, as in the
sketch below. It would be slower to read but more compact.
The timestamps are all the same too. Strange.
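
Here is roughly what I mean by compacting repetitions with a count (a
generic run-length-encoding sketch, nothing Geode-specific):

    import java.util.ArrayList;
    import java.util.List;

    // One column of doubles, stored as {value, repeat-count} runs.
    public class RleDoubleColumn {
        private final List<double[]> runs = new ArrayList<>();

        public void append(double value) {
            double[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == value) {
                last[1]++;                           // extend the current run
            } else {
                runs.add(new double[] { value, 1 }); // start a new run
            }
        }

        public double get(long row) {                // O(number of runs)
            long seen = 0;
            for (double[] run : runs) {
                seen += (long) run[1];
                if (row < seen) return run[0];
            }
            throw new IndexOutOfBoundsException("row " + row);
        }
    }

Your 1926.75 column would collapse to a single {1926.75, n} run per day.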



--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Feb 19, 2016 at 12:15 AM, Gregory Chase <[email protected]> wrote:

> Hi Andrew,
> I'll let one of the committers answer to your specific data file question.
> However, you might find some inspiration in this open source demo that some
> of the Geode team presented at OSCON earlier this year:
> http://pivotal-open-source-hub.github.io/StockInference-Spark/
>
> This was based on a pre-release version of Geode, so you'll want to sub
> the M1 release in and see if any other tweaks are required at that point.
>
> I believe this video and presentation go with the Github project:
> http://www.infoq.com/presentations/r-gemfire-spring-xd
>
> On Thu, Feb 18, 2016 at 8:58 PM, Andrew Munn <[email protected]> wrote:
>
>> What would be the best way to use Geode (or GF) to store and utilize
>> financial time series data like a stream of stock trades?  I have ASCII
>> files with timestamps that include microseconds:
>>
>> 2016-02-17 18:00:00.000660,1926.75,5,5,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,80,85,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,1,86,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,6,92,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,27,119,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,3,122,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,5,127,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,4,131,1926.75,1926.75,14644971,C,43,01,
>> 2016-02-17 18:00:00.000660,1926.75,2,133,1926.75,1926.75,14644971,C,43,01,
>>
>> I have one file per day and each file can have over 1,000,000 rows.  My
>> thought is to fault in the files and parse the ASCII as needed.  I know I
>> could store the data as binary primitives in a file on disk instead of
>> ASCII for a bit more speed.
>>
>> I don't have a cluster of machines to create an HDFS cluster with.  My
>> machine does have 128GB of RAM though.
>>
>> Thanks!
>>
>
>
>
> --
> Greg Chase
>
> Global Head, Big Data Communities
> http://www.pivotal.io/big-data
>
> Pivotal Software
> http://www.pivotal.io/
>
> 650-215-0477
> @GregChase
> Blog: http://geekmarketing.biz/
>
>
