Re: design advice for market data

2014-03-08 Thread Bobby Richards
Wanted to hit the list to get some more advice to finalize my market data 
design.  

Currently I have about 10 million events per week.  I would like to keep 
weekly indexes because they provide a nice logical separation of data (i.e. 
markets are closed on the weekend).
As of now I am using the default of 5 shards, which I was thinking of 
bumping to 10.  Right now I am routing on symbol, of which there are about 
20, and I am wondering if I should just set the number of shards equal to the 
number of symbols?

Data is about 1.5 GB per week, so with 10 shards that is 150 MB each, but I see 
that GitHub has 120 GB per shard (albeit with much beefier machines).
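
For reference, the per-shard arithmetic above can be sketched as follows. This is only a rough illustration: it assumes data spreads evenly across shards, which routing about 20 symbols onto a handful of shards will not guarantee, and the function name is made up for this example.

```python
# Rough per-shard size for one weekly index, using the figures quoted above.
WEEKLY_INDEX_MB = 1500  # about 1.5 GB per week (figure from the post)


def shard_size_mb(index_size_mb: float, num_shards: int) -> float:
    """Average primary shard size, assuming an even spread across shards."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return index_size_mb / num_shards


# Compare a few candidate shard counts, including one shard per symbol (20).
for shards in (1, 5, 10, 20):
    size = shard_size_mb(WEEKLY_INDEX_MB, shards)
    print(f"{shards:>2} shards -> {size:.0f} MB per shard")
```

Even at a single shard, a weekly index at this volume would be far smaller than the 120 GB per shard figure mentioned above.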

I had thought about daily indexes, which is appealing because many queries 
will typically not span more than a day, and I would assume it is best to 
design indexes around the most frequent queries? 
Would I be able to combine the daily indexes into a weekly index 
and optimize it over the weekend?  Is this possible?

Also, I am trying to build candle data, which is represented by the open 
(head), high, low, and close (last) values of the time period, for which date 
histogram aggs are ideal.  High and low are easy, but as of now it's a 
two-step query.  Any clever ways to get the first and last element of the 
bucket with aggs?
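
To make the target output concrete, here is a minimal client-side sketch of the candle computation. It is not an Elasticsearch aggregation and does not answer the first/last question; it only shows the open/high/low/close values wanted per time bucket. The function name, the (timestamp_ms, price) tuple layout, and the sample prices are assumptions made for illustration.

```python
from collections import OrderedDict


def build_candles(events, interval_ms):
    """Group (timestamp_ms, price) events into OHLC candles per time bucket."""
    candles = OrderedDict()
    # Sort by timestamp only; the sort is stable, so ties keep input order.
    for ts, price in sorted(events, key=lambda e: e[0]):
        bucket = ts - ts % interval_ms  # start of the bucket this event falls in
        candle = candles.get(bucket)
        if candle is None:
            # First event in the bucket sets all four values.
            candles[bucket] = {"open": price, "high": price,
                               "low": price, "close": price}
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # last event seen so far in this bucket
    return candles


# Hypothetical quotes: three in one minute, one in the next.
events = [
    (1391653020000, 1.3456),
    (1391653025000, 1.3460),
    (1391653050000, 1.3450),
    (1391653080000, 1.3458),
]
for start, candle in build_candles(events, 60_000).items():
    print(start, candle)
```

High and low map directly onto max and min; open and close depend on event order within the bucket, which is why they are the awkward part.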

Just trying to nail this down and I appreciate any and all advice and 
feedback.

On Thursday, February 6, 2014 4:18:54 PM UTC-6, Bobby Richards wrote:

 Great, thanks.  I am not sure I would have found this on my own anytime 
 soon.  I'll look into it.

 Bobby



Re: design advice for market data

2014-02-06 Thread Alexander Reelsen
Hey,

the side field as defined in your mapping (I assume you use Elasticsearch
0.90.X) uses the standard analyzer, which by default removes stopwords. As
"a" is a stopword, it gets removed as part of the indexing process, and
that makes it impossible to search for. A good way to find out more about
this is to play around with the analyze API. If you like a nice
UI on top of that, go with the inquisitor plugin.

The analyze API basically tells you how a string is tokenized and stored
in the index, and which parts are being removed or altered (due to stemming, for
example).

See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
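
The effect described above can be imitated with a toy model. This is only a rough stand-in for the standard analyzer (split on word characters, lowercase, drop Lucene's default English stopwords), not its real implementation. The function name is made up, and the not_analyzed mapping at the end is one possible workaround that is an assumption here, not something proposed in this thread.

```python
import json
import re

# Lucene's default English stopword set, which the standard analyzer
# applies by default in the Elasticsearch versions discussed here.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def toy_standard_analyze(text):
    """Toy analyzer: split into word tokens, lowercase, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in ENGLISH_STOPWORDS]


print(toy_standard_analyze("a"))  # [] : nothing is indexed for side "a"
print(toy_standard_analyze("b"))  # ['b'] : side "b" stays searchable

# Assumption: indexing the field verbatim avoids analysis altogether.
side_mapping = {"side": {"type": "string", "index": "not_analyzed"}}
print(json.dumps(side_mapping))
```

With nothing indexed for "a", a term filter on side has no token to match, which is consistent with the empty result reported for that query.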


--Alex


Re: design advice for market data

2014-02-05 Thread Bobby Richards
So I have decided on using the week of year as the index and quotes as my 
type.  I want to clarify a couple of things that I am seeing.
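
As an illustration of the week-of-year naming used here (2014_6), an index name could be derived from a timestamp like this. The sketch assumes ISO 8601 week numbering; the actual week scheme behind the index names in this thread is not stated, and the function name is hypothetical.

```python
from datetime import datetime


def weekly_index_name(dt: datetime) -> str:
    """Build a '<year>_<week>' index name from a datetime (ISO week, assumed)."""
    iso_year, iso_week, _weekday = dt.isocalendar()
    return f"{iso_year}_{iso_week}"


# The date of this post falls in ISO week 6 of 2014.
print(weekly_index_name(datetime(2014, 2, 5)))
```

Using the ISO year rather than the calendar year keeps the last days of December and the first days of January in the same weekly index when they share a week.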

first I create my index:

curl 'http://localhost:9200/2014_6/quotes'

then I set my mapping:

curl -XPUT 'http://localhost:9200/2014_6/quotes/_mapping' -d '
{
  "quotes": {
    "properties": {
      "time_stamp": {"type": "date"},
      "symbol": {"type": "string"},
      "side": {"type": "string"},
      "price": {"type": "double"}
    },
    "_routing": {
      "required": true,
      "path": "symbol"
    },
    "_timestamp": {
      "enabled": true,
      "path": "time_stamp",
      "format": "date_hour_minute_second_millis"
    }
  }
}
'
Now, because of this, I understand that when I am posting a new event to be 
indexed I do not need to specify quote?routing=symbol.  However, my first 
question is that now I must include symbol in the JSON object I am posting; 
is this costing me more as far as storage?  If I do not do this via the 
mapping, I have no problem adding the routing to the URI, especially if it 
saves me space.
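
The URI alternative mentioned above could look roughly like this. The sketch only builds the request URL and body, it does not send anything, and the helper name and sample document are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def index_request(base_url, index, doc_type, doc, routing):
    """Return (url, body) for indexing one document with routing on the URI."""
    query = urlencode({"routing": routing})
    url = f"{base_url}/{index}/{doc_type}?{query}"
    return url, json.dumps(doc)


# Hypothetical quote with no symbol field in the body; the symbol is
# carried only as the routing value.
url, body = index_request(
    "http://localhost:9200", "2014_6", "quotes",
    {"time_stamp": 1391653001000, "side": "b", "price": 1.3457},
    routing="eurusd",
)
print(url)
print(body)
```

One trade-off to keep in mind: if symbol is left out of the document, it can no longer be used in a filter, and routing on its own only selects a shard.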

second I am seeing a couple of weird things...
by running this:

curl -XGET 'http://localhost:9200/2014_5/quotes/_search?routing=eurusd'

i get the following, which is good, what I expect.

{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},
 "hits":{"total":3,"max_score":1.0,"hits":[
  {"_index":"2014_5","_type":"quotes","_id":"ZW5u1nCHTGW-xToRy8Yy5g","_score":1.0,
   "_source":{"time_stamp":1391653001000,"symbol":"eurusd","side":"a","price":1.3456}},
  {"_index":"2014_5","_type":"quotes","_id":"ok4FLnrfR4u2CnJ3lVNKkg","_score":1.0,
   "_source":{"time_stamp":1391653001000,"symbol":"eurusd","side":"b","price":1.3457}},
  {"_index":"2014_5","_type":"quotes","_id":"1eG5m0riSoiDEquQ3I-QSA","_score":1.0,
   "_source":{"time_stamp":1391653001100,"symbol":"eurusd","side":"b","price":1.3458}}
 ]}}

However, notice that the first entry has side a.  Running the 
following, I get nothing:

curl -XGET 'http://localhost:9200/2014_5/quotes/_search?routing=eurusd' -d '
{"query":{"filtered":{"query":{"match_all":{}},"filter":{"term":{"side":"a"}}}}}'

However, if I change side to b I get 2 hits, as I would expect.  Is there some 
reserved feature that would limit me searching for a, or is there some text 
search thing I am not thinking about?

Finally, I have added a few usdjpy quotes, which are routed to a separate 
shard.  In my query I accidentally typed usejpy and I got the two eurusd 
events, even though it honored the side filter.
Correcting the symbol, I get what I would expect.  Is this another text 
search 'thing'?  All I can think of is that in the mistyped value the e matches 
the eur in the other indexed items.
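
One plausible explanation, assuming the typo was in the routing parameter: a routing value only selects which shard is searched and does not filter documents by symbol. Shard selection can be pictured as a hash of the routing value modulo the number of primary shards. In the sketch below zlib.crc32 is only a stand-in (Elasticsearch uses its own hash function), so the shard numbers printed will not match a real cluster.

```python
import zlib


def shard_for(routing: str, num_shards: int) -> int:
    """Toy shard selection: hash(routing) modulo the number of primary shards.

    zlib.crc32 is a stand-in hash, not the one Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


NUM_SHARDS = 5  # the default shard count mentioned in this thread
for value in ("eurusd", "usdjpy", "usejpy"):
    print(value, "-> shard", shard_for(value, NUM_SHARDS))
```

Any string, including a mistyped one, hashes to some shard. If it happens to land on the shard holding the eurusd quotes and the query itself has no symbol filter, those quotes come back. Adding a filter on symbol to the query, in addition to the routing parameter, would avoid this.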

I just want to understand fully what I have going on there, thanks.










-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/24b53357-be8b-4401-95eb-3581765af41a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: design advice for market data

2014-02-03 Thread Alexander Reelsen
Hey,

so index: month, type: symbol? Might make sense. You could also use
routing with the symbol as the routing value, to ensure that you only query one
shard for one symbol in order to have faster queries. This presentation
might be interesting for you regarding data flows:

http://www.elasticsearch.org/videos/big-data-search-and-analytics/



--Alex





design advice for market data

2014-02-01 Thread Bobby Richards
Wanting to get some advice on how to go about design.  I have some currency 
market data and I get roughly 10 million events a week, currently stored in 
Postgres; it actually ends up being about 10 GB, though I would like to 
work on getting this down, obviously.  The data is seldom queried, but I have 
all of my other data in Elasticsearch, which I love.  I am trying to 
determine the best way to store this.

I would like to query by symbol and time, and index by month so I can 
drop months whenever.  I guess that would mean 'month/symbol/(unixtime for 
minute)'.

I am far from a data guy, so I am looking for direction, thoughts, etc.  Is 
this even a good use case for Elasticsearch?

Thanks,
Bobby

