Aggregation profiling?

2015-05-25 Thread Mike Sukmanowsky
I don't believe there are any current endpoints in the API that support
this, but are there plans to add better profiling information for ES
aggregation queries? We'll see some agg queries return in 11s, then <5s,
then >11s again. Sometimes we can see associated filter cache expirations,
but it's really hard to line these up to one specific query in our
production environment since multiple users are executing queries
simultaneously.

It'd be really helpful to optionally see where aggregation queries are
spending the bulk of their time to help us understand what to improve in
the future.

Anything we can do here right now?
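
In the meantime, one blunt instrument is the search slowlog: lowering its
(dynamic) thresholds at least puts the offending query bodies in the log next
to their timings. A minimal sketch with the Python client; the index name and
thresholds are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

# Lower the per-index search slowlog thresholds so expensive aggregation
# queries are logged with their full source, making it easier to line an
# 11s response up with one specific query.
es.indices.put_settings(index='my_index', body={
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "2s",
})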

-- 
Mike Sukmanowsky
Aspiring Digital Carpenter

*e*: mike.sukmanow...@gmail.com

facebook <http://facebook.com/mike.sukmanowsky> | twitter
<http://twitter.com/msukmanowsky> | LinkedIn
<http://www.linkedin.com/profile/view?id=10897143> | github
<https://github.com/msukmanowsky>



Shard query cache and filters with _cache:false

2015-05-25 Thread Mike Sukmanowsky
Is the shard query cache disabled if a filter in the
query.filtered.filter.bool.must clause has _cache: false? It makes some
intuitive sense if that's the case, but looking over the source code
<https://github.com/elastic/elasticsearch/blob/1.4/src/main/java/org/elasticsearch/indices/cache/query/IndicesQueryCache.java#L169-L200>
(we're using 1.4.3), I can't see any reason why this would happen, yet it
seems to on most of our queries (we've confirmed the cache is enabled in our
settings at 2% of heap).
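
For reference, this is the shape of query we're talking about (index and
field names hypothetical); note that in 1.4 the shard query cache should only
apply to search_type=count requests in the first place:

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

# A filtered query whose bool.must clause carries a filter with _cache: false.
body = {
    "query": {"filtered": {
        "query": {"match_all": {}},
        "filter": {"bool": {"must": [
            {"term": {"customer_id": "42", "_cache": False}}
        ]}}}},
    "aggs": {"daily": {"date_histogram": {
        "field": "@timestamp", "interval": "day"}}},
}
resp = es.search(index='my_index', body=body, search_type='count')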

-- 
Mike Sukmanowsky
Aspiring Digital Carpenter

*p*: +1 (416) 953-4248
*e*: mike.sukmanow...@gmail.com

facebook <http://facebook.com/mike.sukmanowsky> | twitter
<http://twitter.com/msukmanowsky> | LinkedIn
<http://www.linkedin.com/profile/view?id=10897143> | github
<https://github.com/msukmanowsky>



Estimating Filter Cache Needs

2015-05-24 Thread Mike Sukmanowsky
Hey all,

Is there an easy way (even if not entirely accurate) to estimate the size
of an individual filter in the filter cache if we know the approximate
number of documents the index holds? I realize it's a bit tricky, as the
filter cache is node-level, not index-level, by default.

If it were a straight non-sparse bitset, then a filter would be n bits in
size: 1B documents = 1B bits = 125MB. But I'm guessing ES uses a more
clever bitset implementation.
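
For what it's worth, the naive bound is easy to compute, and actual
node-level usage can be read from the node stats; a sketch with the Python
client:

from elasticsearch import Elasticsearch

# Naive upper bound: one bit per document.
docs = 1000000000
print(docs / 8 / 1000 ** 2)   # 125.0 MB, matching the estimate above

# Actual node-level usage, from the 1.x node stats payload.
es = Elasticsearch('localhost:9200')
for node in es.nodes.stats(metric='indices')['nodes'].values():
    print(node['name'],
          node['indices']['filter_cache']['memory_size_in_bytes'])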

Thanks!
Mike

-- 
Mike Sukmanowsky
Aspiring Digital Carpenter

*p*: +1 (416) 953-4248
*e*: mike.sukmanow...@gmail.com

facebook <http://facebook.com/mike.sukmanowsky> | twitter
<http://twitter.com/msukmanowsky> | LinkedIn
<http://www.linkedin.com/profile/view?id=10897143> | github
<https://github.com/msukmanowsky>



Timeouts ignored in multisearch?

2015-05-22 Thread Mike Sukmanowsky
Hi all,

We're doing an analysis of our slow queries (via slowlog) and noticing that 
any queries made with msearch seem to ignore the timeout parameter.  We've 
tried passing it three ways:

GET my_index/_msearch?timeout=100
{}
{big aggregation query}

and 

GET my_index/_msearch
{"timeout": 100}
{big aggregation query}

and finally:

GET my_index/_msearch?timeout=100
{"timeout": 100}
{big aggregation query}

Multisearch results always return long after 100ms with "timed_out": false,
indicating that our timeout is being ignored here. This is very problematic,
as we need certain aggregation queries to time out if they're too greedy, to
allow others to run.

Can anyone confirm if this is the case?
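
For reference, one placement not shown above (an assumption on our part, not
a confirmed fix): in _msearch the odd lines are headers and the even lines
are search bodies, so the per-search timeout arguably belongs in the body
line next to the aggregation. Sketched with the Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

body = [
    {},                                  # header line for my_index
    {"timeout": "100ms",                 # timeout inside the search body
     "aggs": {"hits_over_time": {        # placeholder aggregation
         "date_histogram": {"field": "@timestamp", "interval": "hour"}}}},
]
resp = es.msearch(index='my_index', body=body)
print(resp['responses'][0].get('timed_out'))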

Mike



Re: Shard query cache size

2015-05-22 Thread Mike Sukmanowsky
Thanks for the responses, guys. Our ES setup has hot, warm, and cold ES
nodes. Hot nodes are the only ones receiving realtime updates and have
fairly low refresh intervals for their indices, making a query cache
pretty useless for data there.

Indices on warm nodes, on the other hand, are only updated nightly, and
indices on cold nodes are similar. Assuming we do have repetitive aggregation
queries, it sounds like bumping up the query cache on the warm/cold tier
could significantly speed up our more expensive aggregations.
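
A sketch of that experiment (1.4 setting names; the index patterns are
placeholders): the shard query cache is opt-in per index via a dynamic
setting, while its size is a node-level setting in elasticsearch.yml.

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

# Enable the shard query cache on warm/cold indices only; hot indices with
# low refresh intervals would invalidate it on every refresh anyway.
es.indices.put_settings(index='warm-*,cold-*', body={
    "index.cache.query.enable": True
})
# Then raise indices.cache.query.size (default 1%) in elasticsearch.yml on
# the warm/cold nodes and restart them.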

On Thursday, 21 May 2015 19:12:44 UTC-4, Adrien Grand wrote:
>
> On Thu, May 21, 2015 at 11:49 PM, James Macdonald
> <james.m...@geofeedia.com> wrote:
>
>> Hi, I am a little confused by your response. Are you saying that 
>> query/filter caches are invalidated across all data in a shard every time 
>> the refresh interval ticks over? 
>>
>
> Sorry for the confusion:
>  - the query cache caches entire requests per index, and is competely 
> invalidated across all data every time the refresh interval ticks over AND 
> there have been changes since the last refresh
>  - the filter cache caches matching documents per segment, it is 
> invalidated per segment only when a segment goes away (typically because 
> it's been merged to a larger segment), which is unfrequent for large 
> segments
>  - the fielddata cache caches the document->value mapping per segment and 
> has the same invalidation rules as the filter cache
>  
>
>> I was under the impression that all field data and caching related 
>> operations were performed on a Lucene index segment level and that the 
>> caches would only be invalidated for a given segment if that segment had 
>> changed since the last refresh. Since most data is stored in large segments 
>> that don't take fresh writes and seldom merge this would mean that most 
>> caches are good for long periods of time; even if the shard is under 
>> constant indexing load. Am I mistaken? 
>>
>
> This is right for the fielddata and filter caches, but not for the query 
> cache.
>
> -- 
> Adrien
>  



Shard query cache size

2015-05-21 Thread Mike Sukmanowsky
Hi all,

We store Marvel-style timeseries data in Elasticsearch and make very heavy 
use of aggregations (all queries are effectively aggregations).

We've been playing around with the shard query cache and have a question.

Is there a reason the shard query cache is set to such a low share of JVM
heap by default? 1% seems awfully low, unless ES assumes most people aren't
making heavy use of aggregations. Any harm in us significantly boosting
this from 1% to, say, 15% of heap? Most of our machines have 30GB of RAM
with heap at 50% of that (15GB), so the query cache is ~150MB by default. We
think we'd like to experiment with growing that to at least 10% of heap,
putting ~1.5GB in use for this cache.
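
A quick check of that arithmetic, plus a way to watch actual usage once the
cache is enabled (a sketch; the query_cache stats key is our assumption
about the 1.4 node stats payload):

from elasticsearch import Elasticsearch

heap_bytes = 15 * 1024 ** 3
print(heap_bytes * 0.01 / 1024 ** 2)   # ~153.6 MB at the default 1%
print(heap_bytes * 0.10 / 1024 ** 2)   # ~1536 MB at 10%

es = Elasticsearch('localhost:9200')
for node in es.nodes.stats(metric='indices')['nodes'].values():
    print(node['name'], node['indices'].get('query_cache'))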

Mike



Lots of segments per index

2015-05-20 Thread mike . giardinelli
Hi All,

We have monthly indexes (currently 11 months) with 22 shards per index. I 
am seeing what seems to be a lot of segments per index (roughly 1200 to 
1380 segments). The older indexes should have very little, if any, update 
activity occurring on them. From everything I have read, it sounds like ES 
should be automatically merging segments, but now I am a bit concerned that 
this is not occurring. I know we can manually run an optimize, but we will 
need to allocate another resource to do that work (so as not to impact the 
current system). I am fairly new to ES (if that isn't obvious) and am really 
trying to understand whether we have an issue or not. 

Rough index stats: 

11 indexes
22 shards per index
65 million docs per index
350 GB per index
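
For reference, a sketch (index name hypothetical) of counting live segments
for one old index and forcing a merge with the 1.x optimize API:

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

# Count segments across all shard copies of one old monthly index.
segs = es.indices.segments(index='logs-2014-06')
print(sum(len(copy['segments'])
          for shard in segs['indices']['logs-2014-06']['shards'].values()
          for copy in shard))

# Merge the read-only index down to one segment per shard. This is I/O
# heavy, so run it off-peak.
es.indices.optimize(index='logs-2014-06', max_num_segments=1)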

Any info, suggestions, etc. are greatly appreciated. 

Thanks!



Urgent Req : Java Technical Architect Houston TX Contract

2015-04-15 Thread mike agilees
Urgent Req : Java Technical Architect  Houston TX Contract



Urgent Req : Java developer with Vendavo Experience San Ramon CA 6-12months+

2015-04-15 Thread mike agilees
HI,

This is MIKE from Agile Enterprise Solutions.

This is in reference to the following position.



* Please find below job description if you feel comfortable please
revert with updated resume, Rate and contact details ASAP *



Role:Java developer with Vendavo Experience

Location: San Ramon CA

Duration: 6 months+

Client : Tech Mahindra



Primary Skills :

Java developer with Vendavo Experience , JSP, Java, JavaScript,
Servlets, Java-J2EE, XML, Application /Web Server: Apache, WebLogic
Portal 10.3



Job Description:

We are looking for resources (Java with Vendavo Experienced) with
following skills.

• Must be strong in Java web application development
(hands-on programming)
• understanding of system hardware, application
architecture and environments
• Communication (written/verbal) and facilitation skills
with solid ability to convey technical information to a diverse
audience (business analysts, technical analysts, managers, etc.)
• Programming Languages:  JSP, Java, JavaScript, Servlets,
Java-J2EE, XML
• Application /Web Server: Apache, WebLogic Portal 10.3
• Frameworks: Struts 1 & 2
• Database: Oracle9i, 10g, 11g
• Tools: Eclipse IDE. TFS, SQL Navigator, Star UML



Thanks & Regards



Mike Michon,

Agile Enterprise Solutions Inc.,

Ensuring Client's Success

Ph: 630-315-9541

Fax: (630) 206-2397

Email: mike_mic...@aesinc.us.com

Web: www.agilees.com

Gtalk/YIM:" mikeagilees "

Note: If you have received this mail in error or prefer not to receive
such emails in the future, please reply with "REMOVE" in the subject
line and the email id(s) to be removed. All removal requests will be
honored ASAP. We sincerely apologize for any inconvenience caused.



Urgent Req : Q Radar Security Consultant San Ramon , CA

2015-04-15 Thread mike agilees
HI,

This is MIKE from Agile Enterprise Solutions.

This is in reference to the following position.



* Please find below job description if you feel comfortable please
revert with updated resume, Rate and contact details ASAP *



Role:Q Radar Security Consultant

Location: San Ramon , CA

Duration: 6 months+

Client : Tech Mahindra





Primary Skills :

Must have at least one or more SIEM and/or Security Tools deployment experience



Job Description:

•  7-8+ years of experience in Information Security
specifically related to SIEM (Security Information & Event Management)
with demonstrated deep technical knowledge and skill in SIEM,
specifically QRadar, Information Security and Networking.

•  Excellent communication / interpersonal skills (written and
verbal skills at any business level); ability to maintain timely
communication, follow schedules and meet deliverables.

•  Must have at least one or more SIEM and/or Security Tools
deployment experience (ArcSight, Splunk, EndPoint Manager, Guardium,
AppScan, Optim, Fortify, X-Force, Trusteer, Atalla or WebInspect.)

•  3+ years software development or scripting experience

•  Possess deep process knowledge for SIEM design, development
and implementation

•  Possess the capacity to learn and assimilate new information steadily

•  Perform work successfully with little supervisory oversight

•  Customer focus and a strong commitment to customer satisfaction

•  IBM Certified Deployment Professional – Security QRadar SIEM V7.1

•  Bachelor’s Degree from four-year college or university in
Information Technology, Information Security, Engineering or related
area of study; or seven or more years related work experience

•  CISSP, GISP or similar certification is a bonus

•  A background in networking and/or system administration

•  Former presales involvement and/or experience



Thanks & Regards



Mike Michon,

Agile Enterprise Solutions Inc.,

Ensuring Client's Success

Ph: 630-315-9541

Fax: (630) 206-2397

Email: mike_mic...@aesinc.us.com

Web: www.agilees.com

Gtalk/YIM:" mikeagilees "

Note: If you have received this mail in error or prefer not to receive
such emails in the future, please reply with "REMOVE" in the subject
line and the email id(s) to be removed. All removal requests will be
honored ASAP. We sincerely apologize for any inconvenience caused.





Re: elasticsearch_dsl python to create pivot tables

2015-03-31 Thread Mike
Tested it; works as expected.
Thanks again for your help.

Am Dienstag, 31. März 2015 00:04:39 UTC+2 schrieb Mike:
>
>
> Thanks Honza. 
>
> You made my day (my night rather, it is midnight here in Brussels). 
>
> I quickly tested the code and it gives the same results as the manually 
> chained expression. 
>
> I will test with various metrics tomorrow, I will then mark the question 
> as “completed”. 
>
> Thanks a lot and have a nice day. 
>
> P.S.: This opens the possibility to perform any “pivot” table in ES. The 
> challenge will be to parse the resulting JSON results (see 
> http://stackoverflow.com/questions/29280480/), but I hope to find a way. 
>
>
>
>
>
>
>
>



Re: elasticsearch_dsl python to create pivot tables

2015-03-30 Thread Mike


Thanks Honza. 

You made my day (my night rather, it is midnight here in Brussels). 

I quickly tested the code and it gives the same results as the manually 
chained expression. 

I will test with various metrics tomorrow, I will then mark the question as 
“completed”. 

Thanks a lot and have a nice day. 

P.S.: This opens the possibility to perform any “pivot” table in ES. The 
challenge will be to parse the resulting JSON results (see 
http://stackoverflow.com/questions/29280480/), but I hope to find a way. 









Re: elasticsearch_dsl python to create pivot tables

2015-03-30 Thread Mike
Thanks Honza. (also for the great work you are doing for the python 
community). 

I may have misstated my problem. 

What I am really looking for is to have a bucket, inside a bucket, inside 
a bucket, and then metrics. 

the following expression does this 

s1.aggs.bucket('xColor', a).bucket('xMake', b).bucket('xCity', 
c).metric('xMyPriceSum', 'sum', field = 'price').metric('xMyPriceAvg', 
'avg', field = 'price')


The problem is that it has to be written manually (at least I haven't found 
a way to do this automatically). 

I tried the second approach you suggest and it gives me a different result: 
buckets 1, 2, and 3 and their metrics, but not nested one inside another. 

I hope my question makes sense; otherwise I am happy to provide a more 
complete example. 

Best regards, 
Mike 


On Monday, March 30, 2015 at 11:06:28 PM UTC+2, Honza Král wrote:
>
> Hello,
>
> you can access buckets already created using ['name'] syntax, in your case 
> you can do (instead of the chaining):
>
> s.aggs['xColor']['xMake']['xCity'].metric(...)
> s.aggs['xColor']['xMake']['xCity'].metric(...)
>
> This way you can add aggregations to already created buckets.
>
> Also you can just use an approach where you keep the pointer to the 
> inner-most bucket (start with s.aggs) and go from there in your case (bunch 
> of nested buckets and then metrics inside):
>
> b = s.aggs
> for bucket in xVarBuckets:
> b = s.aggs.bucket(bucket['label'], 'terms', field=bucket['field'])
>
> for metric in xVarMetrics:
> b.metric(metric['label'], metric['agg_function'], 
> field=metric['field'])
>
>
> Hope this helps,
>
> On Mon, Mar 30, 2015 at 10:55 PM, Mike wrote:
>
>> the python elasticsearch , elasticsearch dsl packages are life-saver and 
>> got me converted to ES. 
>>
>> Now I am trying to use elasticsearch dsl package to create pivot tables 
>> in ES  but am having hard time figuring out how to chain the buckets 
>> programmatically. 
>> while chaining the buckets / metrics manually works,  to chain them 
>> programmatically seems impossible 
>>
>> here is an example 
>>
>>
>> from elasticsearch import Elasticsearch
>> from elasticsearch_dsl import Search as dsl_search, A, Q, F 
>> # create client 
>> es = Elasticsearch('localhost:9200')
>> # data : from the definitive guide, slighlty modified 
>> #data from the definitive guide 
>> xData = [
>> {'doc_id' : 1, 'price' : 1, 'color' : 'red',   'make' : 'honda',  
>> 'sold' : '2014-10-28', 'city': 'ROME',   'insurance': 'y'},
>> {'doc_id' : 2, 'price' : 2, 'color' : 'red',   'make' : 'honda',  
>> 'sold' : '2014-11-05', 'city': 'ROME',   'insurance': 'n'},
>> {'doc_id' : 3, 'price' : 3, 'color' : 'green', 'make' : 'ford',   
>> 'sold' : '2014-05-18', 'city': 'Berlin', 'insurance': 'y'},
>> {'doc_id' : 4, 'price' : 15000, 'color' : 'blue',  'make' : 'toyota', 
>> 'sold' : '2014-07-02', 'city': 'Berlin', 'insurance': 'n'},
>> {'doc_id' : 5, 'price' : 12000, 'color' : 'green', 'make' : 'toyota', 
>> 'sold' : '2014-08-19', 'city': 'Berlin', 'insurance': 'n'},
>> {'doc_id' : 6, 'price' : 2, 'color' : 'red',   'make' : 'honda',  
>> 'sold' : '2014-11-05', 'city': 'Paris',  'insurance': 'n'},
>> {'doc_id' : 7, 'price' : 8, 'color' : 'red',   'make' : 'bmw',
>> 'sold' : '2014-01-01', 'city': 'Paris',  'insurance': 'y'},
>> {'doc_id' : 8, 'price' : 25000, 'color' : 'blue',  'make' : 'ford',   
>> 'sold' : '2014-02-12', 'city': 'Paris',  'insurance': 'y'}]
>>
>>

elasticsearch_dsl python to create pivot tables

2015-03-30 Thread Mike
The Python elasticsearch and elasticsearch_dsl packages are life-savers and 
got me converted to ES. 

Now I am trying to use the elasticsearch_dsl package to create pivot tables 
in ES, but I'm having a hard time figuring out how to chain the buckets 
programmatically. While chaining the buckets/metrics manually works, 
chaining them programmatically seems impossible. 

Here is an example:


from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search as dsl_search, A, Q, F 
# create client 
es = Elasticsearch('localhost:9200')
# data: from the Definitive Guide, slightly modified
xData = [
{'doc_id' : 1, 'price' : 1, 'color' : 'red',   'make' : 'honda',  
'sold' : '2014-10-28', 'city': 'ROME',   'insurance': 'y'},
{'doc_id' : 2, 'price' : 2, 'color' : 'red',   'make' : 'honda',  
'sold' : '2014-11-05', 'city': 'ROME',   'insurance': 'n'},
{'doc_id' : 3, 'price' : 3, 'color' : 'green', 'make' : 'ford',   
'sold' : '2014-05-18', 'city': 'Berlin', 'insurance': 'y'},
{'doc_id' : 4, 'price' : 15000, 'color' : 'blue',  'make' : 'toyota', 
'sold' : '2014-07-02', 'city': 'Berlin', 'insurance': 'n'},
{'doc_id' : 5, 'price' : 12000, 'color' : 'green', 'make' : 'toyota', 
'sold' : '2014-08-19', 'city': 'Berlin', 'insurance': 'n'},
{'doc_id' : 6, 'price' : 2, 'color' : 'red',   'make' : 'honda',  
'sold' : '2014-11-05', 'city': 'Paris',  'insurance': 'n'},
{'doc_id' : 7, 'price' : 8, 'color' : 'red',   'make' : 'bmw',
'sold' : '2014-01-01', 'city': 'Paris',  'insurance': 'y'},
{'doc_id' : 8, 'price' : 25000, 'color' : 'blue',  'make' : 'ford',   
'sold' : '2014-02-12', 'city': 'Paris',  'insurance': 'y'}]

# create a mapping
my_mapping = {
    'my_example': {
        'properties': {
            'doc_id': {'type': 'integer'},
            'price': {'type': 'integer'},
            'color': {'type': 'string', 'index': 'not_analyzed'},
            'make': {'type': 'string', 'index': 'not_analyzed'},
            'city': {'type': 'string', 'index': 'not_analyzed'},
            'insurance': {'type': 'string', 'index': 'not_analyzed'},
            'sold': {'type': 'date'}
        }}}


# create an index and add the mapping (note: the index name must match the
# one we index into and search below)
if es.indices.exists('my_index'):
    es.indices.delete(index='my_index')
es.indices.create('my_index')

# mapping for the document type
if es.indices.exists_type(index='my_index', doc_type='my_example'):
    es.indices.delete_mapping(index='my_index', doc_type='my_example')
es.indices.put_mapping(index='my_index', doc_type='my_example', body=my_mapping)

# indexing
for xRow in xData:
    es.index(index='my_index',
             doc_type='my_example',
             id=xRow['doc_id'],
             body=xRow)


### MANUALLY CHAINING WORKS 

a = A('terms', field = 'color')
b = A('terms', field = 'make')
c = A('terms', field = 'city')

s1 = dsl_search(es, index='my_index', doc_type='my_example')
s1.aggs.bucket('xColor', a).bucket('xMake', b).bucket('xCity', c)\
    .metric('xMyPriceSum', 'sum', field='price')\
    .metric('xMyPriceAvg', 'avg', field='price')
resp = s1.execute()
# get results
q1 = resp.aggregations
q1



### but not PROGRAMMATICALLY
# Programmatically chaining

xVarBuckets = [{'field': 'color', 'label': 'xColor'},
               {'field': 'make',  'label': 'xMake'},
               {'field': 'city',  'label': 'xCity'}]

xVar_Metrics = [{'field': 'price', 'agg_function': 'sum', 'label': 'xMyPriceSum'},
                {'field': 'price', 'agg_function': 'avg', 'label': 'xMyPriceAvg'}]


s2 = dsl_search(es, index='my_index', doc_type='my_example')

# add buckets
for xBucketVar in xVarBuckets:
    xAgg = A('terms', field=xBucketVar['field'])
    s2.aggs.bucket(xBucketVar['label'], xAgg)
resp2 = s2.execute()
# get results
q2 = resp2.aggregations


I guess it has to do with the fact that each newly created bucket is attached 
at the top level rather than chained to the previous one, but how can I 
append the new bucket to the previous one? (A sketch of the fix follows.)
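
For reference, a sketch of the pointer approach that does nest them: keep a
handle to the innermost bucket and chain each new bucket off it, instead of
attaching every bucket to the root aggs.

s3 = dsl_search(es, index='my_index', doc_type='my_example')  # fresh Search

# chain each bucket off the previous one, not off s3.aggs
b = s3.aggs
for xBucketVar in xVarBuckets:
    b = b.bucket(xBucketVar['label'], 'terms', field=xBucketVar['field'])

# metrics go on the innermost bucket
for xMetricVar in xVar_Metrics:
    b.metric(xMetricVar['label'], xMetricVar['agg_function'],
             field=xMetricVar['field'])

resp3 = s3.execute()
q3 = resp3.aggregations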


Any help appreciated 








Re: Sorting and range filtering semantic versions

2015-03-17 Thread Mike Turley
Did you ever find a good solution for this?  I am trying to solve the same 
problem (just sorting, not range filtering).
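
For what it's worth, the padded-field workaround Eric mentions below can be
as simple as this (hypothetical helper):

def pad_version(v, width=4):
    """Zero-pad each dotted component so strings sort like version numbers."""
    return '.'.join(part.zfill(width) for part in v.split('.'))

assert "1.10" < "1.2"                            # raw strings sort wrong
assert pad_version("1.10") > pad_version("1.2")  # padded strings sort right
assert pad_version("1.10.2.5") == "0001.0010.0002.0005"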

On Monday, January 26, 2015 at 2:47:30 AM UTC-5, Eric Smith wrote:
>
> I am trying to figure out some sort of indexing scheme where I can do 
> range filters on semantic versions. Values look 
> like these:
>
> "1.0.2.5", "1.10.2.5", "2.3.434.1"
>
> I know that I can add a separate field with the numbers padded out, but I 
> was hoping to have a single field where I could do things like this:
>
> "version:>1.0" "version:1.0.2.5" "version:1.0" "version:[1.0 TO 2.0]"
>
> I have created some pattern capture filters to allow querying partial 
> version numbers. I even created some pattern replacement filters to pad the 
> values out so that they could be lexicographically sorted, but those 
> filters only control the tokens that are indexed and not the value that is 
> used for sorting and range filters.
>
> Is there a way to customize the value that is used for sorting and range 
> filters?  It seems like it just uses the original value and I don't have 
> any control of it?
>
> Any help would be greatly appreciated!
>



Optimizing filter bitsets

2015-01-26 Thread Mike Sukmanowsky
We're storing Kibana-style time series documents across three indexes on a
10 node cluster (i2.xlarges). These indexes have between 20M-500M docs at
peak and we use bool filters extensively while querying.  Query volumes are
pretty low (maybe around 100 searches/sec at peak) versus index ops
(4K/sec).

Recently, I've been noticing a lot of churn in our filter cache and I'm
wondering if our bitsets are optimized or maybe if we're just hitting
memory limits because of too many documents.



I understand that the result of the bool is the bitset that's cached, as
opposed to the individual term filters themselves. This had me concerned
that for certain complex bool filters (where we have >10 or so term filters
inside a "must" clause), we were creating bitsets with far too narrow an
application (basically the one query they were used for).

If we have certain terms (say customer ID) which update fairly
infrequently (only with new docs) and others that update fairly frequently
(say time-based fields), is there a way to optimize our bool queries to
create reusable bitsets for the infrequent term filters while also having
the benefit of caching the result of the entire bool filter?

Is it as simple as adding _cache: true to the terms filters that are fairly
static?
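
For concreteness, a sketch of the split we have in mind (field names
hypothetical): cache the stable term filters as their own bool, and leave
the volatile time range uncached so it doesn't churn the cache.

# Filter portion of the query; _cache placement follows the 1.x filter DSL.
reusable_filter = {
    "bool": {
        "must": [
            {"bool": {                            # stable, reusable bitset
                "must": [{"term": {"customer_id": "42"}},
                         {"term": {"event_type": "pageview"}}],
                "_cache": True}},
            {"range": {"@timestamp": {"gte": "now-1h"},
                       "_cache": False}},         # volatile, never cache
        ]
    }
}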

Anything else we can look at to help understand how to optimize our filter
cache?

Mike

-- 
Mike Sukmanowsky
Aspiring Digital Carpenter

*e*: mike.sukmanow...@gmail.com

facebook <http://facebook.com/mike.sukmanowsky> | twitter
<http://twitter.com/msukmanowsky> | LinkedIn
<http://www.linkedin.com/profile/view?id=10897143> | github
<https://github.com/msukmanowsky>



Re: Is there a preferred config for Index / Shard configuration? Lots of indexes with lots of shards or fewer indexes and bigger shards?

2015-01-05 Thread mike . giardinelli
Hi Mark,

Thanks for the reply! We have roughly 13 TB of data and about 40 indexes (1 
index per week).  For each index we have 22 shards (one for every data 
node). 
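
For context, quick arithmetic on those numbers:

# 13 TB across 40 weekly indexes, 22 shards each (figures from above)
total_gb = 13 * 1024.0
print(total_gb / 40)         # ~333 GB per index
print(total_gb / (40 * 22))  # ~15 GB per shard today; monthly indexes at 22
                             # shards would land around 65 GB per shard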

On Monday, January 5, 2015 2:27:24 PM UTC-8, Mark Walkom wrote:
>
> One shard per node is ideal as you spread the load.
> Reducing the shard count can help but it depends on a few things.
>
> How much data do you have in your cluster, how many indexes?
>
> On 6 January 2015 at 08:51, > wrote:
>
>> Hi All,
>>
>> We have started noticing in our environment that our query performance is 
>> starting to suffer for some of our datasets that span the roughly 1 year of 
>> data we keep online.  We are looking into optimizations we can make to our 
>> Index / Shard configuration and I was wondering if there is a preferable 
>> way to configure our indexes / shards? Right now we create a new index for 
>> each week and have 22 shards per index (We have 22 data nodes).  Would it 
>> be more optimal to reduce the number of indexes (index by month) and have 
>> larger shards? Our documents are kb in size so they are not all that big, 
>> we just have a lot of them. 
>>
>> The feedback we typically get back from support is just test and see.  
>> That is something we can do, but there is a fair amount of effort / time 
>> that we would need to put in to only find that it doesn't give us any 
>> benefit.  I was just hoping some of the more experienced folks could 
>> provide some input on possible solutions.  If all else fails, we can always 
>> try to test different configs. 
>>
>> Thanks!
>>
>>
>
>



Is there a preferred config for Index / Shard configuration? Lots of indexes with lots of shards or fewer indexes and bigger shards?

2015-01-05 Thread mike . giardinelli
Hi All,

We have started noticing in our environment that our query performance is 
starting to suffer for some of our datasets that span the roughly 1 year of 
data we keep online.  We are looking into optimizations we can make to our 
Index / Shard configuration and I was wondering if there is a preferable 
way to configure our indexes / shards? Right now we create a new index for 
each week and have 22 shards per index (We have 22 data nodes).  Would it 
be more optimal to reduce the number of indexes (index by month) and have 
larger shards? Our documents are KBs in size, so they are not all that big; 
we just have a lot of them. 

The feedback we typically get back from support is just test and see.  That 
is something we can do, but there is a fair amount of effort / time that we 
would need to put in to only find that it doesn't give us any benefit.  I 
was just hoping some of the more experienced folks could provide some input 
on possible solutions.  If all else fails, we can always try to test 
different configs. 

Thanks!




mass update of records - dns resolution

2015-01-01 Thread Mike Sheinberg
Howdy,

I'm not sure of the best way to tackle an issue I'm having with DNS 
resolution of IP addresses in my ES documents. 

For background, I'm using logstash as a netflow collector --> ES. I was 
previously using the dns filter of logstash to reverse-lookup IP fields in 
realtime, but that caused performance issues and it seems like records were 
being lost. So my question is: is it more efficient for me to continue 
trying to tackle this in logstash (before records are placed into ES), or 
would it make more sense to do something after the record is in ES? 
I don't have an issue with the DNS resolution being delayed, so I 
imagine going through the previous hour's records, every hour, to batch 
update them.
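
For reference, a sketch of the second option (after the record is in ES),
with hypothetical index and field names: scan the previous hour's documents
and bulk update them with the reverse-DNS name.

import socket

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk, scan

es = Elasticsearch('localhost:9200')

def resolve(ip):
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

def actions():
    query = {"query": {"range": {"@timestamp": {"gte": "now-1h"}}}}
    for hit in scan(es, query=query, index='netflow-*'):
        name = resolve(hit['_source'].get('src_ip', ''))
        if name:
            yield {"_op_type": "update", "_index": hit['_index'],
                   "_type": hit['_type'], "_id": hit['_id'],
                   "doc": {"src_host": name}}

bulk(es, actions())  # run hourly, e.g. from cron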

If someone could point me in the right direction I'd greatly appreciate it. 
I'm very new to the whole ELK stack so apologies if I'm missing something 
obvious.

Thanks in advance.

--Mike



Marvel / ES query document count major discrepancy

2014-11-20 Thread Mike Seid
Howdy,

I have been hitting my ES cluster pretty hard recently and I think it is 
holding up great. In the last few days, I have noticed a major discrepancy 
between the document count that Marvel shows and that of a _count query 
against the actual ES cluster. Marvel is reporting about 43.9M documents 
while the ES query shows 8.7M. Where would this discrepancy come from? I 
would suspect it is a monitoring error on Marvel's part, but I'm not sure. 
Any ideas?

Marvel Screenshot:
https://www.dropbox.com/s/1y39wui96fpjc14/Screenshot%202014-11-20%2009.57.42.png?dl=0

ES Query:
http://x/pa-2014-11-19/_count
{
   "count": 8781919,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   }
}


Thanks,

Mike



Re: Jagged Index Request Rate

2014-11-13 Thread Mike Seid
Thanks! This makes a lot of sense and it is pretty reaffirming to know that 
data won't be lost without an error. 

I'm guessing you're right about the jagged index rate, as the graph is pretty 
consistent across different levels of load. I'll definitely be looking into 
bulk as load grows.
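
For later reference, the bulk path looks roughly like this (sketched with the
Python client for brevity; the Java BulkProcessor is the JVM equivalent, and
all names here are placeholders):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch('localhost:9200')

# Send documents in chunks instead of one index request per document.
docs = ({"_index": "events", "_type": "event", "_source": {"n": i}}
        for i in range(10000))
bulk(es, docs, chunk_size=1000)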

Cheers,

Mike 

On Thursday, November 13, 2014 10:40:48 AM UTC-8, Jörg Prante wrote:
>
> "10 index a second" is not too much.
>
> Note, the jagged rate may also result from a skew in the measurement 
> intervals, so it may be artificial and no need to worry too much.
>
> Bulk is definitely the way to go. With bulk mode, you can index around 
> 10k-20k docs per second (avg size of 1k) from a single machine to another 
> single machine, using concurrency.
>
> You cannot lose data with correct coding; ES will automatically reject 
> requests with exceptions if it is too heavily loaded. In that case, close 
> the client immediately, and restart the indexing from the beginning with a 
> more modest configuration.
>
> Jörg
>
> On Thu, Nov 13, 2014 at 6:55 PM, Mike Seid wrote:
>
>> I'm using ES 1.3 to ingest information from multiple ( 25 ) servers. Our 
>> servers are indexing a single document at a time using the Java API, 
>>  totally about 100-200 documents per second depending on the time. In 
>> Marvel, I see the index requests as a very jagged graph ( 
>> https://www.dropbox.com/s/6snw2wxbv264c5w/Screenshot%202014-11-13%2009.42.47.png?dl=0
>>  
>> ) , and I'm trying to isolate the bottleneck. I'm pretty confident that 
>> data is coming in a more smooth line, so I'm not sure if there is an 
>> indexing bottleneck in my servers, or in my ES cluster. On each of the ES 
>> nodes, there isn't more than a 0.1 Load and about 20IOPS, so I don't think 
>> there is a bottleneck on the ES cluster, but I can't figure out where it 
>> would be on the client either. 
>>
>> Is 10 index a second too much for a single ElasticSearch Java Client? 
>> Should I consider switching to bulk inserts? I would image the java client 
>> could do much more throughput than that. Just want to make sure that i'm 
>> not losing any data anywhere.
>>
>> Thanks for the help!
>>
>
>



Jagged Index Request Rate

2014-11-13 Thread Mike Seid
I'm using ES 1.3 to ingest information from multiple (25) servers. Our 
servers are indexing a single document at a time using the Java API, 
totaling about 100-200 documents per second depending on the time. In 
Marvel, I see the index requests as a very jagged graph 
( 
https://www.dropbox.com/s/6snw2wxbv264c5w/Screenshot%202014-11-13%2009.42.47.png?dl=0
 
), and I'm trying to isolate the bottleneck. I'm pretty confident that 
data is coming in as a much smoother line, so I'm not sure if there is an 
indexing bottleneck in my servers or in my ES cluster. On each of the ES 
nodes there isn't more than a 0.1 load and about 20 IOPS, so I don't think 
there is a bottleneck on the ES cluster, but I can't figure out where it 
would be on the client either. 

Are 10 index requests a second too much for a single Elasticsearch Java 
client? Should I consider switching to bulk inserts? I would imagine the 
Java client could handle much more throughput than that. I just want to make 
sure that I'm not losing any data anywhere.

Thanks for the help!



Re: search query / analyzer issue dealing with spaces

2014-10-30 Thread Mike Maddox
Jarod,

The format of your analyzer is wrong. Note that you have to set the filter 
property. Use this:

{
  "settings": {
"analysis": {
  "analyzer": {
"my_analyzer": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": "lowercase"
}
  }
}
  },
  "mappings": {
"car": {
  "properties": {
"color": {
  "type": "string",
  "analyzer": "my_analyzer"
}
  }
}
  }
}

On Thursday, October 30, 2014 12:52:45 PM UTC-7, Jarrod C wrote:
>
> Thanks Mike, it appears referencing 'episode' instead of 'car' from a 
> previous example was the problem. That seems to have progressed me further; 
> however, my queries are still case-sensitive despite lowercase being true. 
> Allow me to repost what I have for clarity. Thanks
>
> // mapping
> curl -XPUT 'http://localhost:9200/myindex/' -d '{
>   "settings": {
> "analysis": {
>   "analyzer": {
> "my_analyzer": {
>   "type": "custom",
>   "tokenizer": "keyword",
>   "lowercase": true
> }
>   }
> }
>   },
>   "mappings": {
> "car": {
>   "_source": {
> "enabled": false
>   },
>   "properties": {
> "color": {
>   "type": "string",
>   "analyzer": "my_analyzer"
>   }
> }
>   }
> }
> }'
>
> //query matches 'Metallic RED' but not 'Metallic Red'
> GET /myindex/car/_search
> {
>"query": {
>"match": {
>   "color": "Metallic RED"
>}   
>}
> }
>  
>
> On Thursday, October 30, 2014 2:41:10 PM UTC-4, Mike Maddox wrote:
>>
>> Jarrod,
>>
>> I understand that you think the analyzer is not the problem. However, the 
>> original mapping wasn't correctly formatted so the color type was being 
>> analyzed using the default analyzer which would also cause the query to use 
>> the default analyzer as well. If you fix the syntax and then change color 
>> to use your analyzer it will work. One note, your mapping is also incorrect 
>> in that it references the "episode" type when you're actually adding data 
>> to the "car" type. Using your analyzer, it would be indexed as one lower 
>> case string. Now, your query does make a difference but if you have the 
>> analyzer set correctly, it will analyze the input string using the same 
>> analyzer that you set in the mapping. You would be better off just doing a 
>> term query or filter.
>>
>> Mike
>>
>>
>> On Thursday, October 30, 2014 8:06:36 AM UTC-7, Jarrod C wrote:
>>>
>>> Thanks for the replies.  Unfortunately the analyzer portion is not the 
>>> problem (I pasted the original text in the midst of experimentation).  When 
>>> I had "analyzer" : "my_analyzer" in the mapping it didn't make a 
>>> difference.  I get results from the analysis query below so I assume it was 
>>> configured properly:
>>> GET /myindex/_analyze?analyzer=my_analyzer
>>>
>>> However, it does not seem to make a difference between using my custom 
>>> "my_analyzer" or using "keyword", or even using "index" : "not_analyzed". 
>>>  In each case, if I search for "red" I get back all results when in fact I 
>>> only want 1.
>>>
>>> Perhaps my query is the problem?
>>>
>>> On Wednesday, October 29, 2014 8:17:40 PM UTC-4, Mike Maddox wrote:
>>>>
>>>> Actually, change it to "index": "not_analyzed" as shown in the JSON.
>>>>
>>>> On Wednesday, October 29, 2014 5:13:46 PM UTC-7, Mike Maddox wrote:
>>>>>
>>>>> Actually, there are two problems here. Change the analyzer to the name 
>>>>> of your custom analyzer and you are missing a curly brace to close out 
>>>>> the 
>>>>> "settings" property. Not sure why it doesn't cause an error but it 
>>>>> definitely doesn't create a mapping. You can check if there is a mapping 
>>>>> by 
>>>>> looking at: http://localhost:9200/myind

Re: search query / analyzer issue dealing with spaces

2014-10-30 Thread Mike Maddox
Jarrod,

I understand that you think the analyzer is not the problem. However, the 
original mapping wasn't correctly formatted so the color type was being 
analyzed using the default analyzer which would also cause the query to use 
the default analyzer as well. If you fix the syntax and then change color 
to use your analyzer it will work. One note, your mapping is also incorrect 
in that it references the "episode" type when you're actually adding data 
to the "car" type. Using your analyzer, it would be indexed as one lower 
case string. Now, your query does make a difference but if you have the 
analyzer set correctly, it will analyze the input string using the same 
analyzer that you set in the mapping. You would be better off just doing a 
term query or filter.
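
Once the analyzer is in place, the exact-match lookup can be a simple
filtered term query (a sketch; note a term filter does not analyze its
input, so the caller must lowercase the value to match the keyword+lowercase
token):

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

# Matches docs 1 and 2 ("metallic red", "Metallic RED") but not "rEd".
resp = es.search(index='myindex', body={
    "query": {"filtered": {
        "filter": {"term": {"color": "metallic red"}}}}})
print(resp['hits']['total'])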

Mike


On Thursday, October 30, 2014 8:06:36 AM UTC-7, Jarrod C wrote:
>
> Thanks for the replies.  Unfortunately the analyzer portion is not the 
> problem (I pasted the original text in the midst of experimentation).  When 
> I had "analyzer" : "my_analyzer" in the mapping it didn't make a 
> difference.  I get results from the analysis query below so I assume it was 
> configured properly:
> GET /myindex/_analyze?analyzer=my_analyzer
>
> However, it does not seem to make a difference between using my custom 
> "my_analyzer" or using "keyword", or even using "index" : "not_analyzed". 
>  In each case, if I search for "red" I get back all results when in fact I 
> only want 1.
>
> Perhaps my query is the problem?
>
> On Wednesday, October 29, 2014 8:17:40 PM UTC-4, Mike Maddox wrote:
>>
>> Actually, change it to "index": "not_analyzed" as shown in the JSON.
>>
>> On Wednesday, October 29, 2014 5:13:46 PM UTC-7, Mike Maddox wrote:
>>>
>>> Actually, there are two problems here. Change the analyzer to the name 
>>> of your custom analyzer and you are missing a curly brace to close out the 
>>> "settings" property. Not sure why it doesn't cause an error but it 
>>> definitely doesn't create a mapping. You can check if there is a mapping by 
>>> looking at: http://localhost:9200/myindex/_mapping
>>>
>>> Here is how it should be:
>>>
>>>
>>> {
>>>   "settings": {
>>> "analysis": {
>>>   "analyzer": {
>>> "my_analyzer": {
>>>   "type": "custom",
>>>   "tokenizer": "keyword",
>>>   "lowercase": true
>>> }
>>>   }
>>> }
>>>   },
>>>   "mappings": {
>>> "episode": {
>>>   "_source": {
>>> "enabled": false
>>>   },
>>>   "properties": {
>>> "color": {
>>>   "type": "string",
>>>   "index": "not_analyzed"
>>> }
>>>   }
>>> }
>>>   }
>>> }
>>>
>>> On Wednesday, October 29, 2014 2:38:36 PM UTC-7, Jarrod C wrote:
>>>>
>>>> Hello, I am trying to run a query that distinguishes between spaces in 
>>>> values.  Let's say I have a field called 'color' in my index.  Record 1 
>>>> has 
>>>> "color" : "metallic red" whereas Record 2 has "color": "metallic" 
>>>>
>>>> I want to search for 'metallic' but NOT retrieve 'metallic red', and a 
>>>> search for 'metallic red' should not return 'red'.  
>>>>
>>>> The query below works for 'metallic red' but entering 'red' returns 
>>>> both records.  The query also appears to be bypassing Analyzers specified 
>>>> in the mappings (such as keyword) as they have no effect.  What should I 
>>>> change it to instead?
>>>>
>>>> //Query
>>>> GET /myindex/_search
>>>> {
>>>>"query": {
>>>>"match_phrase": {
>>>>   "color": "metallic red"
>>>>}   
>>>>}
>>>> }
>>>>
>>>> //Data
>>>> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "1" } }
>>>> { "color" : "metallic red" }

Re: search query / analyzer issue dealing with spaces

2014-10-29 Thread Mike Maddox
Actually, change it to "index": "not_analyzed" as shown in the JSON.

On Wednesday, October 29, 2014 5:13:46 PM UTC-7, Mike Maddox wrote:
>
> Actually, there are two problems here. Change the analyzer to the name of 
> your custom analyzer and you are missing a curly brace to close out the 
> "settings" property. Not sure why it doesn't cause an error but it 
> definitely doesn't create a mapping. You can check if there is a mapping by 
> looking at: http://localhost:9200/myindex/_mapping
>
> Here is how it should be:
>
>
> {
>   "settings": {
> "analysis": {
>   "analyzer": {
> "my_analyzer": {
>   "type": "custom",
>   "tokenizer": "keyword",
>   "lowercase": true
> }
>   }
> }
>   },
>   "mappings": {
> "episode": {
>   "_source": {
> "enabled": false
>   },
>   "properties": {
> "color": {
>   "type": "string",
>   "index": "not_analyzed"
> }
>   }
> }
>   }
> }
>
> On Wednesday, October 29, 2014 2:38:36 PM UTC-7, Jarrod C wrote:
>>
>> Hello, I am trying to run a query that distinguishes between spaces in 
>> values.  Let's say I have a field called 'color' in my index.  Record 1 has 
>> "color" : "metallic red" whereas Record 2 has "color": "metallic" 
>>
>> I want to search for 'metallic' but NOT retrieve 'metallic red', and a 
>> search for 'metallic red' should not return 'red'.  
>>
>> The query below works for 'metallic red' but entering 'red' returns both 
>> records.  The query also appears to be bypassing Analyzers specified in the 
>> mappings (such as keyword) as they have no effect.  What should I change it 
>> to instead?
>>
>> //Query
>> GET /myindex/_search
>> {
>>"query": {
>>"match_phrase": {
>>   "color": "metallic red"
>>}   
>>}
>> }
>>
>> //Data
>> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "1" } }
>> { "color" : "metallic red" }
>> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "2" } }
>> { "color" : "Metallic RED"}
>> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "3" } }
>> { "color" : "rEd" }
>>
>> //Mapping (no effect for query)
>> curl -XPUT 'http://localhost:9200/myindex/' -d '{
>> "settings" : {
>>   "analysis": {
>> "analyzer": {
>>   "my_analyzer":{
>> "type": "custom",
>> "tokenizer" : "keyword",
>> "lowercase" : true
>> }
>> }
>> },
>> "mappings" : {
>> "episode" : {
>> "_source" : { "enabled" : false },
>> "properties" : {
>> "color" : { "type" : "string", "analyzer" : 
>> "not_analyzed" }
>> }
>> }
>> }
>> }
>> }'
>>
>>
>> Thanks!
>>
>



Re: search query / analyzer issue dealing with spaces

2014-10-29 Thread Mike Maddox
Actually, there are two problems here. Change the analyzer to the name of 
your custom analyzer and you are missing a curly brace to close out the 
"settings" property. Not sure why it doesn't cause an error but it 
definitely doesn't create a mapping. You can check if there is a mapping by 
looking at: http://localhost:9200/myindex/_mapping

Here is how it should be:


{
  "settings": {
"analysis": {
  "analyzer": {
"my_analyzer": {
  "type": "custom",
  "tokenizer": "keyword",
  "lowercase": true
}
  }
}
  },
  "mappings": {
"episode": {
  "_source": {
"enabled": false
  },
  "properties": {
"color": {
  "type": "string",
  "index": "not_analyzed"
}
  }
}
  }
}

On Wednesday, October 29, 2014 2:38:36 PM UTC-7, Jarrod C wrote:
>
> Hello, I am trying to run a query that distinguishes between spaces in 
> values.  Let's say I have a field called 'color' in my index.  Record 1 has 
> "color" : "metallic red" whereas Record 2 has "color": "metallic" 
>
> I want to search for 'metallic' but NOT retrieve 'metallic red', and a 
> search for 'metallic red' should not return 'red'.  
>
> The query below works for 'metallic red' but entering 'red' returns both 
> records.  The query also appears to be bypassing Analyzers specified in the 
> mappings (such as keyword) as they have no effect.  What should I change it 
> to instead?
>
> //Query
> GET /myindex/_search
> {
>"query": {
>"match_phrase": {
>   "color": "metallic red"
>}   
>}
> }
>
> //Data
> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "1" } }
> { "color" : "metallic red" }
> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "2" } }
> { "color" : "Metallic RED"}
> { "index" : { "_index" : "myindex", "_type" : "car", "_id" : "3" } }
> { "color" : "rEd" }
>
> //Mapping (no effect for query)
> curl -XPUT 'http://localhost:9200/myindex/' -d '{
> "settings" : {
>   "analysis": {
> "analyzer": {
>   "my_analyzer":{
> "type": "custom",
> "tokenizer" : "keyword",
> "lowercase" : true
> }
> }
> },
> "mappings" : {
> "episode" : {
> "_source" : { "enabled" : false },
> "properties" : {
> "color" : { "type" : "string", "analyzer" : "not_analyzed" 
> }
> }
> }
> }
> }
> }'
>
>
> Thanks!
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/99dfc5ad-5efe-409b-a54c-5bde5ad7685b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: no clustering on ec2 due to ClassNotFoundException[org.elasticsearch.gateway.blobstore.BlobStoreGatewayModule]

2014-10-29 Thread Mike Ressler
Elasticsearch gurus,

Just getting started with elasticsearch and ran across the same gateway 
issue brought up in this thread.  Seems like the documentation really ought 
to be updated.  It's several major versions old at this point.

http://www.elasticsearch.org/tutorials/elasticsearch-on-ec2/

That's the documentation I've been using as I get started.
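
For anyone else following that tutorial: per David's advice quoted below, the
fix is simply to drop the gateway lines. A sketch of the trimmed config from
this thread (credentials elided as in the original):

cluster.name: elasticsearch-demo-js
cloud.aws.access_key: xxx..
cloud.aws.secret_key: yyy...
cloud.aws.discovery.type: ec2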

On Friday, July 25, 2014 2:47:57 PM UTC-4, David Pilato wrote:
>
> Automatic master node election.
>
> HTH
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
>
> On 25 July 2014, at 20:07, Anthony Oleary wrote:
>
> hi David,
> May I ask: if I want to use OpsWorks, copy my EC2 image onto OpsWorks, and 
> auto-scale, how can I specify a master node, or is it automatic?
>
> anthony 
>
> On Thursday, July 24, 2014 5:36:51 PM UTC+1, David Pilato wrote:
>>
>> It's a matter of EC2 configuration. Elasticsearch does not really care to 
>> know.
>>
>> I'd probably start with local disks if possible. Replication is done by 
>> elasticsearch. So if you have more than one node, your data could be 
>> replicated on another machine.
>>
>> -- 
>> *David Pilato* | *Technical Advocate* | *Elasticsearch.com*
>> @dadoonet | @elasticsearchfr
>>
>>
>> On 24 July 2014 at 17:23:08, Anthony Oleary (anthony...@kweekweek.com) wrote:
>>
>> Thanks again David,
>> One last question: how would I tell Elasticsearch to use the EBS volume? Or
>> do I just attach the EBS volume when I create the EC2 instance, so that it
>> works without telling Elasticsearch anything?
>>
>> Would you recommend EBS so that the data is not lost?
>>
>>
>>
>> On Thursday, July 24, 2014 3:48:53 PM UTC+1, David Pilato wrote: 
>>>
>>>  Just use local disk or EBS with provisioned IOs.
>>>  You don't need to store your indices on S3. If you want to do that for 
>>> backup purpose, have a look at snapshot and restore API.
>>>  
>>>  Basically, in elasticsearch.yml file, remove:
>>>  
>>>  gateway.type: s3
>>> gateway.s3.bucket: codetest
>>>
>>> Gateway has been removed since 1.2.0: 
>>> http://www.elasticsearch.org/blog/elasticsearch-1-2-0-released/
>>>  
>>>  -- 
>>> *David Pilato* | *Technical Advocate* | *Elasticsearch.com*
>>> @dadoonet | @elasticsearchfr
>>>  
>>>
>>> On 24 July 2014 at 16:27:27, Anthony Oleary (anthony...@kweekweek.com) wrote:
>>>
>>> Thanks David,
>>> I got the link about gateways from
>>> http://www.elasticsearch.org/tutorials/elasticsearch-on-ec2/
>>>
>>> What would you recommend I use for clustering on EC2 instead of gateways,
>>> and how should I configure the config.yml?
>>>
>>> Anthony
>>>
>>>
>>> On Thursday, July 24, 2014 2:47:32 PM UTC+1, David Pilato wrote: 

  Gateways have been removed. You can't use that anymore.

  -- 
 *David Pilato* | *Technical Advocate* | *Elasticsearch.com*
 @dadoonet | @elasticsearchfr
  

 On 24 July 2014 at 14:47:48, Anthony Oleary (anthony...@kweekweek.com) wrote:

   Hello,
 in EC2, I installed elasticsearch with
 wget 
 http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.0.zip
 sudo unzip elasticsearch-1.3.0.zip -d /usr/local/elasticsearch
 cd /usr/local/elasticsearch/elasticsearch-1.3.0
 sudo bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.1.1

 config.yml is

  cluster.name: elasticsearch-demo-js
 cloud.aws.access_key: xxx..
 cloud.aws.secret_key: yyy...
 cloud.aws.discovery.type: ec2
 gateway.type: s3
 gateway.s3.bucket: codetest

 but I get this error as soon as I add the gateway configuration:
 {1.3.0}: Initialization Failed ...

 - NoClassDefFoundError[org/elasticsearch/gateway/blobstore/BlobStoreGatewayModule]
 - ClassNotFoundException[org.elasticsearch.gateway.blobstore.BlobStoreGatewayModule]

 Any ideas?

 I then tried: sudo bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.2.0, but got

 - NoClassSettingsException[Failed to load class setting [gateway.type] with value [s3]]
 - ClassNotFoundException[org.elasticsearch.gateway.s3.S3GatewayModule]

 Bit lost now!!!
 --
 You received this message because you are subscribed to the Google 
 Groups "elasticsearch" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to elasticsearc...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/elasticsearch/c46bf6f1-ac0d-4437-810a-d2900252d52a%40googlegroups.com
  
 

Re: Error Restoring Snapshot

2014-10-28 Thread Mike Tolman
Never mind; it turns out that I wasn't copying the full directory structure 
correctly when copying the repository files from Server1 to Server2.
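
For anyone hitting the same "unexpected token" error: the copy in step 4 has
to preserve the repository's full directory layout, not just the top-level
files. A sketch (paths are hypothetical):

# copy the snapshot repository recursively, preserving the directory layout
rsync -av /mnt/es-backups/my_repo/ server2:/mnt/es-backups/my_repo/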

On Tuesday, October 28, 2014 11:29:11 AM UTC-6, Mike Tolman wrote:
>
> Hi,
>
> I've been trying to restore an index snapshot and am getting this error in 
> the response:
>
> ElasticsearchParseException[unexpected token  [FIELD_NAME]]
>
> I'm sure I'm just doing something stupid, but I can't figure out what. 
> Does anyone have any idea what I might be doing wrong? 
>
> Here is my basic workflow:
>
> (using ES 1.3.2 -- Server1 and Server2 are separate ES clusters)
>
> 1. Create fs snapshot repository on Server1
> 2. Create snapshot of index 'x' on Server1
> 3. Create fs snapshot repository on Server2
> 4. Copy files from Server1 repository to Server2 repository
> 5. Close index on Server2
> 6. Restore snapshot on Server2
>
> Step 6 always fails for me with the "unexpected token" error.
>
> Thanks in advance,
> Mike
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/2ba371f3-9546-4f0d-a432-bf6f6361bbd1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Returning the main document and the tag that matched in a sub document

2014-10-28 Thread Mike Maddox
Any help would be appreciated. I have a parent document that has an array 
of sub-documents: tags that are associated with an objectid, which is used 
to identify another element. I am able to search on the tags and get a 
response that returns the parent document, which is exactly what I want. 
However, since the tags map to an objectid, I'd like to know which 
keywords matched the tag so I can get the objectid. I can compare the tags 
on the client to figure out which matched; however, with a stemming 
analyzer this wouldn't work, and I'd like to find a better way if possible. 
For example, if I search for "friend and families" I would get back the 
document with id 249184, but I want to find out that we matched a tag 
related to objectid '7'. Any suggestions on whether I'm going in the right 
direction to get the results I need, or would there be another way to 
structure this?


{
  "_index": "myindex",
  "_type": "mytype",
  "_id": "249184",
  "_version": 1,
  "_score": 1,
  "_source": {
"id": 249184,
"info":"I love elasticsearch",
"mytags": [
  {
"objectid": 7,
"tags": [
  "friend and families",
  "brother"
]
  },
  {
"objectid": 3,
"tags": [
  "sister"
]
  }
]
  }
}


The index is defined as follows (has been simplified for this example):

{
    "mydata": {
        "properties": {
            "id": {
                "type": "integer",
                "index": "not_analyzed"
            },
            "info": {
                "type": "string",
                "analyzer": "standard"
            },
            "mytags": {
                "type": "object",
                "properties": {
                    "objectid": { "type": "integer", "index": "not_analyzed" },
                    "tags": { "type": "string", "analyzer": "standard" }
                }
            }
        }
    }
}

Thanks much
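
One direction that survives stemming (a sketch, not a complete solution,
since mapping the matched text back to its objectid still happens on the
client): request highlighting on the tags field, and use the highlighted
fragment to identify which tag matched, and therefore which objectid:

{
  "query": {
    "match": { "mytags.tags": "friend and families" }
  },
  "highlight": {
    "fields": {
      "mytags.tags": {}
    }
  }
}

The highlighter marks the analyzed matches, so a stemmed query term still
lights up the original tag text in the response.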


-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/00fb58c8-dee5-47d5-a463-e754d28d7b33%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Error Restoring Snapshot

2014-10-28 Thread Mike Tolman
Hi,

I've been trying to restore an index snapshot and am getting this error in 
the response:

ElasticsearchParseException[unexpected token  [FIELD_NAME]]

I'm sure I'm just doing something stupid, but I can't figure out what. Does 
anyone have any idea what I might be doing wrong? 

Here is my basic workflow:

(using ES 1.3.2 -- Server1 and Server2 are separate ES clusters)

1. Create fs snapshot repository on Server1
2. Create snapshot of index 'x' on Server1
3. Create fs snapshot repository on Server2
4. Copy files from Server1 repository to Server2 repository
5. Close index on Server2
6. Restore snapshot on Server2

Step 6 always fails for me with the "unexpected token" error.

Thanks in advance,
Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/8edd8603-0387-4d15-af36-30965b89ee84%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: analyzer settings for breaking up words on hyphens

2014-10-27 Thread Mike Topper
Thanks!  I'll go ahead and try the pattern tokenizer route.



On Mon, Oct 27, 2014 at 1:22 PM, Ivan Brusic  wrote:

> You can either use a pattern tokenizer with your patterns being whitespace
> + hypen, or further decompose your token post tokenization with the word
> delimiter token filter, which is much harder to use (and might be an
> overkill for your use case).
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
>
> Cheers,
>
> Ivan
>
> On Mon, Oct 27, 2014 at 7:55 AM, Mike Topper  wrote:
>
>> Hello,
>>
>> I have a field that is using the whitespace tokenizer, but I also want to
>> tokenize on hyphens (-) like the standard analyzer does.  I'm having
>> trouble figuring out what additional custom settings I would have to put in
>> there in order to be able to tokenize off of hyphens as well.
>>
>> Thanks,
>> Mike
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/CALdNedLtdAWEiQN%2BoUV17J5e8DowMbDva2pJn1S%3Dr9w1qtP9bA%40mail.gmail.com
>> <https://groups.google.com/d/msgid/elasticsearch/CALdNedLtdAWEiQN%2BoUV17J5e8DowMbDva2pJn1S%3Dr9w1qtP9bA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQDeFdP4-imY0ReSZTkSAnfQ8o6_hWp9MAB0YcMOgDo9rA%40mail.gmail.com
> <https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQDeFdP4-imY0ReSZTkSAnfQ8o6_hWp9MAB0YcMOgDo9rA%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
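
For reference, a minimal sketch of the pattern-tokenizer approach Ivan
describes (index, tokenizer, and analyzer names are made up): define a custom
tokenizer that splits on whitespace and hyphens, and wire it into a custom
analyzer:

curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "whitespace_hyphen_tokenizer": {
          "type": "pattern",
          "pattern": "[\\s-]+"
        }
      },
      "analyzer": {
        "whitespace_hyphen": {
          "type": "custom",
          "tokenizer": "whitespace_hyphen_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

With this, "foo-bar baz" tokenizes to [foo, bar, baz].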

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNedK9EfeL-FGbavnKO4t%3DkrQ%2BxeQ-O2p2wL-P_iqGSrhrsg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


analyzer settings for breaking up words on hyphens

2014-10-27 Thread Mike Topper
Hello,

I have a field that is using the whitespace tokenizer, but I also want to
tokenize on hyphens (-) like the standard analyzer does.  I'm having
trouble figuring out what additional custom settings I would have to put in
there in order to be able to tokenize off of hyphens as well.

Thanks,
Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNedLtdAWEiQN%2BoUV17J5e8DowMbDva2pJn1S%3Dr9w1qtP9bA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


indexing and searching for string '???'

2014-10-27 Thread Mike Topper
Hello,

When trying a match query on a string field to match the string '???', I am
getting nothing back from Elasticsearch.

It seems like the standard analyzer is just stripping this string out when
tokenizing, probably because it's treating a '?' as the end of a word and
filtering it out?

when doing _analyze?analyzer=standard&pretty' -d 'this is a ???  test'


I get back the response below, which seems to confirm that.  Is there any
way I could still filter out '?' at the end of words, but not strip them
when there are multiple '??'?

{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "test",
    "start_offset" : 15,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
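
For what it's worth, a minimal sketch of one way to keep runs of '?' (field,
index, and type names are made up): map the field with the whitespace
analyzer, which splits only on whitespace and leaves punctuation-only tokens
intact:

curl -XPUT 'localhost:9200/myindex/_mapping/mytype' -d '{
  "properties": {
    "body": { "type": "string", "analyzer": "whitespace" }
  }
}'

Then 'this is a ??? test' analyzes to [this, is, a, ???, test]. The trade-off
is that a single trailing '?' is also kept ('word?' stays 'word?'), so
stripping lone '?' while keeping '??' would need something extra, e.g. a
pattern_replace char filter in a custom analyzer.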

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNed%2BCGeR_92B%3DH%2BnS3FY%3DuiXH0Q6ShJV_Jg_awbQ2bH3sbQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Bulk insert vs Single insert

2014-10-10 Thread mike . giardinelli


Hi All,

The primary dev managing our ES cluster has made the statement that single 
document writes to ES will only give us roughly 30-40 writes a second, 
whereas bulk operations will give us more in the range of 1,000+. I realize 
that bulk is generally faster and that there are hardware / environment 
constraints on any process. However, with other technologies you do not pay 
such a heavy price for single insertions. I am obviously ignorant when it 
comes to ES, but why do you pay such a heavy price for single-document 
writes in ES? Or are we just not properly informed?

Environment:

   - Apache Storm writes to our ES cluster
   - Currently all of the writes are processed in bulk operations.

ES Configuration:


   - 11 data nodes
      - 2x AMD Opteron(TM) Processor 6272 (16 cores @ 2.1/3.0 GHz, 16 MB L3 cache)
      - 256 GB RAM
      - 12 TB (7200 RPM platter disks in LVM ext4 configuration)
   - ES configuration
      - two instances per node (16 cores per instance)
      - 30 GB RAM lock-in per instance (max recommended by ES)
      - 18 shards per index (empirically best combo of RAM vs. shard trade-off)
   

Any information / suggestions would be greatly appreciated.
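
For context, a minimal sketch of the two write paths (index, type, and field
names are hypothetical):

# single-document indexing: one HTTP round trip per document
curl -XPOST 'localhost:9200/myindex/mytype' -d '{ "field1": "value1" }'

# bulk indexing: many documents per round trip, amortizing per-request
# overhead (HTTP handling, routing, per-request translog/sync costs)
curl -XPOST 'localhost:9200/_bulk' -d '
{ "index": { "_index": "myindex", "_type": "mytype" } }
{ "field1": "value1" }
{ "index": { "_index": "myindex", "_type": "mytype" } }
{ "field1": "value2" }
'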


Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/8ad4c98d-34ca-4205-b763-88e1392cf57c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Significant terms aggregation with non tokenized text

2014-09-25 Thread Mike
I just tried using the significant terms aggregation on two text fields I 
have, and noticed that it doesn't seem to work on "non tokenized" fields. 
On my keyword-tokenized field, I get 0 for the bg_count, and the output 
looks the same as a regular terms aggregation with slightly different 
counts. When I use my regular tokenized field, the results differ and I 
have bg_counts. Why is this?

Here are my 2 fields and analyzer:

"properties":{
"query"  : {   
   
"type" : "multi_field", 
  
"fields" : {   
   
"query"  : { "type" : "string" },   
  
"queryUntouched" : { "type" : "string", "analyzer" : 
"myLowercaseAnalyzer" }  
}   
  
}
}

"analyzer" : { 
   
"myLowercaseAnalyzer" : {   
  
"tokenizer" : "keyword",   
   
"filter" : ["lowercase"]   
   
}   
 
}

When I send the significant terms aggregation against queryUntouched it 
looks the same as a regular terms agg, with bg_count set to 0:

"aggs": {
"pop": {
  "terms": {
"field": "queryUntouched",
"size": 3
  }
},
"sig": {
  "significant_terms": {
"field": "queryUntouched",
"size": 3
  }
}
}


aggregations: {
  "pop": {
    "buckets": [
      { "key": "yield curve", "doc_count": 102 },
      { "key": "gdp", "doc_count": 70 }
    ]
  },
  "sig": {
    "doc_count": 62804,
    "buckets": [
      { "key": "yield curve", "doc_count": 102, "score": 7.200895615143776, "bg_count": 0 },
      { "key": "gdp", "doc_count": 81, "score": 4.540783692447051, "bg_count": 0 }
    ]
  }
}


When I use the tokenized field, I get results that I would expect:
"aggs": {
"pop": {
  "terms": {
"field": "query",
"size": 2
  }
},
"sig": {
  "significant_terms": {
"field": "query",
"size": 2
  }
}
  }


aggregations: {
  "pop": {
    "buckets": [
      { "key": "bank", "doc_count": 1423 },
      { "key": "of", "doc_count": 641 }
    ]
  },
  "sig": {
    "doc_count": 62804,
    "buckets": [
      { "key": "bank", "doc_count": 1423, "score": 0.03191767117787348, "bg_count": 25686 },
      { "key": "id", "doc_count": 715, "score": 0.017449718916743313, "bg_count": 12274 }
    ]
  }
}







-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/e7a41870-bb42-46f5-9161-dbeb6c847ad2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


multi_match query spanning multiple analyzer fields

2014-09-24 Thread Mike Topper
Hello,

I'm trying to come up with a query that will search multiple fields, using
the 'and' operator and the cross_fields type.  My goal is that a search for
"quick brown fox" will match documents where all 3 of those words are found,
but not necessarily in the same field.  This works fine if the fields in the
list I give the multi_match query all use the same analyzer, but it does NOT
work when some of the fields use different analyzers.  I've read in the
documentation that this is because ES will group the fields by analyzer and
then do a bool query.

Is there some other query/parameter that would allow me to do what I want
here?  I'm currently not worried about the scoring, just the matching, if
that makes any difference.


-Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNed%2BXtfHSKX8QqRuA8FB8vrajW-ECxk1cGOX24nTZF%3D5%2BBg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Changing tokenizer

2014-09-03 Thread Mike Jones
Hi,

I'm using the PHP library of ES to store and query my data. Currently I'm 
just indexing data like so, without any manual definitions for tokenizers. 
  

$params['body']  = array(
'userid'=> 3 ,
'username'  => 'frank' ,
'postname'  => 'hello_world' ,
'likes' => 33 ,
'created_at'=> '2014-12-12' ,
'data' => array(
'item1' => array(
'type'  => 'tweet' ,
'order' => '1' ,
'id'=> '32343' ,
'created_at'=> '2014-12-12' ,
'text'  => 'blah from twitter' ,
'latitude'  => '45' ,
'longitude' => '23'
)
)
);
   $ret = $client->index($params);

I don't actually know which tokenizers I want right now, so if I add them 
later, will the data need to be re-indexed? 

Also, reading 
http://www.elasticsearch.org/guide/en/elasticsearch/client/php-api/current/_index_operations.html
 
do I need to include the tokenizers with the $params array for every ingest 
process, or can this be set up when I create a new index?
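
For what it's worth, a minimal sketch assuming the 1.x elasticsearch-php
client: analyzers/tokenizers are defined once, at index creation time, not on
every index() call, and changing them later does require reindexing. The
index, type, and analyzer names here are made up.

$params = array(
    'index' => 'posts',
    'body'  => array(
        'settings' => array(
            'analysis' => array(
                'analyzer' => array(
                    'my_analyzer' => array(
                        'type'      => 'custom',
                        'tokenizer' => 'standard',
                        'filter'    => array('lowercase'),
                    ),
                ),
            ),
        ),
        'mappings' => array(
            'post' => array(
                'properties' => array(
                    'postname' => array('type' => 'string', 'analyzer' => 'my_analyzer'),
                ),
            ),
        ),
    ),
);
$client->indices()->create($params);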

Thanks

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/23016ef1-736b-490f-8748-fd24f1c8e743%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


using a nested object field in a multi_match query

2014-08-11 Thread Mike Topper
Hello,

I'm having trouble coming up with how to supply a field within a nested
object in the multi_match fields list.  I'm using the multi_match query in
order to perform query time field boosting, but something like:


  "query": {
"multi_match": {
  "query": "China Mieville",
  "operator": "and",
  "fields": [
"_all", "title^2", "author.name^1.5"
  ]
}
  }

doesn't seem to work.  The title is boosted fine, but in fact if I take out
the "_all" field I can see that author.name is never being used.  Is
there a way to supply nested fields within a multi_match query?
-Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNed%2B4VHQZE%2B%3DCqihZtH223DXR5MR9u49%2BessQ6ybpYbB%3DNg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


field boosting at query time

2014-08-08 Thread Mike Topper
Hello,

Sorry if this question is already explained in the docs, but I tried to go
through all the boosting and custom scoring documentation and didn't find
what I was looking for.

I'm trying to write a query that will give preference to certain fields
when doing a match query on "_all": so, for example, to score
higher when the query matches the title and author fields than it would
when it matches on a subject field.  I've seen various ways to boost scores
based on field values, as well as how to add a boost to the field mappings
at index time, but what if I want to do the field boost at query time
instead?

Here's a query I'm running.  What would be the best way, for example, to add
a 2.0 boost to the title field?

http://pastebin.com/XBwat3zy
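
For reference, query-time field boosts can be attached with carets in a
multi_match field list; a minimal sketch (field names hypothetical):

{
  "query": {
    "multi_match": {
      "query": "some search terms",
      "fields": [ "_all", "title^2", "author^1.5" ]
    }
  }
}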

-Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNedLu0q3c-vyeqbyDN_vaR%3DoBTA45Bk5vd6Fj2X8f0BbcAA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Filtering Aggs by most terms

2014-08-04 Thread Mike M
From a list of terms, I'm trying to find all the records that have those 
terms in a specific field and aggregate on that field. As an example, I 
have a list of records that contain a city name. I'd like the user to 
free-form type words (some of which would be the city name). Then I'd like 
to see the top cities that have the most matching terms in them. This sort 
of works, but it seems to return all the records that have the most single 
terms (lake) rather than the most terms that match. I do eventually get 
"Mayfield Lake", but it can be far down the list since there are only a few 
Mayfields. Any suggestions on the best way to filter, or to only return 
records that contain the most terms? Should this be part of the query? Can 
I sort results in an agg by relevance from the query?
 

  "aggs": {
"cities": {
  "filter": {
"bool": {
  "should": [

[{"term":{"city":"mayfield"}},{"term":{"city":"lake"}},{"term":{"city":"capitol"}},{"term":{"city":"the"}}]
  ]
}
  },
  "aggs": {
"cities": {
  "terms": {
"field": "cityState.raw"
  }
}
  }
},


example result:

aggregations: {
  "cities": {
    "doc_count": 750,
    "cities": {
      "buckets": [
        { "key": "Moses Lake, WA", "doc_count": 250 },
        { "key": "Bonney Lake, WA", "doc_count": 200 },
        ...
      ]
    }
  }
}
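
One pattern that may help (a sketch, not verified against this mapping): move
the terms into the query so document relevance reflects how many terms
matched, then order the terms aggregation by the best score in each bucket
via a max sub-aggregation on _score:

{
  "query": {
    "match": { "city": "mayfield lake capitol the" }
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "cityState.raw",
        "order": { "max_score": "desc" }
      },
      "aggs": {
        "max_score": { "max": { "script": "_score" } }
      }
    }
  }
}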

 




-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/9ee290e0-9014-41a9-ae7d-fb8596616ca5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


architecture and performance question on searching small subsets of documents

2014-07-15 Thread Mike Topper
Hello,

I'm new to Elasticsearch, so this might be a stupid question, but I'd love
some input before I get started creating my Elasticsearch cluster.

Basically I will be indexing documents with a few fields (documents are
pretty small in size).  There are ~90 million documents total.

On the search side of things, each search will be limited to the small
subset of documents that the user doing the search owns.

My initial thought was to just have one large index for all documents and
have a multi-value field that held the user IDs of each user that owned
that document.  Then when searching across the index I would use a filter
to limit by that user ID.  My only concern here is that query times might
be slow, because you are always having to filter down by user ID
from a large data set to a very small subset (on average a user probably
owns fewer than 1k documents).
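
For concreteness, the kind of query I mean (a sketch; field names are made
up):

{
  "query": {
    "filtered": {
      "query": { "match": { "body": "the search terms" } },
      "filter": { "term": { "owner_ids": "user-123" } }
    }
  }
}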

The other option I had is to create an index for each user and
just index their documents into their own index, but this would duplicate a
massive amount of data and just seems hacky.

Any suggestions?

Thanks,
Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CALdNedL-sLM%3DyMWsHHzriBmMwfe08mxVG%3D%3D9tSwxLwiWzfAcyw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


ElasticSearch scaling issue

2014-07-14 Thread Mike
Hello,

We recently had an issue adding a node to our Elasticsearch cluster, and 
would be interested in anything I could do to troubleshoot the issue.

We currently have a 2 node cluster that has been running for some time.  I 
am looking to add a third node to this cluster.  All nodes are running on 
AWS, and we use ec2 discovery to allow nodes to discover each other:

discovery:
type: ec2
ec2:
groups: vpc-elastic-search4


Both nodes are part of the security group vpc-elastic-search4.

We attempted to bring up a new node that was an AMI of one of the existing 
two nodes, and recreated the partition we utilized for storing index data. 
 We then added the node to the vpc-elastic-search4 security group.

However, when we brought up the node, something strange happened causing us 
to abort the attempt.  The status of the nodes showed:

Node1 - Status: green, number of nodes: 2
Node2 - Status: yellow: number of nodes: 2 (relocating and initializing 
shards)
Node3 - Status: yellow: number of nodes: 2 (relocating and initializing 
shards) - This was the new node.

We immediately shut down Node 3.  After that point, Node 2 reported the 
number of nodes as 1, while Node 1 still showed 2 nodes and a green status.

We shut down and restarted Node2, and it properly recovered from Node 1.

It almost seems that starting Node 3 caused Node 2 to become partitioned 
from Node 1, without Node 1 also reflecting that change?  However, that is 
a guess on my part.  I'm reluctant to try again for fear of causing damage.

We are running 0.90.11.  Any insight would be appreciated,
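
In case it helps anyone reading later: two nodes disagreeing about cluster
membership like this looks like a split-brain, and the usual guard (a sketch,
assuming three master-eligible nodes once Node 3 joins) is to require a
majority quorum for master election in elasticsearch.yml:

# quorum = (master-eligible nodes / 2) + 1; for 3 nodes that is 2
discovery.zen.minimum_master_nodes: 2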

-Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/8c86a371-d955-4fe9-b714-4383b075d9d3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Node works slow about 10 seconds after initialization

2014-07-11 Thread Mike Theairkit
Thanks for the answer!
I will test with these options.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/90d6b137-8d88-4e36-a876-c01203b7baed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Node works slow about 10 seconds after initialization

2014-07-11 Thread Mike Theairkit
Typical query: https://gist.github.com/anonymous/20fc650ca2ada3928b0b

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/81aaa013-e4a7-4377-9b34-bdc158c49835%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Node works slow about 10 seconds after initialization

2014-07-11 Thread Mike Theairkit
I ran the tests again and captured hot_threads when the problem occurs.
See the attachment.
When the node is slow, there are threads using more than 60% CPU, versus 
~2-5% CPU in normal operation.
Can you help me interpret this log?

About warmers: I saw the same slow work when using warmers, so for now 
there are no warmers.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/f948cd4a-33ed-442b-8bf4-e4b30a61d3a6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
::: 
[Kafka][5yLycO0mRS2JlNKog3PElA][search2][inet[/172.16.76.22:9300]]{master=true}
   
   51.9% (259.3ms out of 500ms) cpu usage by thread 
'elasticsearch[Kafka][search][T#37]'
 2/10 snapshots sharing following 19 elements
   
org.apache.lucene.search.DisjunctionScorer.heapAdjust(DisjunctionScorer.java:55)
   
org.apache.lucene.search.DisjunctionScorer.nextDoc(DisjunctionScorer.java:131)
   
org.apache.lucene.search.FilteredQuery$QueryFirstScorer.score(FilteredQuery.java:165)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
   
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:173)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:581)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:533)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:510)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:345)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:116)
   
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:330)
   
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:304)
   
org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryAndFetchAction.java:71)
   
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
   
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:203)
   
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:186)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:744)
 8/10 snapshots sharing following 18 elements
   
org.apache.lucene.search.DisjunctionScorer.nextDoc(DisjunctionScorer.java:128)
   
org.apache.lucene.search.FilteredQuery$QueryFirstScorer.score(FilteredQuery.java:165)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
   
org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:173)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:581)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:533)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:510)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:345)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:116)
   
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:330)
   
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:304)
   
org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryAndFetchAction.java:71)
   
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
   
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:203)
   
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:186)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:744)
   
   50.5% (252.3ms out of 500ms) cpu usage by thread 
'elasticsearch[Kafka][search][T#12]'
 4/10 snapshots sharing following 17 elements
   
org.apache.lucene.search.DisjunctionScorer.heapAdjust(DisjunctionScorer.java:55)
   
org.apache.lucene.search.DisjunctionScorer.nextDoc(DisjunctionScorer.java:131)
   
org.apache.lucene.search.FilteredQuery$QueryFirstScorer.sc

Re: Node works slow about 10 seconds after initialization

2014-07-10 Thread Mike Theairkit
I checked my configs; they are now identical, with HEAP set to 4GB.

The problem with slow work after initialization still persists.

I have attached the new cluster settings to this message.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/467da7e4-e9f5-46a4-a0a8-e9bb1a530c7b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
{
  "cluster_name" : "rabota-es",
  "nodes" : {
"VwQ5CEuqSue3ezoOQRcMTA" : {
  "name" : "Kafka",
  "transport_address" : "inet[/172.16.76.22:9300]",
  "host" : "search2",
  "ip" : "172.16.76.22",
  "version" : "1.1.0",
  "build" : "2181e11",
  "http_address" : "inet[/172.16.76.22:9200]",
  "attributes" : {
"master" : "true"
  },
  "settings" : {
"node" : {
  "data" : "true",
  "master" : "true",
  "name" : "Kafka"
},
"index" : {
  "number_of_replicas" : "1",
  "store" : {
"type" : "mmapfs"
  },
  "translog" : {
"flush_threshold_ops" : "5000",
"flush_threshold_period" : "30m",
"flush_threshold_size" : "200mb",
"disable_flush" : "false"
  },
  "search" : {
"slowlog" : {
  "threshold" : {
"fetch" : {
  "trace" : "1s"
},
"query" : {
  "trace" : "1s"
}
  }
}
  },
  "refresh_interval" : "1s"
},
"bootstrap" : {
  "mlockall" : "true"
},
"http" : {
  "port" : "9200",
  "max_content_length" : "10mb"
},
"transport" : {
  "tcp" : {
"port" : "9300",
"connect_timeout" : "1s"
  }
},
"name" : "Kafka",
"action" : {
  "replication_type" : "sync"
},
"pidfile" : "/var/run/elasticsearch.pid",
"path" : {
  "data" : "/data/elasticsearch",
  "work" : "/tmp/elasticsearch",
  "home" : "/usr/share/elasticsearch",
  "conf" : "/etc/elasticsearch",
  "logs" : "/data/logs/elasticsearch"
},
"cluster" : {
  "name" : "rabota-es"
},
"config" : "/etc/elasticsearch/elasticsearch.yml",
"discovery" : {
  "fd" : {
"ping_interval" : "0.5s",
"ping_timeout" : "0.5s",
"ping_retries" : "3"
  },
  "zen" : {
"minimum_master_nodes" : "1",
"ping" : {
  "unicast" : {
"hosts" : [ "172.16.76.21:9300", "172.16.76.22:9300" ]
  },
  "multicast" : {
"enabled" : "false"
  },
  "timeout" : "1s"
},
"publish_timeout" : "1s"
  }
},
"network" : {
  "host" : "172.16.76.22"
}
  },
  "os" : {
"refresh_interval" : 1000,
"available_processors" : 16,
"cpu" : {
  "vendor" : "Intel",
  "model" : "Xeon",
  "mhz" : 2268,
  "total_cores" : 16,
  "total_sockets" : 1,
  "cores_per_socket" : 16,
  "cache_size_in_bytes" : 8192
},
"mem" : {
  "total_in_bytes" : 101374992384
},
"swap" : {
  "total_in_bytes" : 0
}
  },
  "process" : {
"refresh_interval" : 1000,
"id" : 6516,
"max_file_descriptors" : 65535,
"mlockall" : true
  },
  "jvm" : {
"pid" : 6516,
"version" : "1.7.0_51",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "24.51-b03",
"vm_vendor" : "Oracle Corporation",
"start_time" : 1405060169907,
"mem" : {
  "heap_init_in_bytes" : 4294967296,
  "heap_max_in_bytes" : 4181590016,
  "non_heap_init_in_bytes" : 24313856,
  "non_heap_max_in_bytes" : 136314880,
  "direct_max_in_bytes" : 4294967296
},
"gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
"memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor 
Space", "CMS Old Gen", "CMS Perm Gen" ]
  },
  "thread_pool" : {
"generic" : {
  "type" : "cached",
  "keep_alive" : "30s"
},
"index" : {
  "type" : "fixed",
  "min" : 16,
  "max" : 16,
  "queue_size" : "200"
},
"get" : {
  "type" : "fixed",
  "min" : 16,
  "max" : 16,
  "queue_size" : "1k"
},
"snapshot" : {
  "type" : "scaling",
  "min" : 1,
  "max" 

Re: Node works slow about 10 seconds after initialization

2014-07-10 Thread Mike Theairkit
Did you mean ES_HEAP_SIZE?

It is set to 4GB (and the index size used in my tests is about 2GB).

# cat /etc/default/elasticsearch | grep HEAP
ES_HEAP_SIZE=4g

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/db484658-0049-4406-865d-1a5697940771%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Node works slow about 10 seconds after initialization

2014-07-10 Thread Mike Theairkit
Hi

I am testing Elasticsearch in a very simple setup:

2 nodes: 1 master, 1 slave (placed on different physical servers)
1 index with no sharding
The project (PHP code) accesses the Elasticsearch cluster via HAProxy.

Immediately after a node initializes and starts to process requests,
there is a short period (about 10 seconds) when the node processes queries
very slowly: >1s, versus a normal query time in my tests of <100ms.

During this period there is a CPU usage peak (java uses up to 600% vs. 30%
in normal operation).
During this period there is no disk overload (the disk subsystem on the
server is fast enough: RAID1, 2 x SSD).

After this period, the node processes queries quickly (<100ms).

Using fixed thread pools did not change the behavior.

State of queries in this period (search2 - slave):
host    ip            bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
search1 172.16.76.21  0           0          0             0            0           0              0             0            0
search2 172.16.76.22  0           0          0             0            0           0              48            75           0


Slow queries:

[2014-07-10 12:59:18,456][TRACE][index.search.slowlog.fetch] [Kafka] 
[v_v2][0] took[1s], took_millis[1014], types[default-type], stats[], 
search_type[QUERY_AND_FETCH], total_shards[1] ...
[2014-07-10 12:59:18,784][TRACE][index.search.slowlog.fetch] [Kafka] 
[v_v2][0] took[1.3s], took_millis[1313], types[default-type], stats[], 
search_type[QUERY_AND_FETCH], total_shards[1] ...

My questions:
- Is this well-known Elasticsearch behavior?
- If not, what settings can affect and eliminate this behavior?

Thanks in advance!

P.S.: cluster config in attachment.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/199c6774-87eb-4eca-acc4-6229fc613949%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
{
  "cluster_name" : "rabota-es",
  "nodes" : {
"5kbU94sqTm-D-QAZFyg81Q" : {
  "name" : "ZX Spectrum",
  "transport_address" : "inet[/172.16.76.21:9300]",
  "host" : "search1",
  "ip" : "172.16.76.21",
  "version" : "1.1.0",
  "build" : "2181e11",
  "http_address" : "inet[/172.16.76.21:9200]",
  "attributes" : {
"master" : "true"
  },
  "settings" : {
"node" : {
  "data" : "true",
  "master" : "true",
  "name" : "ZX Spectrum"
},
"index" : {
  "number_of_replicas" : "1",
  "store" : {
"type" : "mmapfs"
  },
  "translog" : {
"flush_threshold_ops" : "5000",
"flush_threshold_period" : "30m",
"flush_threshold_size" : "200mb",
"disable_flush" : "false"
  },
  "search" : {
"slowlog" : {
  "threshold" : {
"fetch" : {
  "trace" : "1s"
},
"query" : {
  "trace" : "1s"
}
  }
}
  },
  "refresh_interval" : "1s"
},
"bootstrap" : {
  "mlockall" : "true"
},
"http" : {
  "port" : "9200",
  "max_content_length" : "10mb"
},
"transport" : {
  "tcp" : {
"port" : "9300",
"connect_timeout" : "1s"
  }
},
"name" : "ZX Spectrum",
"action" : {
  "replication_type" : "sync"
},
"pidfile" : "/var/run/elasticsearch.pid",
"path" : {
  "data" : "/data/elasticsearch",
  "work" : "/tmp/elasticsearch",
  "home" : "/usr/share/elasticsearch",
  "conf" : "/etc/elasticsearch",
  "logs" : "/data/logs/elasticsearch"
},
"cluster" : {
  "name" : "rabota-es"
},
"config" : "/etc/elasticsearch/elasticsearch.yml",
"discovery" : {
  "fd" : {
"ping_interval" : "0.5s",
"ping_timeout" : "0.5s",
"ping_retries" : "3"
  },
  "zen" : {
"minimum_master_nodes" : "1",
"ping" : {
  "unicast" : {
"hosts" : [ "172.16.76.21:9300", "172.16.76.22:9300" ]
  },
  "multicast" : {
"enabled" : "false"
  },
  "timeout" : "1s"
},
"publish_timeout" : "1s"
  }
},
"network" : {
  "host" : "172.16.76.21"
}
  },
  "os" : {
"refresh_interval" : 1000,
"available_processors" : 16,
"cpu" : {
  "vendor" : "Intel",
 

Re: How to find the number of authors who have written between 2-3 books?

2014-06-20 Thread Mike
I'm ok with the count returned being some estimate.  Say in this simple 
example if it returned 1 for just Joe, or 3 for John, Joe, and Jack that 
would be ok too.  I am also ok with restructuring my data in any way to 
more efficiently get this number.  

You mentioned creating a reference count document.  How would that look?  1 
doc per unique author, with a count of the total number of books he wrote 
so then I can do a range aggregation on that number?  What if I wanted to 
find "the number of authors who have written between 2-3 books that have a 
title containing E, F, H, or I" (still 2 in this case, John and Joe) ?  
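
For the ref-count model, a sketch of what I'm imagining (field names made
up): one document per author carrying a book_count field that is updated as
books are indexed, then a range aggregation buckets the authors. Note the
range agg's "to" is exclusive, so 2-3 books means from 2 to 4:

{
  "size": 0,
  "aggs": {
    "books_written": {
      "range": {
        "field": "book_count",
        "ranges": [ { "from": 2, "to": 4 } ]
      }
    }
  }
}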



On Thursday, June 19, 2014 6:43:41 PM UTC-4, Itamar Syn-Hershko wrote:
>
> This is a Map/Reduce operation; you'll be better off maintaining a 
> ref-count document IMO than trying to hack the aggregations framework to 
> support this.
>
> Another reason for doing it that way is in a distributed environment some 
> aggregations can't be computed to an exact value - the Terms bucketing is 
> one example. So if you need exact values, I'd go for a model that does it.
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant
> Author of RavenDB in Action <http://manning.com/synhershko/>
>
>
> On Fri, Jun 20, 2014 at 1:34 AM, Mike wrote:
>
>> Assume each document is a book:  
>> { title: "A", author: "Mike" }
>> { title: "B", author: "Mike" }
>> { title: "C", author: "Mike" }
>> { title: "D", author: "Mike" }
>>
>> { title: "E", author: "John" }
>> { title: "F", author: "John" }
>> { title: "G", author: "John" }
>>
>> { title: "H", author: "Joe" }
>> { title: "I", author: "Joe" }
>>
>> { title: "J", author: "Jack" }
>>
>>
>> What is the best way to find the number of authors who have written 
>> between 2-3 books?  In this case it would be 2, John and Joe.
>>
>> I know I can do a terms aggregation on author, set size to be very very 
>> large, and then on the client side traverse through the thousands of 
>> authors and count how many had between 2-3.  Is there a more efficient way 
>> to do this?  The cardinality aggregation is almost what I want, if only I 
>> could specify a min and max term count. 
>>
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/22fc4e6d-bcac-426c-a343-ff1d36fc25de%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/22fc4e6d-bcac-426c-a343-ff1d36fc25de%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/2cab8d84-7c65-4f6e-ab39-3e2a0e859a87%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


How to find the number of authors who have written between 2-3 books?

2014-06-19 Thread Mike
Assume each document is a book:  
{ title: "A", author: "Mike" }
{ title: "B", author: "Mike" }
{ title: "C", author: "Mike" }
{ title: "D", author: "Mike" }

{ title: "E", author: "John" }
{ title: "F", author: "John" }
{ title: "G", author: "John" }

{ title: "H", author: "Joe" }
{ title: "I", author: "Joe" }

{ title: "J", author: "Jack" }


What is the best way to find the number of authors who have written between 
2-3 books?  In this case it would be 2, John and Joe.

I know I can do a terms aggregation on author, set size to be very very 
large, and then on the client side traverse through the thousands of 
authors and count how many had between 2-3.  Is there a more efficient way 
to do this?  The cardinality aggregation is almost what I want, if only I 
could specify a min and max term count. 


-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/22fc4e6d-bcac-426c-a343-ff1d36fc25de%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Cardinality Aggregation Hashes Only

2014-06-07 Thread Mike Sukmanowsky
Hi there,

We're using ES for web analytics purposes and so far, have loved the 
experience.  We create hourly indexes that contain only one type of "url" 
document which has multiple metrics fields like "page_views".  We've 
recently begun looking into how to store more complex metrics that require 
set arithmetic such as "unique views" or "unique visitors".

While the cardinality aggregation 
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html>
 is 
awesome, it seems like it'd be crazy for us to store all the user IDs that 
we saw even for an hour on certain URLs as the number could grow to be very 
large, very quickly.  Just to clarify, this is the document schema I'm 
saying would probably be silly:

{
"url": "http://example.com/";,
"hour": "2014-05-31T03:00:00"
"user_ids": [
"e4c88ac4-ccc7-49e0-9a2e-34ab24420d2b",
"252d0f6e-2e9d-487d-95f4-ac3d53cce977",
"90b5d83b-44d6-4462-9f4b-3ab41e75143e",
"b6c9d0f8-5e4f-4308-92eb-be68d7b06d78",
"7a097ac1-7410-4918-a780-0020197d0b14"
],
"metrics": {
"page_views": 100
}
}

Being fairly new to Lucene and ES, I don't really know what a massive (> 
100K) user_ids array per document would do to ES/Lucene at indexing or 
query time. In addition, although that structure would allow us to query 
for hourly URLs that contained a certain user_id, it's probably beyond our 
current scope.  Precomputing the unique number per hour doesn't help us 
when we want to perform aggregations at query time and know unique users 
across a series of hours.

Toying around with two approaches in my head, and I wanted to get some 
feedback:


   1. Find a way to store only the HLL object in ES but without the actual 
   array of distinct values.  This way, we have the benefit of the cardinality 
   aggregations, but without storing the full set of user_ids.  Is there a way 
   to do this?
   2. Store a binary blob which represents a custom HLL that we'll create 
   and index.  Create a new aggregation for a bitwise OR operation on that 
   binary object which would allow us to union the HLLs in the aggregation and 
   return that result

I lean a little bit more toward solution #2, only because we'd prefer to have 
the HLL's accuracy tuneable instead of relying on ES defaults.
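
One note on option #1: the cardinality aggregation's accuracy is already
tuneable via its precision_threshold parameter (up to 40000), so the ES
default isn't the only choice. A sketch:

"aggs": {
  "unique_users": {
    "cardinality": {
      "field": "user_ids",
      "precision_threshold": 40000
    }
  }
}

That still means indexing the raw IDs, though, so it addresses the
tuneability concern but not the storage one.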

Would love to hear some thoughts on how to solve this kind of issue.

Mike

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/1af2370f-c402-44ac-b05d-fe0b1bee00a8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Is it possible to get a bucketed aggregation based on the count of values for a field?

2014-05-30 Thread Mike
For example, assume I have the following docs:

{user:"Mike"}
{user:"John"}
{user:"Mike"}
{user:"Sara"}
{user:"Sara"}
{user:"Sara"}

I can do a terms agg on user and get:
Sara: 3
Mike: 2
John: 1

What if I didn't care about the actual total number of terms per value, and 
instead just wanted them bucketed into, say, 2 bins: those with counts 
<=1, and those >=2?
Users Showing up >= 2 Times: 2
Users Showing up < 2 Times: 1

The range agg seems to give me the flexibility that I want in creating 
buckets, but that is based on the actual numeric value of a field like 
score or price.  Is there a way to do the above without iterating through 
the thousands of terms myself on the client side?

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/4f36c603-eb3b-45a0-ad73-dd6f97a5a0fa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: What is the difference between common terms query vs match query with cutoff_frequency set

2014-05-30 Thread Mike
Bump

On Monday, May 12, 2014 4:22:51 PM UTC-4, Mike wrote:
>
> I was reading up on the match query and noticed that it has a 
> cutoff_frequency parameter, which seems to do pretty much what the common 
> terms query does.  
>
>1. What is the difference between the common and match queries?
>2. When would I want to use common terms over match?
>3. Ultimately, would the direction be to have common terms query roll 
>up into the match query (with any differences added to match)?
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/cace77ef-9f36-4308-a3b3-2b99aaa17844%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


"All shards failed for phase: [query]" message

2014-05-28 Thread Mike Tolman
Hi,

I have written some integration tests to run against the 1.2.0 version of 
elasticsearch. The tests create indexes, add data to indexes, perform 
searches, and delete indexes. My tests seem to be passing and working 
correctly, but I'm concerned that I often see these messages showing up in 
the logs:

[2014-05-28 12:15:37,604][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,609][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,614][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,635][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,638][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,643][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,644][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,646][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]
[2014-05-28 12:15:37,649][DEBUG][action.search.type   ] 
[ESIntegrationTestNode] All shards failed for phase: [query]

Can anyone help me understand what these messages mean? I'm just worried 
that they may be pointing to some error in my implementation.

Thanks,
Mike



Debugging scripts

2014-05-16 Thread Mike Snare
I'm trying to debug an MVEL script that's failing in production but working 
locally, and I've tried logging a couple of different ways, but I can 
only get logging to work locally.

Both local and production are running 1.1.1, the same exact build. Both are 
running with Java 1.7.0_55.

I've tried the logger approach based on 
https://github.com/imotov/elasticsearch-test-scripts/blob/master/logging_from_script.sh,
but that only works when the server is running under Java 1.6. Under 1.7, 
both local and production fail.

I've tried using System.out.println, but that only works locally, even under 
Java 1.7.  In production I just get "error": 
"ElasticsearchIllegalArgumentException[failed to execute script]; nested: 
PropertyAccessException[[Error: unresolvable property or identifier: 
]\n[Near : {... System.out.println(\"SCRIPT: }]\n ^\n[Line: 12, Column: 
1]]; "

Does anyone have any pointers as to how to debug MVEL or get log output 
from MVEL scripts in ES?
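
For reference, the gist's logger approach boils down to roughly this inside
the script (a sketch, assuming the MVEL sandbox lets you import ES's logging
classes, which may be exactly what's failing under Java 7; my_field is a
placeholder):

// MVEL script body (sketch); logger class is org.elasticsearch.common.logging.ESLoggerFactory
import org.elasticsearch.common.logging.ESLoggerFactory;
logger = ESLoggerFactory.getLogger("script");
logger.info("SCRIPT: my_field = {}", doc['my_field'].value);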



What is the difference between common terms query vs match query with cutoff_frequency set

2014-05-12 Thread Mike
I was reading up on the match query and noticed that it has a 
cutoff_frequency parameter, which seems to do pretty much what the common 
terms query does.  

   1. What is the difference between the common and match queries?
   2. When would I want to use common terms over match?
   3. Ultimately, would the direction be to have common terms query roll up 
   into the match query (with any differences added to match)?
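
For reference, the two are nearly identical on the wire; a minimal sketch
(the field name "body" and the 0.1% threshold are arbitrary choices, not
taken from any docs):

{
  "query": {
    "common": {
      "body": {
        "query": "the brown fox",
        "cutoff_frequency": 0.001
      }
    }
  }
}

versus the match form:

{
  "query": {
    "match": {
      "body": {
        "query": "the brown fox",
        "cutoff_frequency": 0.001
      }
    }
  }
}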



Re: Filter any documents containing any of several values in a number of fields

2014-04-29 Thread Mike Snare
Bump



Filter any documents containing any of several values in a number of fields

2014-04-28 Thread Mike Snare
Is there any way to construct a filter that will filter out any documents 
that match any values in an array from, say, 2 different fields?

Imagine a document that stores friendships/recommendations/etc, where the 
document might store ids for each user as user_1 and user_2, and then other 
information in whatever other fields you'd need.  The important thing is 
the user_1 and user_2 fields, but here's an idea of what the doc might look 
like:

{
  user_1: 12345,
  user_2: 23456,
  facebook_friends: true,
  twitter_friends: false,
  comments: "blahblahblahh"
}

I would like to execute a query that returns a set of documents but 
excludes any friendships for a list of user ids.  These excluded user_ids 
might be listed as user_1 in some docs, but might be listed as user_2 in 
others.  I can easily construct a terms query for each field that will say 
"exclude where user_1 is any of these values AND THEN exclude where user_2 
is any of these values" but what I'd like to do is say "exclude where 
either user_1 or user_2 are any of these values".

Right now I have to do this:

{
  query: {
    filtered: {
      query: {
        // whatever...
      },
      filter: {
        bool: {
          must_not: [
            {
              terms: {
                user_1: [12345, 23456]
              }
            },
            {
              terms: {
                user_2: [12345, 23456]
              }
            }
          ]
        }
      }
    }
  }
}

Some of these filters end up getting pretty large, and I'd like to be able 
to avoid duplicating the array within the same query if I can.  Something 
like this would be awesome, but I can't figure out if there's any way to do 
it:

{
  query: {
    filtered: {
      query: {
        // whatever...
      },
      filter: {
        not: {
          multi_match: {
            terms: ["user_1", "user_2"],
            values: [12345, 23456]
          }
        }
      }
    }
  }
}
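
One workaround that avoids duplicating the array, assuming reindexing is an
option: copy both ids into a single combined field with copy_to in the
mapping (friendship and users are made-up names), then a single terms filter
covers both. A sketch:

{
  mappings: {
    friendship: {
      properties: {
        user_1: { type: "long", copy_to: "users" },
        user_2: { type: "long", copy_to: "users" },
        users: { type: "long" }
      }
    }
  }
}

The exclusion then becomes a single filter:

{
  filter: {
    not: {
      terms: {
        users: [12345, 23456]
      }
    }
  }
}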

Thanks,
-Mike



Re: ElasticSearch for higher frequency time series data

2014-04-08 Thread Mike Sam
Well, we are thinking of storing both logs and metrics in one place. In
theory it could work, but I am wondering if anybody has actually tried that?


On Tue, Apr 8, 2014 at 10:37 PM, Mark Walkom wrote:

> Any reason for ES and not something like graphite?
>
> ES should be able to do it though.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
>
> On 9 April 2014 15:30,  wrote:
>
>> 1> Let's assume the time series data comes from distributed nodes, like
>> collecting CPU counters on hundreds of servers in the DC at high
>> frequency (4 times a second). Does anybody have any experience using
>> ElasticSearch as the time series data storage and query engine? Any
>> gotchas doing so?
>>
>> 2> Also, is logstash the way to collect and send the data from each host
>> to ES?
>>
>> Thank you,
>>
>> Mike
>>
>>



-- 
Thanks,
Mike



Re: Rolling restart of a cluster?

2014-04-02 Thread Mike Deeks
That is exactly what I'm doing. For some reason the cluster reports as 
green even though an entire node is down. The cluster doesn't seem to 
notice the node is gone and change to yellow until many seconds later. By 
then my rolling restart script has already gotten to the second node and 
killed it because the cluster was still green for some reason.
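
For what it's worth, here is a concrete version of the procedure Petter
outlines in the quoted reply below; a minimal bash sketch, assuming ES 1.x
endpoints, a hypothetical node list, and an init-script restart:

#!/bin/bash
# Rolling restart sketch for ES 1.x. HOST, the node names, and the
# restart command are assumptions, not taken from this thread.
HOST="localhost:9200"

wait_for_green() {
  # Block until the cluster reports green (or the timeout expires).
  curl -s "http://$HOST/_cluster/health?wait_for_status=green&timeout=10m" > /dev/null
}

set_allocation() {
  # $1 is "none" or "all".
  curl -s -XPUT "http://$HOST/_cluster/settings" -d "{
    \"transient\": { \"cluster.routing.allocation.enable\": \"$1\" }
  }" > /dev/null
}

for node in es-node-1 es-node-2 es-node-3; do
  wait_for_green
  set_allocation none
  ssh "$node" "sudo service elasticsearch restart"
  sleep 30    # crude: give the node time to rejoin before re-enabling
  set_allocation all
  wait_for_green
done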

On Wednesday, April 2, 2014 4:23:32 AM UTC-7, Petter Abrahamsson wrote:
>
> Mike,
>
> Your script needs to check the status of the cluster before shutting 
> down a node, i.e. if the state is yellow, wait until it becomes green again 
> before shutting down the next node. You'll probably want to disable 
> allocation of shards while each node is being restarted (and re-enable it 
> when the node comes back) in order to minimize the amount of data that 
> needs to be rebalanced.
> Also make sure to have 'discovery.zen.minimum_master_nodes' correctly set 
> in your elasticsearch.yml file.
>
> Meta code
>
> for node in $cluster_nodes; do
>   if [ $cluster_status == 'green' ]; then
> cluster_disable_allocation()
> shutdown_node($node)
> wait_for_node_to_rejoin()
> cluster_enable_allocation()
> wait_for_cluster_status_green()
>   fi
> done
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html
>
> /petter
>
>
> On Tue, Apr 1, 2014 at 6:19 PM, Mike Deeks wrote:
>
>> What is the proper way of performing a rolling restart of a cluster? I 
>> currently have my stop script check for the cluster health to be green 
>> before stopping itself. Unfortunately this doesn't appear to be working.
>>
>> My setup:
>> ES 1.0.0
>> 3 node cluster w/ 1 replica.
>>
>> When I perform the rolling restart I see the cluster still reporting a 
>> green state when a node is down. In theory that should be a yellow state 
>> since some shards will be unallocated. My script output during a rolling 
>> restart:
>> 1396388310 21:38:30 dev_cluster green 3 3 1202 601 2 0 0
>> 1396388310 21:38:30 dev_cluster green 3 3 1202 601 2 0 0
>> 1396388310 21:38:30 dev_cluster green 3 3 1202 601 2 0 0
>>
>> 1396388312 21:38:32 dev_cluster green 3 3 1202 601 2 0 0
>> 1396388312 21:38:32 dev_cluster green 3 3 1202 601 2 0 0
>> 1396388312 21:38:32 dev_cluster green 3 3 1202 601 2 0 0
>>
>> curl: (52) Empty reply from server
>> 1396388313 21:38:33 dev_cluster green 3 3 1202 601 2 0 0
>> 1396388313 21:38:33 dev_cluster green 3 3 1202 601 2 0 0
>>
>> curl: (52) Empty reply from server
>> 1396388314 21:38:34 dev_cluster green 3 3 1202 601 2 0 0
>> 1396388314 21:38:34 dev_cluster green 3 3 1202 601 2 0 0
>> ... continues as green for many more seconds...
>>
>> Since it is reporting as green, the second node thinks it can stop and 
>> ends up putting the cluster into a broken red state:
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388339 21:38:59 dev_cluster green 2 2 1202 601 2 0 0
>>
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388341 21:39:01 dev_cluster yellow 2 2 664 601 2 8 530
>>
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388342 21:39:02 dev_cluster yellow 2 2 664 601 2 8 530
>>
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388343 21:39:03 dev_cluster yellow 2 2 664 601 2 8 530
>>
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388345 21:39:05 dev_cluster yellow 1 1 664 601 2 8 530
>>
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388346 21:39:06 dev_cluster yellow 1 1 664 601 2 8 530
>>
>> curl: (52) Empty reply from server
>> curl: (52) Empty reply from server
>> 1396388347 21:39:07 dev_cluster red 1 1 156 156 0 0 1046
>>
>> My stop script issues a call to 
>> http://localhost:9200/_cluster/nodes/_local/_shutdown to kill the node. 
>> Is it possible the other nodes are waiting to timeout the down node before 
>> moving into the yellow state? I would assume the shutdown API call would 
>> inform the other nodes that it is going down.
>>
>> Appreciate any help on how to do this properly.
>>

Rolling restart of a cluster?

2014-04-01 Thread Mike Deeks
What is the proper way of performing a rolling restart of a cluster? I 
currently have my stop script check for the cluster health to be green 
before stopping itself. Unfortunately this doesn't appear to be working.

My setup:
ES 1.0.0
3 node cluster w/ 1 replica.

When I perform the rolling restart I see the cluster still reporting a 
green state when a node is down. In theory that should be a yellow state 
since some shards will be unallocated. My script output during a rolling 
restart:
1396388310 21:38:30 dev_cluster green 3 3 1202 601 2 0 0
1396388310 21:38:30 dev_cluster green 3 3 1202 601 2 0 0
1396388310 21:38:30 dev_cluster green 3 3 1202 601 2 0 0

1396388312 21:38:32 dev_cluster green 3 3 1202 601 2 0 0
1396388312 21:38:32 dev_cluster green 3 3 1202 601 2 0 0
1396388312 21:38:32 dev_cluster green 3 3 1202 601 2 0 0

curl: (52) Empty reply from server
1396388313 21:38:33 dev_cluster green 3 3 1202 601 2 0 0
1396388313 21:38:33 dev_cluster green 3 3 1202 601 2 0 0

curl: (52) Empty reply from server
1396388314 21:38:34 dev_cluster green 3 3 1202 601 2 0 0
1396388314 21:38:34 dev_cluster green 3 3 1202 601 2 0 0
... continues as green for many more seconds...

Since it is reporting as green, the second node thinks it can stop and ends 
up putting the cluster into a broken red state:
curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388339 21:38:59 dev_cluster green 2 2 1202 601 2 0 0

curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388341 21:39:01 dev_cluster yellow 2 2 664 601 2 8 530

curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388342 21:39:02 dev_cluster yellow 2 2 664 601 2 8 530

curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388343 21:39:03 dev_cluster yellow 2 2 664 601 2 8 530

curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388345 21:39:05 dev_cluster yellow 1 1 664 601 2 8 530

curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388346 21:39:06 dev_cluster yellow 1 1 664 601 2 8 530

curl: (52) Empty reply from server
curl: (52) Empty reply from server
1396388347 21:39:07 dev_cluster red 1 1 156 156 0 0 1046

My stop script issues a call 
to http://localhost:9200/_cluster/nodes/_local/_shutdown to kill the node. 
Is it possible the other nodes are waiting for the down node to time out 
before moving into the yellow state? I would assume the shutdown API call 
would inform the other nodes that it is going down.

Appreciate any help on how to do this properly.



Histogram of high-cardinality aggregate

2014-02-25 Thread Mike Kaplinskiy
Hey folks,

Playing around with the aggregation API, I was wondering whether this is 
possible. Taking the example at 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-nested-aggregation.html,
how would I get the histogram of the minimum price [not all prices] of 
all the products?
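
One workaround sketch, assuming the per-product minimum can be computed at
index time and stored as a top-level numeric field (min_price is a made-up
name): a plain histogram agg then works without touching the nested docs:

{
  "size": 0,
  "aggs": {
    "min_price_histogram": {
      "histogram": {
        "field": "min_price",
        "interval": 10
      }
    }
  }
}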
