Re: Do I tell the world to hit one node? Or many? Or load balance?

2014-12-04 Thread Christopher Ambler
I'm curious why no data. Wouldn't having the data local mean faster lookups?

On Wednesday, December 3, 2014 1:14:10 PM UTC-8, Christian Hedegaard wrote:
>
>  In our environment our cluster is inside EC2/VPC. We have an ELB in 
> front of the cluster. We use DNS to assign a CNAME to the ELB for easier 
> internal use. The cluster is currently at 15 nodes, 3 of which are “master 
> only, no data” and associate themselves with the ELB. The ELB balances 
> requests to/from the master nodes. The master nodes are slightly smaller in 
> memory, but faster in CPU than the rest of the nodes so they can quickly 
> serve requests. The rest of the nodes are “data only” nodes. They are not 
> master eligible and they just store and serve data to/from the masters via 
> the ELB.
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/a7b92cd3-1cd9-4d58-bb66-17e7aa02647b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Do I tell the world to hit one node? Or many? Or load balance?

2014-12-03 Thread Christopher Ambler
So you're saying, in essence, YES, I should try to balance search requests 
across all nodes and not just one node.

The method may be debatable, but the underlying answer is YES, distribute 
among machines.

(Just being sure I understand).
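
For consumers that talk to the cluster directly instead of going through a virtual IP, distributing requests can be as simple as rotating through the node addresses. A minimal sketch, assuming five hypothetical hostnames (es-node-1 through es-node-5) and the default HTTP port 9200. It only picks the next URL and does not send anything:

```python
import itertools

# Hypothetical node addresses; replace with the real hosts in the cluster.
NODES = ["http://es-node-%d:9200" % i for i in range(1, 6)]


def round_robin(nodes):
    """Return a function that hands out the next node URL on each call."""
    cycle = itertools.cycle(nodes)
    return lambda: next(cycle)


next_node = round_robin(NODES)
first_six = [next_node() for _ in range(6)]
# After one full pass over the five nodes, the rotation wraps around.
```

A load balancer in front of the nodes achieves the same spread without every client needing the host list.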

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/622e1a1a-5ed7-4cb3-8873-f993bea1d847%40googlegroups.com.


Do I tell the world to hit one node? Or many? Or load balance?

2014-12-03 Thread Christopher Ambler
I have a nice, performant cluster of 5 nodes. They're all on separate 
machines on the same switch. Life is good.

Now...

Do I tell the consumers of my Elasticsearch cluster to hit any of the five 
nodes as suits their fancy? Or do I give them the name of ONE node? If so, 
is that node configured any differently?

Or do I put all five behind a virtual IP and load balance them?

I can't find any documentation on best practices here.

Thoughts?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/51237fd8-eefa-4d2b-9c1a-296dd391e98d%40googlegroups.com.


How to craft a data transformation - use script?

2014-11-25 Thread Christopher Ambler
Pretty simple requirement:

Find all documents where the field "vendorid" has no value and give it a 
value of "6"

Would I do this with an update call and use script? I'm a little unclear 
how. Or is there a more canonical way to do something like this?
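
One possible approach, sketched under assumptions: search for documents with no value in the field, then send partial updates for the returned ids through the bulk API. The search body below uses the `missing` filter from the 1.x query DSL. The index name `my-index` and the id `example.com` are placeholders, and fetching the ids (for example with a scan/scroll search) is left out:

```python
import json


def missing_field_query(field):
    """Search body matching documents that have no value for `field`.

    Uses the `missing` filter from the 1.x query DSL; newer releases
    express the same thing as a bool query with must_not/exists.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"missing": {"field": field}},
            }
        }
    }


def bulk_set_field(index, doc_type, ids, field, value):
    """Build a _bulk body of partial updates that set `field` to `value`."""
    lines = []
    for doc_id in ids:
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"doc": {field: value}}))
    # The bulk API expects newline-delimited JSON with a trailing newline.
    return "\n".join(lines) + "\n"


query = missing_field_query("vendorid")
body = bulk_set_field("my-index", "premium", ["example.com"], "vendorid", 6)
```

A partial `doc` update avoids scripting entirely, which may matter on clusters where dynamic scripting is disabled.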

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c4bf0d53-4cf8-4a58-934d-10e062a477ee%40googlegroups.com.


Re: Odd behavior of bulk loading speed - good riddle?

2014-11-25 Thread Christopher Ambler
Refactoring my statement from script to a straight update { doc, 
doc_as_upsert } seems to have done the trick. So rather than diagnose 
what's odd about the script, this has resolved my issue. Yeah, lazy 
solution, but a more optimal one ;)
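
For reference, the update API flag for this is `doc_as_upsert`. A rough sketch of the two bulk lines such a statement might produce, with a placeholder index name and id and only two of the auction fields filled in:

```python
import json


def doc_as_upsert_action(index, doc_type, doc_id, fields):
    """Return the two bulk lines for an update that doubles as an insert."""
    action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
    # With doc_as_upsert, `doc` is merged into an existing document or
    # indexed as-is when the document does not exist yet.
    payload = {"doc": fields, "doc_as_upsert": True}
    return json.dumps(action) + "\n" + json.dumps(payload) + "\n"


lines = doc_as_upsert_action(
    "my-index", "premium", "example.com",
    {"auctionid": 6623102, "auctionprice": 4488},
)
```

Note that with this form the same set of fields is written on insert and on update, unlike a script plus a separate `upsert` document.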

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/756a3c30-5567-4ba8-9056-a775d55d4fa0%40googlegroups.com.


Re: Odd behavior of bulk loading speed - good riddle?

2014-11-23 Thread Christopher Ambler
Nobody?

No ideas why bulk upserts slow down over time?

Loading 9 million documents starts off at 2000+ per second and, by hour 
three, is down to 300 per second. The whole job takes the better part of 8 
hours, with this linear slowdown.

Nobody has an idea? I'm drawing a blank, myself!

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/fb09e1cc-c4a3-4484-ba21-128d60b68ee7%40googlegroups.com.


Node spikes to 1000 threads and hangs, once or twice a day. Help?

2014-11-21 Thread Christopher Ambler
Odd behavior - our 5-node cluster hums along happily but then, once or 
twice a day, one node pops to all 1000 threads (the default limit) being 
hit and the node becomes unresponsive, causing our whole cluster to become 
extremely slow.

Has anyone experienced this? Any good way to diagnose this?

Memory and CPU appear normal, if that helps... I'm not even sure where to 
start here.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0a92150b-ea74-410a-a5e1-8c9de24c461b%40googlegroups.com.


Re: Odd behavior of bulk loading speed - good riddle?

2014-11-20 Thread Christopher Ambler
The statement, if that helps (this is a line of PHP, hence the $ variables):

"{\"script\" : \"ctx._source.auctionid=$auctionID; 
ctx._source.auctiontype=$auctionType; 
ctx._source.auctionstatus=$auctionStatus; 
ctx._source.auctionprice=$auctionPrice; 
ctx._source.auctionendtime='$auctionEndTime'; 
ctx._source.auctionadult=$adultListingFlag;\", \"upsert\": { \"auctionid\": 
$auctionID, \"auctiontype\": $auctionType, \"auctionstatus\": 
$auctionStatus, \"auctionprice\": $auctionPrice, \"auctionendtime\": 
\"$auctionEndTime\", \"auctionadult\": $adultListingFlag, \"domaintype\": 
\"auction\", \"fqdn\": \"$fqdn\", \"sld\": \"$sld\", \"tld\": \"$tld\", 
\"vendorid\": 6, \"price\": 0, \"commissionrate\": 0, \"isfasttransfer\": 
false, \"isadult\": $aFlag, \"istaboo\": $tFlag, \"sldlen\": $sldlen, 
\"numhyphens\": $numhyphens, \"numdigits\": $numdigits, \"tokens\": " . 
(($tokens == null) ? '""' : json_encode($tokens)) . "}}"

Creates a document if it doesn't exist, updates it if it does.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/4173f9b5-1d46-49a8-9647-c01618ee97e9%40googlegroups.com.


Odd behavior of bulk loading speed - good riddle?

2014-11-20 Thread Christopher Ambler
So this has me perplexed.

I have a bulk data loading job that creates an upsert statement and batches 
500 of them in a bulk operation using the _bulk interface.

I send the bulk insert via HTTP (on 9200) and wait for the response before 
sending the next one, which I do immediately.

I do not hit any thread pool limits.

I have replicas set to zero and refresh interval set to -1 to make the 
loading as lightweight as possible.

Timing these, they start out pretty fast and run about 2000 documents per 
second. Four or so HTTP round trips.

This lasts for a few minutes and then it starts to slow. Within an hour, 
it's running about 1200 per second. In another hour, it's down to about 600 
per second. Then it seems to flatten out at about 400 per second until the job 
is done, some 8 million documents later.

So my question is - why the slowdown? It's very consistent, seems 
reasonably linear, and happens 100% of the time.

Any clues?
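
To pin down where the slowdown starts, it may help to record the rate of every batch rather than an overall average. A minimal sketch of the loop described above, assuming a hypothetical `send` callable that posts one batch to `_bulk` and blocks until the response arrives (here it is replaced by a stand-in that just collects the batches):

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load(items, send, size=500):
    """Send batches one at a time and record documents per second for each."""
    rates = []
    for batch in batches(items, size):
        started = time.perf_counter()
        send(batch)  # blocks until the (hypothetical) bulk call returns
        elapsed = time.perf_counter() - started
        rates.append(len(batch) / elapsed if elapsed > 0 else float("inf"))
    return rates


sent = []
rates = load(list(range(1200)), sent.append)
```

Plotting `rates` against the batch number would show whether the decline is really linear or happens in steps.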


To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/a787d461-f467-4f79-943b-e65e12492783%40googlegroups.com.


Re: Rip apart my cluster architecture, would you? ;)

2014-11-18 Thread Christopher Ambler
Our data set changes constantly, but a refresh every 24 hours is sufficient 
for our needs.

I could use a river or some kind of data loader to keep up to date, but 
it's really not necessary if I just create a new index once a day.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/27d5a875-270d-4bff-88b1-d3a6049241fe%40googlegroups.com.


Rip apart my cluster architecture, would you? ;)

2014-11-18 Thread Christopher Ambler
So we've been using ES for a while now, and I have an architecture I've set 
up that I'm absolutely not 100% sure is right. I'd like to lay it out and 
see if anyone can tell me where I might be going wrong.

We have, as our data set, roughly 10 million documents. Each one represents 
a product and then a bunch of data on that product suitable for queries. 
Our queries are pretty good (because someone else writes them :-)) and we 
get the results we want.

We have five nodes. Three are in one data center (call it data center M) 
and two are in another (call it data center B). There is a nice, fat pipe 
between the two so communication is acceptable.

I replicate every shard on every node. We have plenty of disk space, the 
data set isn't so huge that it fills up memory, and I really do want to 
optimize for reads. The reason for that is that we re-load our index once 
per day in the middle of the night.

To do this, I create a new index, load all the data, and then move an index 
alias from the old to the new. No downtime. I wrote a job that loads the 
data via the bulk API. I'm pretty happy with this, too.
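
The alias move can be done in a single request to the `_aliases` endpoint, so readers never see a moment without an index behind the alias. A sketch of the request body, assuming a hypothetical alias name `aftermarket` and a made-up name for the previous day's index:

```python
import json


def alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases that repoints `alias` in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = json.dumps(alias_swap(
    "aftermarket",                      # hypothetical alias name
    "aftermarket-2014-07-30_02-38-19",  # hypothetical previous index
    "aftermarket-2014-07-31_02-38-19",
))
```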

In the M data center, machine M1 is the one I use to load the data. It is 
NOT in our load balancing rotation for reads. Machines M2 and M3 are, as 
are both machines in data center B.

All M machines are master=true data=true. All B machines are master=false 
data=true. The reason I made B machines master=false was so that while 
building the new index nightly on M1, it doesn't have to go to a B machine 
as the master. I presume this is wise. I'm not sure.

I write in batches of 2000 documents and get about 1300 documents per 
second on write speeds.

I also have ONE job that does scripted upserts in batches of 1000 each that 
gets about 300 documents per second. This is slower than I'd like. I'm 
unsure how I might speed this up.

So... anything stand out as bad?

Could I maybe speed up writes by turning replication off while writing and 
then back on when done, so that my cluster isn't updating every node during 
the writes? Since I keep the index alias pointed at the previous index 
until the new one is ready, this should be okay, right?
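
The replica count is a dynamic index setting, so it can be changed with a PUT to `/<index>/_settings` before and after the load. A small sketch of the two request bodies, assuming a five-node cluster where every node should end up with every shard:

```python
def replica_settings(count):
    """Body for PUT /<index>/_settings that changes the replica count."""
    return {"index": {"number_of_replicas": count}}


before_load = replica_settings(0)  # no replicas while the bulk job runs
after_load = replica_settings(4)   # one copy per node on a 5-node cluster
```

Raising the count afterwards makes the cluster copy the finished shards to the other nodes, so the alias should only move once that recovery has completed.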

Anything I might be missing?

THANK YOU TONS if you can chime in. ES is wonderful, but as we all know, 
there's a lot to learn!

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/9348104d-efa7-42ae-baac-f1c63d849e6c%40googlegroups.com.


Re: Why not have maximum replication

2014-11-17 Thread Christopher Ambler
I don't see how that's an issue.

Data needs to be stored somewhere. If I'm hitting disk often, I'm going to 
do so no matter what, no matter where.

So why not have the data on all nodes?

On Monday, November 17, 2014 12:11:15 PM UTC-8, Christopher Ambler wrote:
>
> Can someone explain to me why, if disk space is not an issue, I don't want 
> maximum replication such that every node has every shard?
>
> It seems to me that there would be no real downside here as long as I'm 
> not worried about filling up a disk and my updates happen infrequently and 
> in a timely manner.
>
> Am I missing the obvious? ;)
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6e54e3dc-c9f3-4428-977e-fc3f0e64da8d%40googlegroups.com.


Re: Diagnosing why one node is getting no shards

2014-11-17 Thread Christopher Ambler
I found my own answer - but for anyone who might have this issue, ensure 
that all nodes are running the same version of ES. Apparently when they're 
not, older versions will not get shards.

Who knew? :-)

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/9789c9d0-4b41-481b-826a-b11ee45b649a%40googlegroups.com.


Why not have maximum replication

2014-11-17 Thread Christopher Ambler
Can someone explain to me why, if disk space is not an issue, I don't want 
maximum replication such that every node has every shard?

It seems to me that there would be no real downside here as long as I'm not 
worried about filling up a disk and my updates happen infrequently and in a 
timely manner.

Am I missing the obvious? ;)

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/897baef9-225e-450f-ab5d-a114ff54be9d%40googlegroups.com.


Diagnosing why one node is getting no shards

2014-11-13 Thread Christopher Ambler
I have an Elasticsearch cluster of five nodes. I have the configuration on 
each set the same with 5 shards per index and 4 replicas. The idea is that 
every node will have every shard.

Four of my nodes have five shards. One node has ALL of them primary. One 
node has NOTHING. And then, of course, I have 5 unallocated shards.
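
The numbers are consistent with one node being excluded from allocation. Two copies of the same shard are never placed on one node, so with only four nodes accepting shards, one copy of each of the five shards has nowhere to go. The arithmetic, as a quick sketch:

```python
def shard_copies(primaries, replicas):
    """Total shard copies: each primary plus `replicas` copies of it."""
    return primaries * (1 + replicas)


def copies_per_node(primaries, replicas, nodes):
    """Copies each node holds when they are spread evenly."""
    return shard_copies(primaries, replicas) / nodes


total = shard_copies(5, 4)            # 25
per_node = copies_per_node(5, 4, 5)   # 5.0
unplaced = total - 4 * 5              # 5 copies left over with four usable nodes
```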

I reload a new index every day and this is exactly how it allocates shards 
every time.

I need to find out why the one node isn't getting any shards. That's bad.

It would be easy for me to ask why this is happening - and if anyone knows, 
that would be fantastic. But as I can't seem to find ANYTHING online or in 
the documentation to explain this, I must ask, perhaps, how I can diagnose 
it? Any clues? Anything I can look at to give a clue here?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/76d46f15-447c-4797-860e-fc8ae3ec09dd%40googlegroups.com.


Writing an article - topics to cover?

2014-08-14 Thread Christopher Ambler
I've been tasked to write an article (that will be public-facing) on my 
experience setting up ElasticSearch as part of the project I'm working on 
at GoDaddy.

I'd like to solicit input on any topics that I might want to make sure I 
don't miss.

The scope is installation, configuration, and basic monitoring (head and 
marvel) as well as the decent number of issues I came across and how I 
solved them. Indeed, I'll also cover some of the issues that I'm *still* 
having and how I'm working on them.

Since this will be public-facing, though, I'd love to know what others 
would like to see in such an article. If I can, I'll try to cover those 
topics to give it a bit more value.

Thanks!

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0a8f4d62-0f3f-4cf9-ab1f-3d8043245b7d%40googlegroups.com.


Re: Unallocated shards with empty nodes

2014-08-13 Thread Christopher Ambler
Nobody?

I've seen many people posting with similar issues - has nobody encountered 
this?

Anyone from ES care to comment?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/92917452-a079-45fe-9f34-a200a855c8f6%40googlegroups.com.


Re: Unallocated shards with empty nodes

2014-08-12 Thread Christopher Ambler
To add a little more information, the six nodes are broken up into three 
groups of two. The first pair has node.zone: first, the second pair has 
node.zone: second, and the third pair has node.zone: third.

I also have "cluster.routing.allocation.awareness.attributes: zone" in my 
config.

So as you can see, the empty nodes are the two that are "node.zone: second" 
in their config file.

Could this be involved?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/452f4419-265b-4e37-aaeb-094031914bc3%40googlegroups.com.


Unallocated shards with empty nodes

2014-08-12 Thread Christopher Ambler



As you can see by the image snapped from 'head,' I have unallocated shards 
while a couple of nodes are empty. This happened when I recycled some 
nodes. The last two indexes are freshly created and look good. (5 shards, 5 
replicas + master).

Nothing I do gets those unallocated shards back on a node.

Anyone have a clue as to what I'm doing wrong?


To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7936bcbf-a1f6-421a-9935-4618e056daf2%40googlegroups.com.


Re: Bulk - insert if not exists, update otherwise?

2014-08-11 Thread Christopher Ambler
Okay, so now here's where I am - I read up on upsert and crafted my bulk 
stack. It seems to work:

{"update":{"_index":"aftermarket-2014-08-11_02-38-19","_type":"premium","_id":"kryptonblue.com"}}
{"script" : "ctx._source.auctionid=6623102; ctx._source.auctiontype=18; ctx._source.auctionstatus=4; ctx._source.auctionprice=4488; ctx._source.auctionendtime='Oct 7 2014 09:10:00:000AM'; ctx._source.auctionadult=false;", "upsert": { "auctionid": 6623102, "auctiontype": 18, "auctionstatus": 4, "auctionprice": 4488, "auctionendtime": "Oct 7 2014 09:10:00:000AM", "auctionadult": false, "domaintype": "auction", "sld": "kryptonblue", "tld": "com", "vendorid": 0, "price": 0, "commissionrate": 0, "isfasttransfer": false, "tokens": ["krypton","blue"]}}


This seems to do the right thing. Here's the result I get back:

[0] => stdClass Object
    (
        [update] => stdClass Object
            (
                [_index] => aftermarket-2014-08-11_02-38-19
                [_type] => premium
                [_id] => kryptonblue.com
                [_version] => 1
                [status] => 201
            )

    )

This was done as an insert - there was nothing to update. So 201 seems right. 
What can I expect on an update? A straight 200?
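
As far as I know, an update that hits an existing document reports status 200 and a higher `_version`, while 201 marks a newly created one. Under that assumption, a bulk response can be tallied like this. The items below are hypothetical and only mirror the shape of the result shown above:

```python
from collections import Counter


def summarize(items):
    """Count bulk item results by HTTP-style status code."""
    return Counter(
        result["status"]
        for item in items
        for result in item.values()
    )


# Hypothetical response items shaped like the one shown above.
items = [
    {"update": {"_id": "kryptonblue.com", "_version": 1, "status": 201}},
    {"update": {"_id": "example.com", "_version": 2, "status": 200}},
]
counts = summarize(items)
```

Any other status code in the tally would point at items that failed and need a closer look.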


To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7adfadae-d06e-4523-b860-a2d3e2cba1c4%40googlegroups.com.


Re: Bulk - insert if not exists, update otherwise?

2014-08-11 Thread Christopher Ambler
Wait, hang on - I just saw this in the docs. Are you suggesting a solution 
that's being deprecated? If so, that's likely not a good idea.

I'm now confused ;)

[image: Warning]
Deprecated in 1.3.0.

Mvel has been deprecated and will be removed in 1.4.0.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/a4d85bb9-c9ed-4470-bc54-0a28b8f36bad%40googlegroups.com.


Re: Bulk - insert if not exists, update otherwise?

2014-08-11 Thread Christopher Ambler
Excellent - are there any examples I can see? Where can I read and learn 
how to do this?

On Monday, August 11, 2014 10:19:51 AM UTC-7, eune...@gmail.com wrote:
>
> What you're describing is the upsert functionality in the mvel scripting.  
>
> The upsert will create and populate when the key doesn't exist.  
>
> And the update api will add to the document if it does already exist.  
>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c19a03d4-8729-473e-bf71-8ede73b9ddda%40googlegroups.com.


Bulk - insert if not exists, update otherwise?

2014-08-11 Thread Christopher Ambler
I have a situation where I have code that does 3000 bulk inserts of data. 
This works just fine.

What I now need to do is 3000 bulk inserts OR updates.

Specifically, if the key already exists, I need to UPDATE fields A, B and 
C. If the key does NOT exist, I need to INSERT all fields. But again, if 
I'm doing the UPDATE, it's just modifying the values of three fields and I 
want to leave the others alone.

I can easily see how I'd do this in two API calls if I were doing each item 
at a time. A HEAD on the key to see if it exists, and then either the 
INSERT or UPDATE depending on the result.

But is there a way to do this in the bulk API for efficiency? Doing it one 
at a time will be orders of magnitude slower.
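
The bulk API accepts `update` actions, and an update can carry both the fields to change and a full document to insert when the key is missing. A sketch of how such a body might be assembled, assuming placeholder index, type, and key names and using A, B, C, and D as stand-in field names:

```python
import json


def upsert_lines(index, doc_type, key, changed, full):
    """Two bulk lines: update `changed` fields, or insert `full` if missing."""
    action = {"update": {"_index": index, "_type": doc_type, "_id": key}}
    payload = {"doc": changed, "upsert": full}
    return [json.dumps(action), json.dumps(payload)]


def bulk_body(index, doc_type, records):
    """Join the lines for many records into one newline-delimited body."""
    lines = []
    for key, changed, full in records:
        lines.extend(upsert_lines(index, doc_type, key, changed, full))
    return "\n".join(lines) + "\n"


records = [
    ("key-1", {"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 3, "D": 4}),
]
body = bulk_body("my-index", "my-type", records)
```

This keeps the work to one round trip per batch, with no HEAD request per key.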

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/e86adebf-cfef-4c91-97df-9613cb386ab1%40googlegroups.com.


Re: How to transform SQL columns using JDBC River?

2014-08-05 Thread Christopher Ambler
Awesome! Getting this on my priority to-do list to try out!

On Tuesday, August 5, 2014 2:50:47 PM UTC-7, Jörg Prante wrote:
>
> Just released - stored procedures are available in JDBC plugin 1.3.0.4
>
> https://github.com/jprante/elasticsearch-river-jdbc/
>
> Jörg
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/bca95414-be9b-4e19-8038-b5eeacc8ae8f%40googlegroups.com.


Re: How to transform SQL columns using JDBC River?

2014-08-04 Thread Christopher Ambler
Well, I can surely help test it out as it becomes ready for consumption, 
given a little guidance on usage (being undocumented and all :-)). But 
yeah, mapping will be key. Specifically, I have a column coming out of the 
SP (the first column, called "domain") that will need to be mapped to the 
_id field.

Normally, I'd "SELECT domain AS _id ..." but with an SP, of course, life is 
more difficult.

On Monday, August 4, 2014 3:26:20 PM UTC-7, Jörg Prante wrote:
>
> At the moment it is an undocumented (incomplete) feature that can register 
> field names to callable statement result parameters. You hit the nail - how 
> to map result set output to field names is not done yet.
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/eac8c53b-d7a0-4b30-bc62-173a517d9f42%40googlegroups.com.


How to transform SQL columns using JDBC River?

2014-08-04 Thread Christopher Ambler
I would like to use the JDBC river to consume a queue implemented in an 
MSSQL server.

I have been given a stored procedure to call that hands me the top 100 
items in the queue. I'm given a number of columns as output, and the names 
of the columns do not correspond to the names of the fields in 
Elasticsearch.

Can anyone give me some adult supervision on how I can (or even IF I CAN) 
use the JDBC river to call this SP and properly transform the column names, 
as they're coming out of the SP's result set, into the proper field names 
in my index?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/45ff2a25-6e36-4edd-a849-b5bba0f60449%40googlegroups.com.


Re: Diagnosing a slow query

2014-07-31 Thread Christopher Ambler
Suspecting this, we tried taking things down to a single server and still 
have the exact same response.

That said, optimizing the query to get rid of some of those ORs has helped, 
so I think that's the path we're taking.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/17cdcddd-fa02-40bf-b59c-dfee613c3a84%40googlegroups.com.


Re: Diagnosing a slow query

2014-07-31 Thread Christopher Ambler
It has been suggested that what I'm seeing is a CPU-bound issue in that the 
large number of OR directives in our query could make many of these queries 
take a long time.

As I'm not an expert on crafting queries, any expert opinions?

Because I'm feeling pretty good about my configuration about now...

To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/ca810684-5fa3-4f0e-92ca-38b3fe359fd7%40googlegroups.com.


Diagnosing a slow query

2014-07-31 Thread Christopher Ambler
Okay, let's attack this directly. We have a cluster of 6 machines (6 
nodes). We have an index of just under 3.5 million documents. Each document 
represents an Internet domain name. We are performing queries against this 
index to see names that exist in our index. Most queries are coming back in 
the sub-50ms range. But a bunch are taking 600ms to 900ms and, thus, 
showing up in our slow query log. If they ALL were performing at this 
speed, I wouldn't be nearly as confused, but it looks like only about 10% 
to 20% of the queries are "slow." That's clearly too much.

Head reports that this index looks like this:

aftermarket-2014-07-31_02-38-19
size: 424Mi (2.47Gi)
docs: 3,428,471 (3,428,471)

Here is the configuration for a typical node (they're all pretty much the 
same). We have 2 machines in a dev data center, 2 machines in a mesa data 
center and 2 machines in a phx data center. Each of the two machines in a 
data center has a "node.zone" tag set, and, as you can see, I have the 
cluster routing awareness set to see "zone" as its marching orders. The 
data pipes between the data centers are beefy, and while I acknowledge that 
cross-DC isn't something that's generally smiled upon, it appears to work 
fine.

Each machine has 96G of RAM. We start ES giving it 30G for the heap size. 
File descriptors are set at 64,000. Note that I've selected the memory 
mapped file system.

#
# Server-specific settings for cluster domainiq-es
#
cluster.name: domainiq-es
node.name: "Mesa-03"
node.zone: es-mesa-prod
discovery.zen.ping.unicast.hosts: ["dev2.glbt1.gdg", "m1p1.mesa1.gdg", 
"m1p4.mesa1.gdg", "p3p3.phx3.gdg", "p3p4.phx3.gdg"]
#
# The following configuration items should be the same for all ES servers
#
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 5
index.store.type: mmapfs
index.memory.index_buffer_size: 30%
index.translog.flush_threshold_ops: 25000
index.refresh_interval: 30s
bootstrap.mlockall: true
cluster.routing.allocation.awareness.attributes: zone
gateway.recover_after_nodes: 4
gateway.recover_after_time: 2m
gateway.expected_nodes: 6
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 10s
discovery.zen.ping.retries: 3
discovery.zen.ping.interval: 15s
discovery.zen.ping.multicast.enabled: false

And here is a typical slow query:

[2014-07-31 07:35:31,530][WARN ][index.search.slowlog.query] [Mesa-03] 
[aftermarket-2014-07-31_02-38-19][2] took[707.6ms], took_millis[707], 
types[premium], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], 
source[], 
extra_source[{"size":35,"query":{"query_string":{"query":"sld:petusies^20.0 
OR tokens:(((pet^1.2 pets^1.0 *^1.0)AND(us^1.2 *^0.8)AND(ie^1.2 
*^0.6)AND(s^1.2 *^0.4)) OR((pet^1.2 pets^1.0)AND(us^1.2)AND(ie^1.2))^3.0) 
AND tld:(com^1.001 OR in^0.99 OR co.in^0.941174367459617 OR 
net.in^0.8848832474555992 OR us^0.85 OR org.in^0.8397882862729736 OR 
gen.in^0.785829669672289 OR firm.in^0.7414549824163524 OR ind.in^0.7 OR 
org^0.6) OR 
_id:petusi.es^5.0-domaintype:partner","lowercase_expanded_terms":true,"analyze_wildcard":false}}}],
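
For anyone who wants to tally these entries rather than eyeball them, here is a 
minimal sketch of pulling the node, index, shard and timing out of a slow log 
entry shaped like the one above. The regex and the parse_slowlog name are my own 
assumptions based only on this sample, not an official log format definition, so 
other Elasticsearch versions may need adjustments.

```python
import re

# Pattern derived from the single sample entry above (an assumption, not a spec).
SLOWLOG_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]"         # [2014-07-31 07:35:31,530]
    r"\[(?P<level>\w+)\s*\]"             # [WARN ]
    r"\[(?P<logger>[^\]]+)\]\s*"         # [index.search.slowlog.query]
    r"\[(?P<node>[^\]]+)\]\s*"           # [Mesa-03]
    r"\[(?P<index>[^\]]+)\]"             # [aftermarket-2014-07-31_02-38-19]
    r"\[(?P<shard>\d+)\]\s*"             # [2]
    r"took\[(?P<took>[^\]]+)\],\s*"      # took[707.6ms],
    r"took_millis\[(?P<took_millis>\d+)\]"
)


def parse_slowlog(entry):
    """Return a dict of fields from one slow log entry, or None if it does not match."""
    # Mail clients wrap the entry across lines, so join the pieces first.
    joined = " ".join(part.strip() for part in entry.splitlines())
    match = SLOWLOG_RE.search(joined)
    if match is None:
        return None
    fields = match.groupdict()
    fields["shard"] = int(fields["shard"])
    fields["took_millis"] = int(fields["took_millis"])
    return fields


sample = """[2014-07-31 07:35:31,530][WARN ][index.search.slowlog.query] [Mesa-03]
[aftermarket-2014-07-31_02-38-19][2] took[707.6ms], took_millis[707],
types[premium], stats[], search_type[QUERY_THEN_FETCH], total_shards[5],"""

print(parse_slowlog(sample))
```

Grouping the parsed entries by node and shard would show whether the slow 
queries cluster on one machine or one shard.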
 


So note that I create 5 shards and 5 replicas, so that each node has all 5 
shards at all times. I THOUGHT THIS MEANT BETTER PERFORMANCE. That is, I 
thought having all 5 shards on every node meant that a query to a node 
didn't have to ask another node for data. IS THIS NOT TRUE?
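
The arithmetic behind that layout, as a small sketch. It relies on the rule that 
two copies of the same shard are never placed on the same node; shard_layout is 
a made-up helper name. It only confirms where the copies live. Whether the node 
that receives a query actually prefers its local copies is a separate question 
that this does not answer.

```python
def shard_layout(shards, replicas, nodes):
    """Return (total_copies, every_node_has_all_shards) for one index."""
    copies_per_shard = 1 + replicas            # one primary plus its replicas
    total_copies = shards * copies_per_shard
    # Copies of the same shard never share a node, so every node holds a
    # full set only when there is exactly one copy of each shard per node.
    every_node_has_all_shards = copies_per_shard == nodes
    return total_copies, every_node_has_all_shards


# The configuration above: 5 shards, 5 replicas, 6 nodes.
print(shard_layout(shards=5, replicas=5, nodes=6))
# The 2-shard experiment: 2 shards, 5 replicas, 6 nodes.
print(shard_layout(shards=2, replicas=5, nodes=6))
```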

Here's where it also gets interesting: I tried setting the number of shards 
to 2 (with 5 replicas) and my slow queries went to almost 2 seconds 
(2000ms). This is also terribly counter-intuitive! I thought fewer shards 
meant less lookup time.

Clearly, I want to optimize for read here. I don't care if indexing is 
three times as slow, we need our queries to be sub-100ms.

Any help is SERIOUSLY appreciated (and if you're in the Bay Area, I'm not 
above bribes of beer :-))



Re: Configuration Brain Wobbles

2014-07-30 Thread Christopher Ambler
Yes, I'm aware of the multi-DC issue ;)

So yeah, this morning I dove into index shard allocation, and did just that.

3 zones (using node.zone as my tag), and then I set things to have 1 primary 
and 2 replicas, with awareness set so that each zone gets one copy of 
every shard.

This worked as advertised. One zone gets the primary and the other two 
zones get the two replicas. Each zone has 2 nodes and I have 5 shards, 
so one node gets two and the other gets three.
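
The two-and-three split is just 5 shard copies divided over 2 nodes. A tiny 
sketch of that arithmetic, with spread as a hypothetical helper name and the 
assumption that the copies in a zone are balanced as evenly as possible:

```python
def spread(shard_copies, nodes):
    """Split shard copies across nodes as evenly as possible."""
    base, extra = divmod(shard_copies, nodes)
    # The first `extra` nodes each take one copy more than the rest.
    return [base + 1 if i < extra else base for i in range(nodes)]


# One zone: 5 shard copies over its 2 nodes.
print(spread(5, 2))  # [3, 2]
```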

Head shows me this and it all makes sense.

We're now doing about 30 searches per second, and I'm still seeing, about 
every 5 or 6 seconds, a single "slow query" in the 600ms to 900ms range. 
All other queries are sub-50ms.
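
To put a number on that, here is a back-of-the-envelope sketch that turns the 
figures above into a fraction of slow queries. The slow_fraction helper is 
hypothetical, and it assumes the search rate stays steady at about 30 per 
second.

```python
def slow_fraction(searches_per_second, seconds_between_slow):
    """Fraction of queries that are slow if one slow query occurs per interval."""
    queries_per_interval = searches_per_second * seconds_between_slow
    return 1.0 / queries_per_interval


for gap in (5, 6):
    # One slow query out of every 30 * gap searches.
    print("one slow query every %d s: %.2f%% of searches"
          % (gap, 100 * slow_fraction(30, gap)))
```

That works out to roughly one slow query in every 150 to 180 searches.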

I need to find out why I'm seeing these consistent 600ms+ queries and 
eliminate them if I can.

Everything else looks good.

Oh, and on the unallocated shards issue, I tracked that down to having had 
more replicas when I built the index and then taking my replicas down. I 
had shards that had nowhere to go. So I just removed those indexes 
(they were old) and everything is green.
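
A rough sketch of how copies can end up with nowhere to go. It assumes the only 
constraint is that two copies of the same shard never share a node, and it 
ignores awareness and other allocation rules. The unassigned_copies helper and 
the replica count of 6 are made up for illustration; they are not the values my 
old indexes actually had.

```python
def unassigned_copies(shards, replicas, nodes):
    """Count shard copies that cannot be placed on any node."""
    copies_per_shard = 1 + replicas
    # Each node can hold at most one copy of a given shard, so any copies
    # beyond the node count have no node to live on.
    homeless_per_shard = max(0, copies_per_shard - nodes)
    return shards * homeless_per_shard


# Hypothetical: 5 shards with 6 replicas on a 6-node cluster.
print(unassigned_copies(shards=5, replicas=6, nodes=6))  # 5
# 5 replicas on 6 nodes fits exactly.
print(unassigned_copies(shards=5, replicas=5, nodes=6))  # 0
```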

So the consistent 600ms slow queries are my only issue now.



Configuration Brain Wobbles

2014-07-30 Thread Christopher Ambler
I have a cluster with six nodes. The nodes are in different data centers, 
but I don't think that matters, as the connectivity is beefy and thick. I 
have turned multicast off and unicast on. Each node knows about all the 
others explicitly. When I bring up a visualization of the cluster using the 
"head" plugin, I see them all. This appears to work as it should. My 
cluster looks like this:

DEV-02 (development data center)
MESA-01 (mesa data center)
MESA-02 (mesa data center)
MESA-03 (mesa data center)
BUCK-01 (buck data center)
BUCK-02 (buck data center)

I have each node configured for 5 shards.
I have each node set to be master true and data true.

I do ALL of my document addition using MESA-01 and I can do queries on ANY 
node and get a result, so that's working. But I notice two things and have 
one requirement I can't figure out:

1. Most queries come in sub-30ms. But every now and again I get a query 
that is longer. I set my slow query log to complain over 100ms and I see 
that maybe one query out of 15 or so takes 800ms to 1200ms. This is on any 
node.

2. I have unassigned shards. I presume this is bad, yes? How do I get them 
to allocate? When I stop and start the service on any of the nodes, the 
shards are shuffled around, but rarely are the unassigned shards put on a 
node. Why? How do I resolve this?

And my requirement - Is there a way to say, "Look, Elasticsearch, I don't 
want you shuffling shards around here and there, I'd like EVERY node to 
have a COMPLETE replica of the data, and you just keep it up to date. That 
way, you see, a query on a buck data center node won't have to ask a mesa 
data center for a document if it doesn't have it."

Solving #2 is important, but solving my requirement is somewhat critical. I 
think fixing these two things will take care of issue #1.

At least it'll get me configured right so if #1 is still there, I can 
diagnose from a position of not wondering if misconfiguration is my problem.

Help?
